Cassandra OOMs on restbase200{4,5}.codfw.wmnet
Closed, Resolved (Public)

Description

Shortly after 13:00 on 2017-01-31, two Cassandra instances OOM'd: restbase2004-b and restbase2005-b.

$ cdsh -d codfw -- "sudo find /srv/cassandra-* -maxdepth 1 -name '*.hprof'"
restbase2004.codfw.wmnet: /srv/cassandra-b/java_pid16503.hprof
restbase2005.codfw.wmnet: /srv/cassandra-b/java_pid27082.hprof
$

It's reasonable to assume this is no different from the other recent events (see T153588 and T156155), so it probably doesn't warrant further investigation; this ticket is primarily intended to document and acknowledge the occurrence. I will leave this issue open (and the heap dumps in place) for a few days in case anyone else has questions or some spare cycles become available.

NOTE: Icinga also registered an alert for 2001-a at ~10:00 UTC, but that was an administrative shutdown (due to ongoing OpenJDK upgrades).
ACTION: These heap dumps should be cleaned up before closing this issue
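
A minimal sketch of that cleanup, reusing the `find` expression from the listing above. It assumes GNU find (for `-delete`) and is meant to be run on each affected host against its instance directory; the `cleanup_hprof` function name and its list-by-default behaviour are illustrative, not an existing tool.

```shell
#!/bin/sh
# Hypothetical helper: list (default) or delete heap dumps that sit directly
# under one Cassandra instance directory, e.g. /srv/cassandra-b.
# Usage: cleanup_hprof DIR [delete]
cleanup_hprof() {
    dir="$1"
    mode="${2:-list}"
    if [ "$mode" = "delete" ]; then
        # Print each file as it is removed so the action is logged.
        find "$dir" -maxdepth 1 -type f -name '*.hprof' -print -delete
    else
        # Dry run: only show what would be removed.
        find "$dir" -maxdepth 1 -type f -name '*.hprof' -print
    fi
}
```

Calling `cleanup_hprof /srv/cassandra-b` first shows what would be removed; adding `delete` as the second argument removes the files. Elevated privileges may be needed, as with the `sudo find` above.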

Event Timeline

Eevans edited projects, added Services (done); removed Services (doing).
Eevans updated the task description.