Occasional but recurring Cassandra OutOfMemory exceptions continue, a result of the issues discussed in T144431: RESTBase k-r-v as Cassandra anti-pattern. With updates now happening in codfw, the OOMs have been isolated there, where their impact is not felt on client reads, but we should continue to document them. Rather than opening a new Phabricator issue each time, let's use this single issue to keep a running log of them.
OutOfMemory exceptions
Time | Instance | Heapdump | Comments |
---|---|---|---|
2017-03-16T20:44:14 | restbase2001-c | | Restarted by Puppet @ ~2017-03-16T21:08:14 |
2017-03-24T12:49:59 | restbase2001-a | | Restarted by Puppet; can't recover: org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log |
2017-03-24T12:50:33 | restbase2009-b | | Restarted by Puppet |
2017-03-27T07:47:26 | restbase2012-b | | Restarted by Puppet @ 2017-03-27T08:13:36 |
2017-03-30T15:47:15 | restbase2004-a | | Manually restarted; back up @ ~2017-03-30T15:51:15 |
2017-03-30T15:33:45 | restbase2010-c | | Manually restarted (3 times); back up @ ~2017-03-30T15:56:55 |
2017-04-01T01:41:35 | restbase2004-b | | Restarted @ ~2017-04-01T02:02:35 |
2017-04-02T01:42:25 | restbase2005-c | | Restarted @ 2017-04-02T01:43:25 |
2017-04-02T03:28:25 | restbase2001-a | | 5 events total; Resolved @ ~2017-04-02T05:37:35 |
2017-04-02T03:38:25 | restbase2009-a | | 4 events total; Resolved @ ~2017-04-02T05:34:25 |
2017-04-11T06:42:58 | restbase2004-a | | Resolved @ ~2017-04-11T06:58:58 by @MoritzMuehlenhoff |
2017-04-12T11:42:28 | restbase2007-c | | Resolved @ ~2017-04-12T11:51:28 by @elukey |
2017-04-16T18:56:43 | restbase2007-c | ??? | Resolved @ 19:12:43 |
2017-04-17T04:41:54 | restbase2004-b | | 8 OOMs total from 04:41:54 to 08:55:54; Resolved @ 09:31:54 |
2017-04-17T04:39:04 | restbase2009-c | | 8 OOMs total from 04:39:04 to 09:11:04; Resolved @ 09:34:04 |
2017-04-19T11:29:17 | restbase2010-b | /srv/cassandra-b/java_pid47705.hprof | See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-restbase |
2017-04-19T11:30:17 | restbase2005-c | /srv/cassandra-c/java_pid31308.hprof | See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-restbase |
2017-04-20T05:15:36 | restbase1016-a | | Puppet restarted; Resolved @ 05:38:36 |
2017-04-29T13:36:00 | restbase1009-a | | @elukey restarted, 2017-04-30T07:46:00 (corrupt commitlog segment prevented Puppet restart) |
2017-04-29T13:40:00 | restbase1013-a | | Puppet restarted; Resolved @ 14:06:00 |
2017-04-30T13:10:50 | restbase1009-a | | 8 events total; Resolved @ (after @elukey lowered tombstone_threshold) |
2017-04-30T22:13:00 | restbase1015-c | | Puppet restarted; Resolved @ 22:39:00 |
2017-05-03T18:48:08 | restbase1014-c | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T00:49:20 | restbase1015-a | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T00:49:40 | restbase1007-b | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T00:46:40 | restbase1012-a | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T01:19:00 | restbase1013-b | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T01:19:00 | restbase1014-b | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T01:49:30 | restbase1008-b | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T01:50:30 | restbase1015-c | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T01:57:30 | restbase1011-a | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T01:59:30 | restbase1008-a | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-04T02:04:20 | restbase1016-c | | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
2017-05-25T20:30:56 | restbase2006-b | | Restarted by Puppet, (twice, up @ 21:06:56) |
2017-06-10T23:15:00 | restbase2006-c | /srv/cassandra-a/java_pid2377.hprof | Restarted by Puppet @ 2017-06-10T23:37:00 |
2017-07-09T07:47:00 | restbase2007-a | /srv/cassandra-a/java_pid2679.hprof | Restarted by Puppet @ 2017-07-09T08:05:00 |
2017-07-09T10:18:00 | restbase2012-c | /srv/cassandra-c/java_pid2350.hprof | Restarted by Puppet @ 2017-07-09T10:25:00 |
2017-07-13T16:16:00 | restbase2007-a | /srv/cassandra-a/java_pid4672.hprof | Restarted by @Eevans @ 2017-07-13T16:19:00 |
Mitigation
When repeated OOM exceptions occur, it may be possible to mitigate them by temporarily lowering the tombstone_failure_threshold value. The tombstone_threshold_failure.sh script (untested in production) should do this. Run it on each host with an OOMing instance to lower the threshold to 1000 tombstones; an alternative threshold can be passed as an argument to the script. To restore the default threshold later, use:
$ ./tombstone_threshold_failure.sh `uyaml /etc/cassandra-a/cassandra.yaml /tombstone_failure_threshold`
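For illustration only, here is a minimal sketch of the pieces such a script needs; it is not the script referenced above. It assumes that the threshold is exposed at runtime as the TombstoneFailureThreshold attribute of the org.apache.cassandra.db:type=StorageService MBean and that a jmxterm-style `set` command is used to change it; both should be verified against the deployed Cassandra version. The function names are hypothetical, and plain awk stands in for uyaml when reading the configured default.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the MBean/attribute names and
# the jmxterm "set" syntax are assumptions to verify before use.

# Threshold suggested in the mitigation notes.
DEFAULT_THRESHOLD=1000

# Print the tombstone_failure_threshold configured in a cassandra.yaml file.
# Only handles a simple top-level "key: value" line (no trailing comments).
configured_threshold() {
    awk -F': *' '$1 == "tombstone_failure_threshold" { print $2; exit }' "$1"
}

# Print the threshold to apply: the first argument if given, else the default.
# Rejects anything that is not a non-negative integer.
requested_threshold() {
    _value="${1:-$DEFAULT_THRESHOLD}"
    case "$_value" in
        ''|*[!0-9]*)
            echo "threshold must be a non-negative integer: $_value" >&2
            return 1
            ;;
    esac
    echo "$_value"
}

# Print (do not run) the jmxterm command that would apply the given value.
# Assumed: StorageService MBean exposes a writable TombstoneFailureThreshold.
jmxterm_set_command() {
    echo "set -b org.apache.cassandra.db:type=StorageService TombstoneFailureThreshold $1"
}
```

The sketch only prints the command rather than applying it; its output could be fed to a jmxterm session connected to the JMX port of each affected instance. Restoring the default would then amount to passing the output of `configured_threshold /etc/cassandra-a/cassandra.yaml` as the value.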