Occasional, recurring Cassandra `OutOfMemory` exceptions continue as a result of the issues discussed in {T144431}. With updates now happening in codfw, the OOMs have been isolated there, where they do not affect client reads, but we should continue to document them. Rather than opening a new Phabricator issue each time, let's use this single issue to keep a running log of them.
== `OutOfMemory` exceptions ==
| Time | Instance | Heapdump | Comments |
|-------|-------|------|------|
| 2017-03-16T20:44:14 | restbase2001-c | ~~/srv/cassandra-c/java_pid6856.hprof~~ | Restarted by Puppet @ ~2017-03-16T21:08:14 |
| 2017-03-24T12:49:59 | restbase2001-a | ~~/srv/cassandra-a/java_pid3678.hprof~~ | Restarted by Puppet, can't recover: `org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log` |
| 2017-03-24T12:50:33 | restbase2009-b | ~~/srv/cassandra-b/java_pid2467.hprof~~ | Restarted by Puppet |
| 2017-03-27T07:47:26 | restbase2012-b | ~~/srv/cassandra-b/java_pid33443.hprof~~ | Restarted by Puppet @ 2017-03-27T08:13:36 |
| 2017-03-30T15:47:15 | restbase2004-a | ~~/srv/cassandra-a/java_pid28335.hprof~~ | Manually restarted, back up @ ~2017-03-30T15:51:15 |
| 2017-03-30T15:33:45 | restbase2010-c | ~~/srv/cassandra-c/java_pid52532.hprof /srv/cassandra-c/java_pid67967.hprof /srv/cassandra-c/java_pid75083.hprof~~ | Manually restarted (3 times); Back up @ ~2017-03-30T15:56:55 |
| 2017-04-01T01:41:35 | restbase2004-b | ~~/srv/cassandra-b/java_pid814.hprof~~ | Restarted @ ~2017-04-01T02:02:35 |
| 2017-04-02T01:42:25 | restbase2005-c | ~~/srv/cassandra-c/java_pid10559.hprof~~ | Restarted @ 2017-04-02T01:43:25 |
| 2017-04-02T03:28:25 | restbase2001-a | ~~/srv/cassandra-a/java_pid5021.hprof /srv/cassandra-a/java_pid2347.hprof /srv/cassandra-a/java_pid26573.hprof /srv/cassandra-a/java_pid28144.hprof /srv/cassandra-a/java_pid17332.hprof~~ | 5 events total; Resolved @ ~2017-04-02T05:37:35 |
| 2017-04-02T03:38:25 | restbase2009-a | ~~/srv/cassandra-a/java_pid24320.hprof /srv/cassandra-a/java_pid12720.hprof /srv/cassandra-a/java_pid6210.hprof /srv/cassandra-a/java_pid2131.hprof~~ | 4 events total; Resolved @ ~2017-04-02T05:34:25 |
| 2017-04-11T06:42:58 | restbase2004-a | ~~/srv/cassandra-a/java_pid14987.hprof~~ | Resolved @ ~2017-04-11T06:58:58 by @MoritzMuehlenhoff |
| 2017-04-12T11:42:28 | restbase2007-c | ~~/srv/cassandra-c/java_pid26332.hprof~~ | Resolved @ ~2017-04-12T11:51:28 by @elukey |
| 2017-04-16T18:56:43 | restbase2007-c | ??? | Resolved @ 19:12:43 |
| 2017-04-17T04:41:54 | restbase2004-b | ~~/srv/cassandra-b/java_pid13433.hprof /srv/cassandra-b/java_pid14702.hprof /srv/cassandra-b/java_pid14780.hprof /srv/cassandra-b/java_pid19846.hprof /srv/cassandra-b/java_pid20876.hprof /srv/cassandra-b/java_pid28379.hprof /srv/cassandra-b/java_pid2963.hprof /srv/cassandra-b/java_pid3036.hprof~~ | 8 OOMs total from 04:41:54 to 08:55:54; Resolved @ 09:31:54 |
| 2017-04-17T04:39:04 | restbase2009-c | ~~/srv/cassandra-c/java_pid10730.hprof /srv/cassandra-c/java_pid13145.hprof /srv/cassandra-c/java_pid15859.hprof /srv/cassandra-c/java_pid19221.hprof /srv/cassandra-c/java_pid20321.hprof /srv/cassandra-c/java_pid2322.hprof /srv/cassandra-c/java_pid2485.hprof /srv/cassandra-c/java_pid26821.hprof~~ | 8 OOMs total from 04:39:04 to 09:11:04; Resolved @ 09:34:04 |
| 2017-04-19T11:29:17 | restbase2010-b | /srv/cassandra-b/java_pid47705.hprof | See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-restbase |
| 2017-04-19T11:30:17 | restbase2005-c | /srv/cassandra-c/java_pid31308.hprof | See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-restbase |
| 2017-04-20T05:15:36 | restbase1016-a | ~~/srv/cassandra-a/java_pid2439.hprof~~ | Puppet restarted; Resolved @ 05:38:36 |
| 2017-04-29T13:36:00 | restbase1009-a | ~~/srv/cassandra-a/java_pid41816.hprof~~ | @elukey restarted, 2017-04-30T07:46:00 (corrupt commitlog segment prevented Puppet restart) |
| 2017-04-29T13:40:00 | restbase1013-a | ~~/srv/cassandra-a/java_pid24895.hprof~~ | Puppet restarted; Resolved @ 14:06:00 |
| 2017-04-30T13:10:50 | restbase1009-a | ~~/srv/cassandra-a/java_pid10476.hprof /srv/cassandra-a/java_pid127771.hprof /srv/cassandra-a/java_pid138155.hprof /srv/cassandra-a/java_pid1605.hprof /srv/cassandra-a/java_pid16067.hprof /srv/cassandra-a/java_pid30333.hprof /srv/cassandra-a/java_pid45499.hprof /srv/cassandra-a/java_pid55310.hprof~~ | 8 events total; Resolved @ (after @elukey lowered `tombstone_threshold`) |
| 2017-04-30T22:13:00 | restbase1015-c | ~~/srv/cassandra-c/java_pid18068.hprof~~ | Puppet restarted; Resolved @ 22:39:00 |
----
== Mitigation ==
When repeated OOM exceptions occur, it may be possible to mitigate them by temporarily lowering the `tombstone_failure_threshold` value. The script in {P5165} (untested in production) should do this: run it on each host with an OOMing instance to lower the threshold to 1000 tombstones, or pass an alternative threshold as an argument. To restore the default threshold later, pass the script the value configured in `cassandra.yaml`:
```
$ ./tombstone_threshold_failure.sh `uyaml /etc/cassandra-a/cassandra.yaml /tombstone_failure_threshold`
```
{P5165}
NOTE: Tombstone warnings in the logs preceding the OOM may suggest a better threshold value.
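As a rough aid for picking that value, the sketch below reads log lines on stdin and prints the largest tombstone count reported in a tombstone warning. It is only a sketch under assumptions: `max_tombstones` is a hypothetical helper (not part of {P5165}), and it assumes the warnings mention `tombstone_warn_threshold` and contain text of the form `<N> tombstone cells` (or `tombstoned cells`); the exact wording varies between Cassandra versions, so check it against the actual log lines first.

```shell
#!/bin/sh
# Hypothetical helper: print the largest tombstone count found in
# Cassandra tombstone warnings read from stdin.
# Assumption: warning lines mention "tombstone_warn_threshold" and contain
# "<N> tombstone cells" or "<N> tombstoned cells".
max_tombstones() {
    grep 'tombstone_warn_threshold' \
        | sed -n 's/.* \([0-9][0-9]*\) tombstoned* cells.*/\1/p' \
        | sort -n \
        | tail -n 1
}
```

Usage would be along the lines of `max_tombstones < system.log`, where `system.log` stands in for the log file of the affected instance. If no matching warnings are found, nothing is printed.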