
Cassandra OOMs
Closed, Resolved (Public)

Description

Occasional, recurring Cassandra OutOfMemory exceptions continue as a result of the issues discussed in T144431: RESTBase k-r-v as Cassandra anti-pattern. With updates now happening in codfw, the OOMs have been isolated there, where their impact is not felt on client reads, but we should continue to document them. Rather than opening a new Phabricator issue each time, let's use this single issue to keep a running log of them.

OutOfMemory exceptions

| Time | Instance | Heapdump | Comments |
| --- | --- | --- | --- |
| 2017-03-16T20:44:14 | restbase2001-c | /srv/cassandra-c/java_pid6856.hprof | Restarted by Puppet @ ~2017-03-16T21:08:14 |
| 2017-03-24T12:49:59 | restbase2001-a | /srv/cassandra-a/java_pid3678.hprof | Restarted by puppet, can't recover org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log |
| 2017-03-24T12:50:33 | restbase2009-b | /srv/cassandra-b/java_pid2467.hprof | Restarted by puppet |
| 2017-03-27T07:47:26 | restbase2012-b | /srv/cassandra-b/java_pid33443.hprof | Restarted by Puppet @ 2017-03-27T08:13:36 |
| 2017-03-30T15:47:15 | restbase2004-a | /srv/cassandra-a/java_pid28335.hprof | Manually restarted, back up @ ~2017-03-30T15:51:15 |
| 2017-03-30T15:33:45 | restbase2010-c | /srv/cassandra-c/java_pid52532.hprof /srv/cassandra-c/java_pid67967.hprof /srv/cassandra-c/java_pid75083.hprof | Manually restarted (3 times); Back up @ ~2017-03-30T15:56:55 |
| 2017-04-01T01:41:35 | restbase2004-b | /srv/cassandra-b/java_pid814.hprof | Restarted @ ~2017-04-01T02:02:35 |
| 2017-04-02T01:42:25 | restbase2005-c | /srv/cassandra-c/java_pid10559.hprof | Restarted @ 2017-04-02T01:43:25 |
| 2017-04-02T03:28:25 | restbase2001-a | /srv/cassandra-a/java_pid5021.hprof /srv/cassandra-a/java_pid2347.hprof /srv/cassandra-a/java_pid26573.hprof /srv/cassandra-a/java_pid28144.hprof /srv/cassandra-a/java_pid17332.hprof | 5 events total; Resolved @ ~2017-04-02T05:37:35 |
| 2017-04-02T03:38:25 | restbase2009-a | /srv/cassandra-a/java_pid24320.hprof /srv/cassandra-a/java_pid12720.hprof /srv/cassandra-a/java_pid6210.hprof /srv/cassandra-a/java_pid2131.hprof | 4 events total; Resolved @ ~2017-04-02T05:34:25 |
| 2017-04-11T06:42:58 | restbase2004-a | /srv/cassandra-a/java_pid14987.hprof | Resolved @ ~2017-04-11T06:58:58 by @MoritzMuehlenhoff |
| 2017-04-12T11:42:28 | restbase2007-c | /srv/cassandra-c/java_pid26332.hprof | Resolved @ ~2017-04-12T11:51:28 by @elukey |
| 2017-04-16T18:56:43 | restbase2007-c | ??? | Resolved @ 19:12:43 |
| 2017-04-17T04:41:54 | restbase2004-b | /srv/cassandra-b/java_pid13433.hprof /srv/cassandra-b/java_pid14702.hprof /srv/cassandra-b/java_pid14780.hprof /srv/cassandra-b/java_pid19846.hprof /srv/cassandra-b/java_pid20876.hprof /srv/cassandra-b/java_pid28379.hprof /srv/cassandra-b/java_pid2963.hprof /srv/cassandra-b/java_pid3036.hprof | 8 OOMs total from 04:41:54 to 08:55:54; Resolved @ 09:31:54 |
| 2017-04-17T04:39:04 | restbase2009-c | /srv/cassandra-c/java_pid10730.hprof /srv/cassandra-c/java_pid13145.hprof /srv/cassandra-c/java_pid15859.hprof /srv/cassandra-c/java_pid19221.hprof /srv/cassandra-c/java_pid20321.hprof /srv/cassandra-c/java_pid2322.hprof /srv/cassandra-c/java_pid2485.hprof /srv/cassandra-c/java_pid26821.hprof | 8 OOMs total from 04:39:04 to 09:11:04; Resolved @ 09:34:04 |
| 2017-04-19T11:29:17 | restbase2010-b | /srv/cassandra-b/java_pid47705.hprof | See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-restbase |
| 2017-04-19T11:30:17 | restbase2005-c | /srv/cassandra-c/java_pid31308.hprof | See: https://wikitech.wikimedia.org/wiki/Incident_documentation/20170419-restbase |
| 2017-04-20T05:15:36 | restbase1016-a | /srv/cassandra-a/java_pid2439.hprof | Puppet restarted; Resolved @ 05:38:36 |
| 2017-04-29T13:36:00 | restbase1009-a | /srv/cassandra-a/java_pid41816.hprof | @elukey restarted, 2017-04-30T07:46:00 (corrupt commitlog segment prevented Puppet restart) |
| 2017-04-29T13:40:00 | restbase1013-a | /srv/cassandra-a/java_pid24895.hprof | Puppet restarted; Resolved @ 14:06:00 |
| 2017-04-30T13:10:50 | restbase1009-a | /srv/cassandra-a/java_pid10476.hprof /srv/cassandra-a/java_pid127771.hprof /srv/cassandra-a/java_pid138155.hprof /srv/cassandra-a/java_pid1605.hprof /srv/cassandra-a/java_pid16067.hprof /srv/cassandra-a/java_pid30333.hprof /srv/cassandra-a/java_pid45499.hprof /srv/cassandra-a/java_pid55310.hprof | 8 events total; Resolved @ (after @elukey lowered tombstone_threshold) |
| 2017-04-30T22:13:00 | restbase1015-c | /srv/cassandra-c/java_pid18068.hprof | Puppet restarted; Resolved @ 22:39:00 |
| 2017-05-03T18:48:08 | restbase1014-c | /srv/cassandra-c/java_pid24844.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T00:49:20 | restbase1015-a | /srv/cassandra-a/java_pid22063.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T00:49:40 | restbase1007-b | /srv/cassandra-b/java_pid2367.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T00:46:40 | restbase1012-a | /srv/cassandra-a/java_pid10785.hprof /srv/cassandra-a/java_pid13772.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T01:19:00 | restbase1013-b | /srv/cassandra-b/java_pid22815.hprof /srv/cassandra-b/java_pid29447.hprof /srv/cassandra-b/java_pid31034.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T01:19:00 | restbase1014-b | /srv/cassandra-b/java_pid20462.hprof /srv/cassandra-b/java_pid26814.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T01:49:30 | restbase1008-b | /srv/cassandra-b/java_pid120757.hprof /srv/cassandra-b/java_pid26084.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T01:50:30 | restbase1015-c | /srv/cassandra-c/java_pid10336.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T01:57:30 | restbase1011-a | /srv/cassandra-a/java_pid17616.hprof /srv/cassandra-a/java_pid2366.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T01:59:30 | restbase1008-a | /srv/cassandra-a/java_pid116404.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-04T02:04:20 | restbase1016-c | /srv/cassandra-c/java_pid2423.hprof /srv/cassandra-c/java_pid30887.hprof /srv/cassandra-c/java_pid40078.hprof | Mitigated w/ tombstone_failure_threshold and blacklisting (@mobrovac & @Eevans) |
| 2017-05-25T20:30:56 | restbase2006-b | /srv/cassandra-b/java_pid2434.hprof /srv/cassandra-b/java_pid9933.hprof | Restarted by Puppet, (twice, up @ 21:06:56) |
| 2017-06-10T23:15:00 | restbase2006-c | /srv/cassandra-a/java_pid2377.hprof | Restarted by Puppet @ 2017-06-10T23:37:00 |
| 2017-07-09T07:47:00 | restbase2007-a | /srv/cassandra-a/java_pid2679.hprof | Restarted by Puppet @ 2017-07-09T08:05:00 |
| 2017-07-09T10:18:00 | restbase2012-c | /srv/cassandra-c/java_pid2350.hprof | Restarted by Puppet @ 2017-07-09T10:25:00 |
| 2017-07-13T16:16:00 | restbase2007-a | /srv/cassandra-a/java_pid4672.hprof | Restarted by @Eevans @ 2017-07-13T16:19:00 |

Mitigation

When repeated OOM exceptions occur, it may be possible to mitigate them by temporarily lowering the tombstone_failure_threshold value. The script below (untested in production) should do this. Run it on each host with an OOMing instance to lower the threshold to 1000 tombstones; an alternative threshold can be passed as the first argument. To restore the configured threshold later, pass the value from cassandra.yaml:

$ ./tombstone_threshold_failure.sh `uyaml /etc/cassandra-a/cassandra.yaml /tombstone_failure_threshold`

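#!/bin/bash
# Temporarily set tombstone_failure_threshold on every local Cassandra
# instance via JMX. Usage: tombstone_threshold_failure.sh [THRESHOLD]

set -e

# Threshold to apply; defaults to 1000 tombstones.
THRESHOLD=${1:-1000}

SJK=/usr/bin/sjk
MBEAN="org.apache.cassandra.db:type=StorageService"
ATTRIBUTE="TombstoneFailureThreshold"

# Print the instance name from an instance config file.
name()
{
    uyaml "$1" /name
}

# Print the JMX port from an instance config file.
jmx_port()
{
    uyaml "$1" /jmx_port
}

# One YAML file per Cassandra instance on this host.
for i in /etc/cassandra-instances.d/*.yaml; do
    # Skip the unexpanded pattern when no instance files exist.
    [ -e "$i" ] || continue
    echo "Setting tombstone_failure_threshold=$THRESHOLD on instance $(name "$i")"
    "$SJK" mx -s "localhost:$(jmx_port "$i")" -b "$MBEAN" -ms -f "$ATTRIBUTE" -v "$THRESHOLD" >/dev/null
done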
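The restore command above reads the configured value with uyaml. Where that helper is not available, a minimal sketch like the following could extract the same value, assuming tombstone_failure_threshold appears as a plain top-level "key: value" line in cassandra.yaml. The function name configured_threshold is hypothetical, and this is not a general YAML parser.

```shell
#!/bin/bash
# Hypothetical helper: print the tombstone_failure_threshold configured in a
# cassandra.yaml file. Assumes a plain, unindented "key: value" line.
configured_threshold() {
    sed -n -E 's/^tombstone_failure_threshold:[[:space:]]*([0-9]+).*$/\1/p' "$1"
}

# Example (path taken from the restore command in this task):
#   configured_threshold /etc/cassandra-a/cassandra.yaml
```

The printed value could then be passed as the first argument to the script to put the threshold back.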

NOTE: Tombstone warnings in the logs (preceding the OOM) might inform a better threshold value

Event Timeline

Eevans edited projects, added Services (doing); removed Services.
Eevans updated the task description.

Deleted /srv/cassandra-c/java_pid6856.hprof on restbase2001 to reclaim space.

fgiunchedi subscribed.

This happened again today on restbase2001-a, though Cassandra fails at startup with:

INFO  [main] 2017-03-24 13:06:01,929 CommitLog.java:168 - Replaying /srv/cassandra-a/commitlog/CommitLog-5-1489701224337.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224338.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224340.lo
g, /srv/cassandra-a/commitlog/CommitLog-5-1489701224345.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224346.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224347.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224348.log, /srv
/cassandra-a/commitlog/CommitLog-5-1489701224349.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224350.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224353.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224357.log, /srv/cassan
dra-a/commitlog/CommitLog-5-1489701224358.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224361.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224362.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224364.log, /srv/cassandra-a/c
ommitlog/CommitLog-5-1489701224365.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224366.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224367.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224368.log, /srv/cassandra-a/commitlo
g/CommitLog-5-1489701224369.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224370.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224371.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224372.log, /srv/cassandra-a/commitlog/Commi
tLog-5-1489701224373.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224374.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224376.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224377.log, /srv/cassandra-a/commitlog/CommitLog-5-
1489701224378.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224379.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224380.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224381.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701
224382.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224383.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224384.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224385.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224386.
log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224387.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224388.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224389.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224390.log, /s
rv/cassandra-a/commitlog/CommitLog-5-1489701224391.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224392.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224393.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224394.log, /srv/cass
andra-a/commitlog/CommitLog-5-1489701224395.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224396.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224397.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224398.log, /srv/cassandra-a
/commitlog/CommitLog-5-1489701224399.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224400.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224401.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224402.log, /srv/cassandra-a/commit
log/CommitLog-5-1489701224403.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224404.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224405.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224406.log, /srv/cassandra-a/commitlog/Com
mitLog-5-1489701224407.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224408.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224409.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224410.log, /srv/cassandra-a/commitlog/CommitLog-
5-1489701224411.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224412.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224413.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224414.log, /srv/cassandra-a/commitlog/CommitLog-5-14897
01224415.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224416.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224417.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224418.log, /srv/cassandra-a/commitlog/CommitLog-5-148970122441
9.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224420.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224421.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224422.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224423.log, 
/srv/cassandra-a/commitlog/CommitLog-5-1489701224424.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224425.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224426.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224427.log, /srv/ca
ssandra-a/commitlog/CommitLog-5-1489701224428.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224429.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224430.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224431.log, /srv/cassandra
-a/commitlog/CommitLog-5-1489701224432.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224433.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224434.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224435.log, /srv/cassandra-a/comm
itlog/CommitLog-5-1489701224436.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224437.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224438.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224439.log, /srv/cassandra-a/commitlog/C
ommitLog-5-1489701224440.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224441.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224442.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224443.log, /srv/cassandra-a/commitlog/CommitLo
g-5-1489701224444.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224445.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224446.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224447.log, /srv/cassandra-a/commitlog/CommitLog-5-148
9701224448.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224449.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224450.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224451.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224
452.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224453.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224454.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224455.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224456.log
, /srv/cassandra-a/commitlog/CommitLog-5-1489701224457.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224458.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224459.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224460.log, /srv/
cassandra-a/commitlog/CommitLog-5-1489701224461.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224462.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224463.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224464.log, /srv/cassand
ra-a/commitlog/CommitLog-5-1489701224465.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224466.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224467.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224468.log, /srv/cassandra-a/co
mmitlog/CommitLog-5-1489701224469.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224470.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224471.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224472.log, /srv/cassandra-a/commitlog
/CommitLog-5-1489701224473.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224474.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224475.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224476.log, /srv/cassandra-a/commitlog/Commit
Log-5-1489701224477.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224478.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224479.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224480.log, /srv/cassandra-a/commitlog/CommitLog-5-1
489701224481.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224482.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224483.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224484.log, /srv/cassandra-a/commitlog/CommitLog-5-14897012
24485.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224486.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224487.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224488.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224489.l
og, /srv/cassandra-a/commitlog/CommitLog-5-1489701224490.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224491.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224492.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224493.log, /sr
v/cassandra-a/commitlog/CommitLog-5-1489701224494.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224495.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224496.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224497.log, /srv/cassa
ndra-a/commitlog/CommitLog-5-1489701224498.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224499.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224500.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224501.log, /srv/cassandra-a/
commitlog/CommitLog-5-1489701224502.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224503.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224504.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224505.log, /srv/cassandra-a/commitl
og/CommitLog-5-1489701224506.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224507.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224508.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224509.log, /srv/cassandra-a/commitlog/Comm
itLog-5-1489701224510.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224511.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224512.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224513.log, /srv/cassandra-a/commitlog/CommitLog-5
-1489701224514.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224515.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224516.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224517.log, /srv/cassandra-a/commitlog/CommitLog-5-148970
1224518.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224519.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224520.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224521.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224522
.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224523.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224524.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224525.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224526.log, /
srv/cassandra-a/commitlog/CommitLog-5-1489701224527.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224528.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224529.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224530.log, /srv/cas
sandra-a/commitlog/CommitLog-5-1489701224531.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224532.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224533.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224534.log, /srv/cassandra-
a/commitlog/CommitLog-5-1489701224535.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224536.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224537.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224538.log, /srv/cassandra-a/commi
tlog/CommitLog-5-1489701224539.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224540.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224541.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224542.log, /srv/cassandra-a/commitlog/Co
mmitLog-5-1489701224543.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224544.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224545.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224546.log, /srv/cassandra-a/commitlog/CommitLog
-5-1489701224547.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224548.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224549.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224550.log, /srv/cassandra-a/commitlog/CommitLog-5-1489
701224551.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224552.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224553.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224554.log, /srv/cassandra-a/commitlog/CommitLog-5-14897012245
55.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224556.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224557.log, /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log
ERROR [main] 2017-03-24 13:06:19,949 JVMStabilityInspector.java:78 - Exiting due to error while processing commit log during initialization.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:623) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:303) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:147) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.6.jar:2.2.6]

This happened again today on restbase2001-a, though Cassandra fails at startup with:

[ ... ]
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Could not read commit log descriptor in file /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleReplayError(CommitLogReplayer.java:623) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:303) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLogReplayer.recover(CommitLogReplayer.java:147) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:189) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:169) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:516) [apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.2.6.jar:2.2.6]

Weird, the file size of this commitlog segment is zero:

-rw-r--r-- 1 cassandra cassandra 0 Mar 24 12:50 /srv/cassandra-a/commitlog/CommitLog-5-1489701224558.log

This shouldn't happen, and it's probably worth investigating why, but either way it is going to require removing this log to get Cassandra back up; I'll delete this and restart Cassandra.
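For future occurrences, a small sketch like this could locate zero-length commit log segments before deciding what to remove. The function name empty_segments is hypothetical; the file name pattern follows the segment names in the log above. It only lists files and deletes nothing, and anything it reports should be reviewed with Cassandra stopped.

```shell
#!/bin/bash
# Hypothetical helper: list zero-length commit log segments in a directory.
# "-size 0" matches only empty regular files.
empty_segments() {
    find "$1" -maxdepth 1 -type f -name 'CommitLog-*.log' -size 0 | sort
}

# Example (directory taken from the log above):
#   empty_segments /srv/cassandra-a/commitlog
```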

Deleted /srv/cassandra-b/java_pid33443.hprof on 2012 to reclaim space.

Eevans updated the task description.

I did not have the opportunity to try this before the event subsided on its own, but in the future, temporarily setting tombstone_failure_threshold to a lower value might be an option.

Something like:

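#!/bin/bash
# Temporarily set tombstone_failure_threshold on every local Cassandra
# instance via JMX. Usage: tombstone_threshold_failure.sh [THRESHOLD]

set -e

# Threshold to apply; defaults to 1000 tombstones.
THRESHOLD=${1:-1000}

SJK=/usr/bin/sjk
MBEAN="org.apache.cassandra.db:type=StorageService"
ATTRIBUTE="TombstoneFailureThreshold"

# Print the instance name from an instance config file.
name()
{
    uyaml "$1" /name
}

# Print the JMX port from an instance config file.
jmx_port()
{
    uyaml "$1" /jmx_port
}

# One YAML file per Cassandra instance on this host.
for i in /etc/cassandra-instances.d/*.yaml; do
    # Skip the unexpanded pattern when no instance files exist.
    [ -e "$i" ] || continue
    echo "Setting tombstone_failure_threshold=$THRESHOLD on instance $(name "$i")"
    "$SJK" mx -s "localhost:$(jmx_port "$i")" -b "$MBEAN" -ms -f "$ATTRIBUTE" -v "$THRESHOLD" >/dev/null
done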

Deleted 2004:/srv/cassandra-a/java_pid28335.hprof, and 2010:/srv/cassandra-c/java_pid52532.hprof,/srv/cassandra-c/java_pid67967.hprof,/srv/cassandra-c/java_pid75083.hprof

Deleted:

  • 2004: /srv/cassandra-b/java_pid814.hprof
  • 2005: /srv/cassandra-b/java_pid814.hprof
  • 2001: /srv/cassandra-a/java_pid5021.hprof /srv/cassandra-a/java_pid2347.hprof /srv/cassandra-a/java_pid26573.hprof /srv/cassandra-a/java_pid28144.hprof /srv/cassandra-a/java_pid17332.hprof
  • 2009: /srv/cassandra-a/java_pid24320.hprof /srv/cassandra-a/java_pid12720.hprof /srv/cassandra-a/java_pid6210.hprof /srv/cassandra-a/java_pid2131.hprof
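To see which heap dumps are still taking up space before cleaning up, something along these lines might help. It is a sketch only: the function name heapdumps is hypothetical, and it assumes dumps sit directly inside per-instance directories such as /srv/cassandra-a, as in the paths recorded in this task.

```shell
#!/bin/bash
# Hypothetical helper: print "SIZE_IN_BYTES PATH" for each heap dump found
# one directory level below the given root, largest first.
heapdumps() {
    find "$1" -maxdepth 2 -type f -name 'java_pid*.hprof' | while IFS= read -r f; do
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -rn
}

# Example:
#   heapdumps /srv
```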

Mentioned in SAL (#wikimedia-operations) [2017-04-17T09:33:02Z] <marostegui> Silence alerts for restbase2004 and restbase2009 T160759

For the record:

˜/elukey 10:29> our dear cassandra on restbase2009/2004 (and sometimes 2007) keeps crashing for OOM, tracking task is https://phabricator.wikimedia.org/T160759
˜/elukey 10:29> this time seems a bit bad
˜/elukey 10:30> probably some there are requests for data replicated on 2009/2004/2007 that cause the problem
˜/elukey 10:30> (like causing too many tombstones scanned filling up the heap)
˜/elukey 10:55> I am going offline in a bit but this probably needs the dark arts of somebody from Services
˜/elukey 10:55> it is not impacting user reads (that are served in eqiad)

I have silenced those alerts until 16:30 UTC

Grepping for tombstone warnings, I don't see warnings for anything but the mobile keyspaces, and all of the warnings seem to be for fewer than 2k tombstones. The past OOM events I've analyzed in depth always involved the parsoid tables, typically with tombstone counts closer to double this. I'm not saying that one of these couldn't be the culprit, just that I don't see a useful candidate for blacklisting.
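A search of this kind could be sketched as below. The message wording ("Read N live and M tombstone cells in KEYSPACE.TABLE for key: ...") is an assumption based on the tombstone warning format of Cassandra 2.2 and should be checked against the actual logs; the function name tombstone_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read a Cassandra system.log on stdin and print
# "TOMBSTONE_COUNT KEYSPACE.TABLE" for each tombstone warning, highest first.
# The matched message format is assumed, not confirmed by this task.
tombstone_summary() {
    grep 'tombstone_warn_threshold' \
        | sed -n -E 's/.*Read [0-9]+ live and ([0-9]+) tombstone cells in ([^ ]+) for key.*/\1 \2/p' \
        | sort -k1,1nr
}

# Example (log path is an assumption):
#   tombstone_summary < /var/log/cassandra/system-a.log | head
```

The largest counts give a rough idea of which tables to look at, and of a threshold value that would have cut those reads off.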

It's also possible that these were caused by something that triggered the OOM before any warning was issued (that is, a read with fewer than the 1000 tombstones at which a warning is logged). This would also be a little out of the ordinary.

At this time, both instances are up, so the updates that were triggering this seem to have passed for the time being. If they recur, we should try running P5165 to temporarily/ephemerally lower the tombstone failure threshold. And, if it manifests as two instances again, we should run it first on just one of them (it'd be interesting to know whether this solved it, rather than the event simply subsiding on its own).
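To apply the change to a single instance rather than every instance on a host, a variant along these lines might work. It is a sketch that reuses the sjk invocation, MBean name and attribute from the script in this task; the function name set_threshold_one is hypothetical, and the JMX port is assumed to be looked up separately (the script obtains it with uyaml FILE /jmx_port).

```shell
#!/bin/bash
# Same tool, MBean and attribute as the script in this task.
SJK=${SJK:-/usr/bin/sjk}
MBEAN="org.apache.cassandra.db:type=StorageService"
ATTRIBUTE="TombstoneFailureThreshold"

# Hypothetical helper: set the tombstone failure threshold on one instance.
#   $1 = JMX port of the instance, $2 = threshold (defaults to 1000)
set_threshold_one() {
    "$SJK" mx -s "localhost:$1" -b "$MBEAN" -ms -f "$ATTRIBUTE" -v "${2:-1000}"
}

# Example (the port number is a placeholder):
#   set_threshold_one 7189 1000
```

Running it against only one of the two affected instances would leave the other as a control.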

I am AFK today and tomorrow, but I will try to check in on this periodically (and I will have my phone with me).

local_group_wikimedia_T_mobileDSuCY0V_8QMmb7m5ao.data commons.wikimedia.org:User\:YLSS/BSicon/Stations_and_stops
local_group_wikimedia_T_mobilewKT4BougZJL2msO9eA.data commons.wikimedia.org:User\:YLSS/BSicon/Stations_and_stops
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data de.wikipedia.org:State_Champs
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:Envelope_model
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:Template\:Admin_dashboard/uaarfpp
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Amakuru/Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Basalisk/Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Buffaboy/Dashpanel
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:CorbieVreccan/Admin_Toolbox
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Diannaa/Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Doug_Weller/Admin_Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Hydriz
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Jeraphine_Gryphon/sandbox3
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:L'Aquatique/dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Lid/templates
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:LinguistAtLarge/Today's_AfD
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Nancy/Desk
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:SilkTork/Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Taroaldo
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:TheGeneralUser/Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Willking1979/Admin_info
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Zink_Dawg/DashBoard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data es.wikipedia.org:Antena_3
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data es.wikipedia.org:Telemadrid
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data es.wikipedia.org:Wikipedia\:Tablón_de_anuncios_de_los_bibliotecarios/Portal/Todo_el_tablón_actual
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data hy.wikipedia.org:Իտալերեն_Վիքիպեդիա
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data hy.wikipedia.org:Ճապոներեն_Վիքիպեդիա
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data hy.wikipedia.org:Պարսկերեն_Վիքիպեդիա
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data zh.wikipedia.org:User\:Liangent/dykc3
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data de.wikipedia.org:State_Champs
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:Envelope_model
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:Template\:Admin_dashboard/uaarfpp
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Amakuru/Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Basalisk/Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Buffaboy/Dashpanel
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:CorbieVreccan/Admin_Toolbox
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Diannaa/Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Doug_Weller/Admin_Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Hydriz
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Jeraphine_Gryphon/sandbox3
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:L'Aquatique/dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Lid/templates
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:LinguistAtLarge/Today's_AfD
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Nancy/Desk
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:SilkTork/Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Taroaldo
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:TheGeneralUser/Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Willking1979/Admin_info
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Zink_Dawg/DashBoard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data es.wikipedia.org:Antena_3
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data es.wikipedia.org:Telemadrid
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data es.wikipedia.org:Wikipedia\:Tablón_de_anuncios_de_los_bibliotecarios/Portal/Todo_el_tablón_actual
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data hy.wikipedia.org:Իտալերեն_Վիքիպեդիա
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data hy.wikipedia.org:Պարսկերեն_Վիքիպեդիա
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data zh.wikipedia.org:User\:Liangent/dykc3
local_group_wikipedia_T_summary.data es.wikipedia.org:Telemadrid

local_group_wikimedia_T_mobileDSuCY0V_8QMmb7m5ao.data commons.wikimedia.org:Commons\:Deletion_requests/2017/04
local_group_wikimedia_T_mobilewKT4BougZJL2msO9eA.data commons.wikimedia.org:Commons\:Deletion_requests/2017/04
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:Template\:Admin_dashboard/aiv
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:Template\:Admin_dashboard/testcases
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Ctwabn/start
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Excirial/Dashboard/Content
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:FlyingKiwi/Dash
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:GB_fan/Dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Jeraphine_Gryphon/sandbox3
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Keeper76/dashboard
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Martial75/desk
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Mojoworker
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Ocaasi
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Pldx1/Nolever
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Scott/Utilities/Administration
local_group_wikipedia_T_mobileDSuCY0V_8QMmb7m5ao.data en.wikipedia.org:User\:Sun_Creator/AFD
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:Template\:Admin_dashboard/aiv
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:Template\:Admin_dashboard/testcases
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Ctwabn/start
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Excirial/Dashboard/Content
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:FlyingKiwi/Dash
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:GB_fan/Dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Jeraphine_Gryphon/sandbox3
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Keeper76/dashboard
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Martial75/desk
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Mojoworker
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Ocaasi
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Pldx1/Nolever
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Scott/Utilities/Administration
local_group_wikipedia_T_mobilewKT4BougZJL2msO9eA.data en.wikipedia.org:User\:Sun_Creator/AFD
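
Many of the partition keys in the listing above appear under more than one table. A minimal sketch for grouping such a listing by key, so the repeated titles stand out, might look like the following. It assumes each line has the form `<table>.data <partition key>`, optionally preceded by a paste line number, and that table names start with `local_group_` as they do here; the function name `tables_by_key` is made up for illustration and is not part of any tool used on this ticket.

```python
import re
from collections import defaultdict

# Assumed line shape: optional leading digits (paste line number),
# a table name starting with "local_group_" and ending in ".data",
# whitespace, then the partition key (domain:title).
LINE_RE = re.compile(r"^\d*(local_group_\S+\.data)\s+(.+)$")


def tables_by_key(lines):
    """Map each partition key to the set of table files it was reported under."""
    grouped = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # blank lines and anything that does not fit the assumed shape
        table, key = match.groups()
        grouped[key].add(table)
    return dict(grouped)


def repeated_keys(lines):
    """Return the keys that show up under more than one table, sorted."""
    return sorted(k for k, tables in tables_by_key(lines).items() if len(tables) > 1)
```

Feeding the listing to `repeated_keys` gives the titles that are wide in several tables at once, which are the likeliest candidates for a closer look.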

Deleted:

  • 2004: /srv/cassandra-a/java_pid14987.hprof
  • 2007: /srv/cassandra-c/java_pid26332.hprof

Mentioned in SAL (#wikimedia-operations) [2017-04-30T15:24:21Z] <elukey> set tombstone_failure_threshold=10000 to restbase1009-a with P5165 on restbase1009-a - T160759

Mentioned in SAL (#wikimedia-operations) [2017-04-30T15:31:47Z] <elukey> set tombstone_failure_threshold=1000 to restbase1009-a with P5165 on restbase1009-a - T160759

Mentioned in SAL (#wikimedia-operations) [2017-04-30T16:35:50Z] <urandom> T160759: Restoring default tombstone_threshold on restbase1009

Deleted:

  • 2004: /srv/cassandra-b/java_pid13433.hprof /srv/cassandra-b/java_pid14702.hprof /srv/cassandra-b/java_pid14780.hprof /srv/cassandra-b/java_pid19846.hprof /srv/cassandra-b/java_pid20876.hprof /srv/cassandra-b/java_pid28379.hprof /srv/cassandra-b/java_pid2963.hprof /srv/cassandra-b/java_pid3036.hprof
  • 2009: /srv/cassandra-c/java_pid10730.hprof /srv/cassandra-c/java_pid13145.hprof /srv/cassandra-c/java_pid15859.hprof /srv/cassandra-c/java_pid19221.hprof /srv/cassandra-c/java_pid20321.hprof /srv/cassandra-c/java_pid2322.hprof /srv/cassandra-c/java_pid2485.hprof /srv/cassandra-c/java_pid26821.hprof

Mentioned in SAL (#wikimedia-operations) [2017-05-03T18:39:58Z] <urandom> T160759: reducing tombstone threshold to 1000, restbase1013

Mentioned in SAL (#wikimedia-operations) [2017-05-03T18:46:24Z] <urandom> T160759: reducing tombstone threshold to 1000, restbase1016

Mentioned in SAL (#wikimedia-operations) [2017-05-03T18:48:30Z] <urandom> T160759: reducing tombstone threshold to 1000, restbase1014

Mentioned in SAL (#wikimedia-operations) [2017-05-03T19:18:04Z] <ppchelko@naos> Started deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759

Mentioned in SAL (#wikimedia-operations) [2017-05-03T19:25:43Z] <ppchelko@naos> Finished deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 (duration: 07m 39s)

Mentioned in SAL (#wikimedia-operations) [2017-05-03T19:26:10Z] <ppchelko@naos> Started deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 attempt #2 - checks timeout

Mentioned in SAL (#wikimedia-operations) [2017-05-03T19:27:49Z] <ppchelko@naos> Finished deploy [restbase/deploy@76d909f]: Blacklist a title to fix cassandra OOMs T160759 attempt #2 - checks timeout (duration: 01m 39s)

Mentioned in SAL (#wikimedia-operations) [2017-05-03T20:13:19Z] <urandom> T160759: restoring default tombstone thresholds, restbase10{3,4,6}

Mentioned in SAL (#wikimedia-operations) [2017-05-04T01:22:35Z] <urandom_> T160759: lowering tombstone_threshold on restbase1013 & restbase1014

Mentioned in SAL (#wikimedia-operations) [2017-05-04T02:00:14Z] <urandom> T160759: lowering tombstone threshold to 1000 on all eqiad nodes

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:03:40Z] <urandom> T160759: restoring default Cassandra tombstone_threshold in eqiad

Deleted:

  • 1016: /srv/cassandra-a/java_pid2439.hprof
  • 1009: /srv/cassandra-a/java_pid41816.hprof /srv/cassandra-a/java_pid10476.hprof /srv/cassandra-a/java_pid127771.hprof /srv/cassandra-a/java_pid138155.hprof /srv/cassandra-a/java_pid1605.hprof /srv/cassandra-a/java_pid16067.hprof /srv/cassandra-a/java_pid30333.hprof /srv/cassandra-a/java_pid45499.hprof /srv/cassandra-a/java_pid55310.hprof
  • 1013: /srv/cassandra-a/java_pid24895.hprof
  • 1015: /srv/cassandra-c/java_pid18068.hprof

Cleaning up:

  • 1014: /srv/cassandra-c/java_pid24844.hprof /srv/cassandra-b/java_pid20462.hprof /srv/cassandra-b/java_pid26814.hprof
  • 1015: /srv/cassandra-a/java_pid22063.hprof
  • 1007: /srv/cassandra-b/java_pid2367.hprof
  • 1012: /srv/cassandra-a/java_pid10785.hprof /srv/cassandra-a/java_pid13772.hprof
  • 1013: /srv/cassandra-b/java_pid22815.hprof /srv/cassandra-b/java_pid29447.hprof /srv/cassandra-b/java_pid31034.hprof

Cleaning up:

  • 1008: /srv/cassandra-b/java_pid120757.hprof /srv/cassandra-b/java_pid26084.hprof
  • 1015: /srv/cassandra-c/java_pid10336.hprof
  • 1011: /srv/cassandra-a/java_pid17616.hprof /srv/cassandra-a/java_pid2366.hprof
  • 1008: /srv/cassandra-a/java_pid116404.hprof
  • 1016: /srv/cassandra-c/java_pid2423.hprof /srv/cassandra-c/java_pid30887.hprof /srv/cassandra-c/java_pid40078.hprof
  • 2006: /srv/cassandra-b/java_pid2434.hprof /srv/cassandra-b/java_pid9933.hprof
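
For reference, the cleanup lists above could be produced with something along these lines. This is only a sketch: it assumes heap dumps are always written as `java_pid<pid>.hprof` directly inside a per-instance directory named `cassandra-<instance>` under `/srv`, as in the paths logged on this ticket, and the helper name `find_heap_dumps` is hypothetical. It only lists files and does not delete anything.

```python
import re
from pathlib import Path

# File name pattern taken from the paths logged on this ticket.
HPROF_RE = re.compile(r"^java_pid(\d+)\.hprof$")


def find_heap_dumps(root="/srv"):
    """Return sorted (instance_dir, pid, size_bytes) tuples for heap dumps under root."""
    found = []
    for path in Path(root).glob("cassandra-*/java_pid*.hprof"):
        match = HPROF_RE.match(path.name)
        if match and path.is_file():
            found.append((path.parent.name, int(match.group(1)), path.stat().st_size))
    return sorted(found)


if __name__ == "__main__":
    for instance, pid, size in find_heap_dumps():
        print(f"{instance}: java_pid{pid}.hprof ({size} bytes)")
```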

We're no longer experiencing chronic OOM exceptions, and so should no longer need a ticket to document them. Closing.