
Investigate recent OOM events on restbase2004
Closed, Resolved (Public)

Description

Cassandra has exited abruptly on restbase2004.codfw.wmnet due to OOM exceptions at least six times in the last 24 hours:

  • 2016-03-17T10:52:00
  • 2016-03-17T13:33:00
  • 2016-03-17T15:27:00
  • 2016-03-17T16:29:00
  • 2016-03-17T17:35:00
  • 2016-03-17T20:14:00
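
For reference, these exits can be enumerated straight from the logs; a minimal sketch, assuming the default log location of /var/log/cassandra/system.log on this host:

# List OOM-driven exits recorded by JVMStabilityInspector, with timestamps.
# The log path is an assumption; adjust for this cluster's layout.
grep -h 'JVM state determined to be unstable' /var/log/cassandra/system.log* \
  | awk '{print $3, $4}' | sort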

From a "One of These Things Is Not Like the Others" perspective, the thing that stands out is this exception:

ERROR [CompactionExecutor:2] 2016-03-17 17:16:36,264 CassandraDaemon.java:229 - Exception in thread Thread[CompactionExecutor:2,1,main]
java.lang.IllegalArgumentException: Not enough bytes. Offset: 2. Length: 25497. Buffer size: 16515
    at org.apache.cassandra.db.composites.AbstractCType.checkRemaining(AbstractCType.java:378) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.composites.AbstractCompoundCellNameType.fromByteBuffer(AbstractCompoundCellNameType.java:100) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:398) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.composites.AbstractCType$Serializer.deserialize(AbstractCType.java:382) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.io.sstable.IndexHelper$IndexInfo$Serializer.deserialize(IndexHelper.java:193) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.io.sstable.IndexHelper$IndexInfo$Serializer.deserialize(IndexHelper.java:174) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:129) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.computeNext(SSTableScanner.java:278) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.computeNext(SSTableScanner.java:231) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
    at org.apache.cassandra.io.sstable.SSTableScanner.hasNext(SSTableScanner.java:211) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
    at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:191) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:73) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:263) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_72-internal]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72-internal]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72-internal]

I believe this is an indication of a corrupt SSTable.

Since this occurs during compaction of the affected table (which I haven't yet been able to identify), it recurs on a cycle, which might explain why this node's utilization looks reasonable right up until it enters a death spiral and exits with an OOM. But this is just a working theory; the investigation continues.
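
One way to narrow down the affected table is to correlate the exception with the compaction start messages, which CompactionTask logs at INFO as "Compacting [SSTableReader(path='...'), ...]". A rough sketch, again assuming the default log location:

# Show compaction start lines preceding each deserialization failure;
# the SSTable paths name the keyspace and table.
grep -B 50 'IllegalArgumentException: Not enough bytes' /var/log/cassandra/system.log \
  | grep 'Compacting'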

Event Timeline

Mentioned in SAL [2016-03-17T20:40:48Z] <urandom> Issuing nodetool scrub -s -- local_group_wikipedia_T_parsoid_html data on restbase2004.eqiad.wmnet : T130254

Mentioned in SAL [2016-03-18T09:08:11Z] <godog> Issuing nodetool scrub -s -- local_group_wikipedia_T_parsoid_html data on restbase2004.eqiad.wmnet : T130254

It would seem that Puppet is restarting downed nodes. This is a surprise to me; I'd have sworn we'd set this up otherwise:

The OOM and subsequent restart:

ERROR [CompactionExecutor:31] 2016-03-19 16:09:59,978 JVMStabilityInspector.java:117 - JVM state determined to be unstable. Exiting forcefully due to:
java.lang.OutOfMemoryError: Java heap space
    at java.util.ArrayList.<init>(ArrayList.java:152) ~[na:1.8.0_72-internal]
    at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:127) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.computeNext(SSTableScanner.java:278) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.computeNext(SSTableScanner.java:231) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
    at org.apache.cassandra.io.sstable.SSTableScanner.hasNext(SSTableScanner.java:211) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:146) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:125) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:99) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
    at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:191) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:73) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:263) ~[apache-cassandra-2.1.13.jar:2.1.13]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_72-internal]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_72-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72-internal]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72-internal]
INFO [main] 2016-03-19 16:31:19,979 CassandraDaemon.java:153 - Hostname: restbase2004.codfw.wmnet
INFO [main] 2016-03-19 16:31:20,018 YamlConfigurationLoader.java:92 - Loading settings from file:/etc/cassandra/cassandra.yaml

Puppet log:

Mar 19 16:31:18 restbase2004 puppet-agent[931]: (/Stage[main]/Cassandra/Cassandra::Instance[default]/Base::Service_unit[cassandra]/Service[cassandra]/ensure) ensure changed 'stopped' to 'running'
Mar 19 16:31:18 restbase2004 puppet-agent[931]: (/Stage[main]/Cassandra/Cassandra::Instance[default]/Base::Service_unit[cassandra]/Service[cassandra]) Unscheduling refresh on Service[cassandra]
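
So any maintenance that needs the node to stay down has to disable the agent first; something along these lines (a sketch only; the exact service invocation depends on our puppetization):

# Keep Puppet from restarting Cassandra while we work; re-enable when done.
puppet agent --disable 'T130254: cassandra down for sstable scrub'
service cassandra stop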

Mentioned in SAL [2016-03-22T20:47:57Z] <urandom> Stopping restbase2004.codfw.wmnet for offline sstablescrub : T130254

I have started offline sstablescrubs of {local_group_wikipedia_T_parsoid_html,local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ}.data from root-owned screen sessions.
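
The invocations were along these lines (a sketch; sstablescrub is the offline tool, must be run with Cassandra stopped, and takes the keyspace and table as positional arguments):

# One detached, root-owned screen session per keyspace being scrubbed.
screen -dmS scrub-html sstablescrub local_group_wikipedia_T_parsoid_html data
screen -dmS scrub-data sstablescrub local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ data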

Full disclosure: I just fat-fingered a key sequence in the screen session for the parsoid html scrub and inadvertently ctrl+c'd out of it. :(

Looking at the scrollback, the cancelled parsoid html scrub doesn't appear to have progressed past the first SSTable:

Scrubbing SSTableReader(path='/var/lib/cassandra/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/local_group_wikipedia_T_parsoid_html-data-ka-211174-Data.db') (1558364466339 bytes)

It was actively scrubbing that SSTable (it doesn't appear to have stalled), but at this rate it will take quite some time to complete.

https://issues.apache.org/jira/browse/CASSANDRA-11324 (Make SSTABLESCRUB and related jobs multi-threaded / parallelize somehow)

As slowly as the offline scrub is progressing, I'm concerned this approach isn't tractable. I propose to (see the sketch after this list):

  • Start the node back up
  • Disable gossip
  • Disable the binary protocol
  • Increase concurrent compactors
  • Online scrub
  • Hope it does not OOM
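
Concretely, something like the following; a sketch only, and note that in Cassandra 2.1 concurrent_compactors is a cassandra.yaml setting, hence the need to disable puppet and edit the config rather than flip it via nodetool:

# With puppet disabled and concurrent_compactors raised in /etc/cassandra/cassandra.yaml:
service cassandra start
nodetool disablegossip     # drop out of the ring
nodetool disablebinary     # refuse CQL clients
nodetool scrub -s -- local_group_wikipedia_T_parsoid_html data    # -s skips corrupted rows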

Mentioned in SAL [2016-03-23T17:14:18Z] <urandom> Cancelling offline scrubs on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T17:16:23Z] <urandom> Disabling puppet on restbase2004.codfw.wmnet to override compactor concurrency : T130254

Mentioned in SAL [2016-03-23T17:18:27Z] <urandom> Starting Cassandra on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T17:21:03Z] <urandom> Disabling gossip and binary transport on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T17:24:09Z] <urandom> Starting scrub of parsoid_html on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T17:36:20Z] <urandom> Increasing compactionthroughput to 100MB/s on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T17:48:25Z] <urandom> Increasing compactionthroughput to 120MB/s on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T17:54:12Z] <urandom> Removing old heap dumps on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T18:05:49Z] <urandom> Increasing compactionthroughput to 200MB/s on restbase2004.codfw.wmnet : T130254

Mentioned in SAL [2016-03-23T18:12:55Z] <urandom> Removing compaction throughput throttling on restbase2004.codfw.wmnet : T130254
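
For reference, compaction throughput is adjustable at runtime, and a value of 0 removes the throttle entirely:

nodetool setcompactionthroughput 200    # MB/s
nodetool setcompactionthroughput 0      # 0 = unthrottled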

Mentioned in SAL [2016-03-24T16:47:01Z] <urandom> Reactivating restbase2004.codfw.wmnet : T130254

The above strategy succeeded in completing the scrub operation, and significant corruption was found in a single ~240GB SSTable file of the local_group_wikipedia_T_parsoid_html.data table. Considering the number of rows affected (based on logstash, 34836549 rows were discarded) and the amount of time the node has been down (including accumulated downtime from the OOM exceptions and the attempt at an offline scrub), consensus during this morning's op-services-syncup meeting was to consider the node a total loss and to rebuild it.

However, after some discussion with @fgiunchedi, it would seem that we have some days before the earliest possible attempt at replacing this node, which provides a bit of additional time to continue work on salvaging it (and perhaps learn something in the process). So I have brought the node back online and will monitor it closely. If the OOM exceptions have ceased, then we can consider starting a repair of the parsoid_html table.

Good News/Bad News:

The Bad:

We seem to have lost a disk; the RAID array is degraded (md0 is continuing on 3 of 4 devices):

[...]
[4401904.509179] hpsa 0000:03:00.0: HP SSD Smart Path: Retrying on standard path.
[4401907.211824] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4401907.211843] sd 0:0:0:0: [sda] Sense Key : Medium Error [current]
[4401907.211849] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error
[4401907.211852] sd 0:0:0:0: [sda] CDB:
[4401907.211856] Write(10): 2a 00 00 00 08 08 00 00 01 00
[4401907.211869] blk_update_request: critical medium error, dev sda, sector 2056
[4401907.244220] md: super_written gets error=-61, uptodate=0
[4401907.244238] md/raid1:md0: Disk failure on sda1, disabling device.
md/raid1:md0: Operation continuing on 3 devices.
[4401907.308528] RAID1 conf printout:
[4401907.308531] --- wd:3 rd:4
[4401907.308535] disk 0, wo:1, o:0, dev:sda1
[4401907.308537] disk 1, wo:0, o:1, dev:sdb1
[4401907.308539] disk 2, wo:0, o:1, dev:sdc1
[4401907.308541] disk 3, wo:0, o:1, dev:sdd1
[4401907.325791] RAID1 conf printout:
[4401907.325795] --- wd:3 rd:4
[4401907.325798] disk 1, wo:0, o:1, dev:sdb1
[4401907.325799] disk 2, wo:0, o:1, dev:sdc1
[4401907.325801] disk 3, wo:0, o:1, dev:sdd1
[4401932.348955] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4401966.301902] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
[4401966.302019] hpsa 0000:03:00.0: invalid command: LUN:0000004000000000 CDB:0000000000000000f000000000000000
[4401966.302023] hpsa 0000:03:00.0: probably means device no longer present
[4401966.302087] hpsa 0000:03:00.0: invalid command: LUN:0000004000000000 CDB:0000000000000000000000f000000000
[4401966.302089] hpsa 0000:03:00.0: probably means device no longer present
[4401966.302092] hpsa 0000:03:00.0: FAILED abort on device C0:B0:T0:L0
[4401966.302097] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
[...]
[4402114.450593] INFO: task md0_raid1:340 blocked for more than 120 seconds.
[4402114.481110] Not tainted 3.19.0-2-amd64 #1
[4402114.502340] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4402114.538966] md0_raid1 D ffff881034465230 0 340 2 0x00000000
[4402114.538969] ffff881034465230 0000000000000001 ffff88103924f3f0 0000000000014180
[4402114.538971] ffff88103446bfd8 0000000000014180 ffff881034465230 0000000000000246
[4402114.538973] ffff8810349e0800 ffff8810349e0a98 ffff8810349e0a90 ffff8810349e0818
[4402114.538975] Call Trace:
[4402114.539014] [<ffffffffa0126465>] ? md_super_wait+0x45/0x80 [md_mod]
[4402114.539035] [<ffffffff810a8fd0>] ? wait_woken+0x90/0x90
[4402114.539042] [<ffffffffa01268de>] ? md_update_sb+0x23e/0x680 [md_mod]
[4402114.539046] [<ffffffffa0127dea>] ? md_check_recovery+0x18a/0x470 [md_mod]
[4402114.539050] [<ffffffffa004f0e9>] ? raid1d+0x49/0xd50 [raid1]
[4402114.539055] [<ffffffff8154eab7>] ? __schedule+0x7/0x810
[4402114.539057] [<ffffffff8155236a>] ? schedule_timeout+0x21a/0x290
[4402114.539062] [<ffffffffa011fc86>] ? md_thread+0x126/0x130 [md_mod]
[4402114.539064] [<ffffffff810a8fd0>] ? wait_woken+0x90/0x90
[4402114.539067] [<ffffffffa011fb60>] ? md_stop+0x40/0x40 [md_mod]
[4402114.539071] [<ffffffff8108a373>] ? kthread+0xd3/0xf0
[4402114.539073] [<ffffffff8108a2a0>] ? kthread_create_on_node+0x180/0x180
[4402114.539075] [<ffffffff81553418>] ? ret_from_fork+0x58/0x90
[4402114.539077] [<ffffffff8108a2a0>] ? kthread_create_on_node+0x180/0x180
[4402114.539078] INFO: task jbd2/md0-8:369 blocked for more than 120 seconds.
[4402114.570074] Not tainted 3.19.0-2-amd64 #1
[4402114.591358] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4402114.627887] jbd2/md0-8 D ffff881034464920 0 369 2 0x00000000
[4402114.627889] ffff881034464920 ffffea0081fedf00 ffff881d7847ebe0 0000000000014180
[4402114.627890] ffff8810349d3fd8 0000000000014180 ffff881034464920 0000000000000246
[4402114.627893] ffff88108b1ee100 ffff88108b1ee160 ffff8810349950b8 ffff8810349d3db0
[4402114.627894] Call Trace:
[4402114.627915] [<ffffffffa0176076>] ? jbd2_journal_commit_transaction+0x216/0x1740 [jbd2]
[4402114.627918] [<ffffffff8109c2e6>] ? set_next_entity+0x56/0x70
[4402114.627930] [<ffffffff810a3650>] ? pick_next_task_fair+0x1b0/0x850
[4402114.627935] [<ffffffff810a8fd0>] ? wait_woken+0x90/0x90
[4402114.627938] [<ffffffff810cbd4c>] ? try_to_del_timer_sync+0x4c/0x60
[4402114.627942] [<ffffffffa017acbd>] ? kjournald2+0xcd/0x270 [jbd2]
[4402114.627944] [<ffffffff810a8fd0>] ? wait_woken+0x90/0x90
[4402114.627963] [<ffffffffa017abf0>] ? commit_timeout+0x20/0x20 [jbd2]
[4402114.627964] [<ffffffff8108a373>] ? kthread+0xd3/0xf0
[4402114.627966] [<ffffffff8108a2a0>] ? kthread_create_on_node+0x180/0x180
[4402114.627968] [<ffffffff81553418>] ? ret_from_fork+0x58/0x90
[4402114.627970] [<ffffffff8108a2a0>] ? kthread_create_on_node+0x180/0x180
[4402114.627974] INFO: task jbd2/dm-0-8:611 blocked for more than 120 seconds.
[4402114.659471] Not tainted 3.19.0-2-amd64 #1
[4402114.682356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[4402114.718771] jbd2/dm-0-8 D ffff88203778c250 0 611 2 0x00000000
[4402114.718927] ffff88203778c250 ffff8810381e6000 ffff88103924f3f0 0000000000014180
[4402114.718928] ffff882036013fd8 0000000000014180 ffff88203778c250 ffffffff8101cc05
[4402114.719081] ffff88207fc74a30 ffff88107ffd6a60 0000000000000002 ffffffff8154fde0
[4402114.719084] Call Trace:
[4402114.719089] [<ffffffff8101cc05>] ? read_tsc+0x5/0x10
[4402114.719090] [<ffffffff8154fde0>] ? bit_wait_timeout+0x80/0x80
[4402114.719093] [<ffffffff8154f5da>] ? io_schedule+0xaa/0x140
[4402114.719094] [<ffffffff8154fe14>] ? bit_wait_io+0x34/0x60
[4402114.719096] [<ffffffff8154f9af>] ? __wait_on_bit+0x5f/0x90
[4402114.719100] [<ffffffff8129d915>] ? queue_unplugged+0x25/0x90
[4402114.719101] [<ffffffff8154fde0>] ? bit_wait_timeout+0x80/0x80
[4402114.719102] [<ffffffff8154fa4e>] ? out_of_line_wait_on_bit+0x6e/0x80
[4402114.719105] [<ffffffff810a9000>] ? autoremove_wake_function+0x30/0x30
[4402114.719110] [<ffffffffa0176d0f>] ? jbd2_journal_commit_transaction+0xeaf/0x1740 [jbd2]
[4402114.719115] [<ffffffffa017acbd>] ? kjournald2+0xcd/0x270 [jbd2]
[4402114.719117] [<ffffffff810a8fd0>] ? wait_woken+0x90/0x90
[4402114.719121] [<ffffffffa017abf0>] ? commit_timeout+0x20/0x20 [jbd2]
[4402114.719124] [<ffffffff8108a373>] ? kthread+0xd3/0xf0
[4402114.719126] [<ffffffff8108a2a0>] ? kthread_create_on_node+0x180/0x180
[4402114.719127] [<ffffffff81553418>] ? ret_from_fork+0x58/0x90
[4402114.719129] [<ffffffff8108a2a0>] ? kthread_create_on_node+0x180/0x180
[...]
[4403145.320775] blk_update_request: critical medium error, dev sda, sector 1029788264
[4403145.357401] Aborting journal on device dm-0-8.
[4403145.357491] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1022 pages, ino 143135284; err -30
[4403145.357516] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 34865179; err -30
[4403145.357518] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 34865379; err -30
[4403145.357520] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 34865383; err -30
[4403145.357727] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 143135886; err -30
[4403145.357729] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 143136052; err -30
[4403145.357731] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 143136167; err -30
[4403145.357733] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 34865265; err -30
[4403175.705868] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
[4403176.711420] hpsa 0000:03:00.0: device is ready.
[4403176.711429] hpsa 0000:03:00.0: Reset as abort: Device recovered from reset: scsi3addr 0x000000c000000001
[4403176.711433] hpsa 0000:03:00.0: ABORT REQUEST on C0:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x0 REQUEST SUCCEEDED.
[4403176.711456] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
[4403176.749374] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[4403176.790437] EXT4-fs (dm-0): Remounting filesystem read-only
[4403192.102023] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4403199.302030] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4403199.302050] sd 0:0:0:0: [sda] Sense Key : Medium Error [current]
[4403199.302053] sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error
[4403199.302054] sd 0:0:0:0: [sda] CDB:
[4403199.302056] Write(10): 2a 00 03 9f e4 00 00 00 08 00
[4403199.302061] blk_update_request: critical medium error, dev sda, sector 60810240
[4403199.335848] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[4403199.369561] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[4403208.354499] hpsa 0000:03:00.0: HP SSD Smart Path: task complete with check condition.
[4403208.354503] hpsa 0000:03:00.0: HP SSD Smart Path: Retrying on standard path.
[4403222.114900] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4403229.573738] hpsa 0000:03:00.0: Abort request on C0:B0:T0:L0
[4403230.579311] hpsa 0000:03:00.0: device is ready.
[4403230.579321] hpsa 0000:03:00.0: Reset as abort: Device recovered from reset: scsi3addr 0x000000c000000001
[4403230.579325] hpsa 0000:03:00.0: ABORT REQUEST on C0:B0:T0:L0 Tag:0x00000000:00000010 Command:0x2a SN:0x0 REQUEST SUCCEEDED.
[4403252.127419] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4403253.283431] EXT4-fs error (device dm-0): ext4_journal_check_start:56:
[4403253.283436] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[4403253.354938] Detected aborted journal
[4403253.354957] EXT4-fs (dm-0): ext4_writepages: jbd2_start: 1024 pages, ino 34865378; err -30
[4406814.597501] hpsa 0000:03:00.0: HP SSD Smart Path: task complete with check condition.
[4406814.597509] hpsa 0000:03:00.0: HP SSD Smart Path: Retrying on standard path.
[4406820.818190] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4410445.560631] hpsa 0000:03:00.0: HP SSD Smart Path: task complete with check condition.
[4410445.560639] hpsa 0000:03:00.0: HP SSD Smart Path: Retrying on standard path.
[4410449.487597] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4414069.666451] hpsa 0000:03:00.0: HP SSD Smart Path: task complete with check condition.
[4414069.666459] hpsa 0000:03:00.0: HP SSD Smart Path: Retrying on standard path.
[4414078.158926] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
[4417690.664284] hpsa 0000:03:00.0: HP SSD Smart Path: task complete with check condition.
[4417690.664292] hpsa 0000:03:00.0: HP SSD Smart Path: Retrying on standard path.
[4417706.829618] hpsa 0000:03:00.0: Direct-Access device c0b0t0l0 updated.
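
The array state is also visible directly from the md layer; for example:

# md0 should show one failed member, with the array continuing on 3 of 4 devices.
cat /proc/mdstat
mdadm --detail /dev/md0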

The Good:

A hardware issue might explain the otherwise mysterious corruption.

After the successful scrub, the node went more than 24 hours without an OOM (a record since these problems began), so I think it is safe to say that the exceptions were the result of corrupt data. And, given T130990, it seems safe to assume that the corruption was the result of failing hardware.

Why Cassandra OOM'd in the face of this corruption is not certain (it shouldn't have), but further troubleshooting won't be possible: locating the corruption required the scrub (which is destructive to the corruption), and the node did not have the disk space to take a snapshot and preserve the affected file. That wouldn't have been an easy task anyway; the file was about 250GB in size.

I'm not sure there is anything more that can be done with this issue, so I'm closing it.