Page MenuHomePhabricator

restbase2004.codfw.wmnet data corruption
Closed, ResolvedPublic

Description

2016-09-06 16:30:43: Icinga registered a service failure of CQL on restbase2004-b.codfw.wmnet, the result of a shutdown after encountering data corruption.

ERROR [CompactionExecutor:12945] 2016-09-06 16:24:37,145 CassandraDaemon.java:185 - Exception in threa
d Thread[CompactionExecutor:12945,1,main]
org.apache.cassandra.io.FSReadError: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-35707-big-Data.db
        at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:358) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:359) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:322) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:126) ~[apache-cassandra-2.2.6.jar:2.2.6]
        [...]

Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-35707-big-Data.db
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferStandard(CompressedRandomAccessReader.java:153) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:230) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:42) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:346) ~[apache-cassandra-2.2.6.jar:2.2.6]
        ... 30 common frames omitted
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-35707-big-Data.db): corruption detected, chunk at 24958152 of length 57619.
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferStandard(CompressedRandomAccessReader.java:124) ~[apache-cassandra-2.2.6.jar:2.2.6]
        ... 33 common frames omitted
ERROR [CompactionExecutor:12945] 2016-09-06 16:24:37,147 StorageService.java:467 - Stopping gossiper
WARN  [CompactionExecutor:12945] 2016-09-06 16:24:37,147 StorageService.java:373 - Stopping gossip by operator request
INFO  [CompactionExecutor:12945] 2016-09-06 16:24:37,147 Gossiper.java:1448 - Announcing shutdown
INFO  [CompactionExecutor:12945] 2016-09-06 16:24:37,149 StorageService.java:1937 - Node /10.192.32.138 state jump to shutdown
ERROR [CompactionExecutor:12945] 2016-09-06 16:24:39,170 StorageService.java:477 - Stopping native transport
INFO  [CompactionExecutor:12945] 2016-09-06 16:24:39,255 Server.java:218 - Stop listening for CQL clients
NOTE: All exceptions are for /srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-35707-big-Data.db

Additionally, this was found in dmesg:

10923014.250916] hpsa 0000:03:00.0: scsi 0:1:0:1 Aborting command ffff880cd6f1c9c0Tag:0x00000000:00000120 CDBLen: 10 CDB: 0x2a00... SN: 0x0  BEING SENT
[10923014.250922] hpsa 0000:03:00.0: scsi 0:1:0:1: Aborting command Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
[10923014.250960] hpsa 0000:03:00.0: scsi 0:1:0:1 Aborting command ffff880cd6f1c9c0Tag:0x00000000:00000120 CDBLen: 10 CDB: 0x2a00... SN: 0x0  SENT, FAILED
[10923014.250968] hpsa 0000:03:00.0: scsi 0:1:0:1: FAILED to abort command Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
[10923031.235036] hpsa 0000:03:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
[10923046.114183] hpsa 0000:03:00.0: aborted: LUN:000000c000000101 CDB:12000000310000000000000000000000
[10923046.114189] hpsa 0000:03:00.0: hpsa_update_device_info: inquiry failed
[10923046.146611] hpsa 0000:03:00.0: Inquiry failed, skipping device.
[10923046.163938] hpsa 0000:03:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
[10923046.173235] hpsa 0000:03:00.0: scsi 0:0:1:0: removed Direct-Access     ATA      Samsung SSD 850  PHYS DRV SSDSmartPathCap- En- Exp=0
[10923136.937859] hpsa 0000:03:00.0: scsi 0:0:1:0: masked Direct-Access     ATA      Samsung SSD 850  PHYS DRV SSDSmartPathCap- En- Exp=0

2016-09-07 14:10:11: Similar corruption has become evident on 2004-c.

ERROR [CompactionExecutor:15586] 2016-09-07 14:37:32,955 CassandraDaemon.java:201 - Exception in thread Thread[CompactionExecutor:15586,1,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: /srv/cassandra-c/data/local_group_globaldomain_T_mathoid_png/data-776488b0ef6911e59c486ffb300f2009/la-9319-big-Data.db
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferStandard(CompressedRandomAccessReader.java:153) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:230) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:42) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:346) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:359) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:322) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.ColumnSerializer.deserializeColumnBody(ColumnSerializer.java:132) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.OnDiskAtom$Serializer.deserializeFromSSTable(OnDiskAtom.java:86) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:52) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.AbstractCell$1.computeNext(AbstractCell.java:46) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.hasNext(SSTableIdentityIterator.java:169) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.utils.MergeIterator$OneToOne.computeNext(MergeIterator.java:202) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
        at com.google.common.collect.Iterators$7.computeNext(Iterators.java:645) ~[guava-16.0.jar:na]
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) ~[guava-16.0.jar:na]
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) ~[guava-16.0.jar:na]
        at org.apache.cassandra.db.ColumnIndex$Builder.buildForCompaction(ColumnIndex.java:166) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.LazilyCompactedRow.write(LazilyCompactedRow.java:125) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:136) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:117) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.writers.DefaultCompactionWriter.append(DefaultCompactionWriter.java:64) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:184) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:74) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:256) ~[apache-cassandra-2.2.6.jar:2.2.6]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_102]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_102]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_102]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_102]
Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/srv/cassandra-c/data/local_group_globaldomain_T_mathoid_png/data-776488b0ef6911e59c486ffb300f2009/la-9319-big-Data.db): corruption detected, chunk at 90435132 of length 243858.
        at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferStandard(CompressedRandomAccessReader.java:124) ~[apache-cassandra-2.2.6.jar:2.2.6]
        ... 33 common frames omitted
NOTE: There is no additional output in dmesg.
NOTE: All exceptions are for /srv/cassandra-c/data/local_group_globaldomain_T_mathoid_png/data-776488b0ef6911e59c486ffb300f2009/la-9319-big-Data.db.

2016-09-10 05:26:29: Back to local_group_wikipedia_T_parsoid_html and instance 2004-b.

NOTE: There is no additional output in dmesg.
NOTE: Exceptions are for two files (same table), /srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-35707-big-Data.db, and /srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-35726-big-Data.db.

2016-09-12 06:44:53: Once more, local_group_wikipedia_T_parsoid_html on 2004-b.

NOTE: No additional dmesg output.
NOTE: Exceptions are all for the same file, /srv/cassandra-b/data/local_group_wikipedia_T_parsoid_html/data-f3648bc0c2cb11e49ce6a1da77f2fd34/la-36300-big-Data.db.

Event Timeline

Mentioned in SAL [2016-09-06T17:02:21Z] <urandom> T144826: Restaring Cassandra on restbase2004-a.codfw.wmnet

Mentioned in SAL [2016-09-06T17:10:27Z] <urandom> T144826: Starting online scrub

Mentioned in SAL [2016-09-06T17:13:33Z] <urandom> T144826: Lifting compaction throttle

Eevans renamed this task from restbase2004-b.codfw.wmnet CQL service failure to restbase2004-b.codfw.wmnet data corruption.Sep 6 2016, 5:17 PM
Eevans triaged this task as High priority.
Eevans updated the task description. (Show Details)

Mentioned in SAL [2016-09-06T17:20:24Z] <urandom> T144826: Ephemerally increasing compactor thread count from 10 to 20

@fgiunchedi reports (on IRC), that he checked the raid status and all disks are reported as being healthy.

Current status:

An online scrub has been running since 17:10UTC. For purposes of this scrub, compaction throughput has been set to 0 (disabled), and the number of compaction threads was doubled; Since the machine is for all intents and purposes "off-line", it has been (ephemerally) reconfigured for maximum compaction throughput.

We are now down to the final two scrubs, the larger of which is at ~44%.

No further error messages have been logged to dmesg at this time.

The plan is: When (/if) the scrub completes successfully, we can attempt to bring it back online. I'm estimating this to be ~02:00UTC.

Mentioned in SAL [2016-09-07T02:06:39Z] <urandom> T144826: Restarting Cassandra restbase2004-b.codfw.wmnet (putting back into service)

Eevans renamed this task from restbase2004-b.codfw.wmnet data corruption to restbase2004.codfw.wmnet data corruption.Sep 7 2016, 2:38 PM
Eevans updated the task description. (Show Details)
Eevans updated the task description. (Show Details)

Mentioned in SAL [2016-09-07T14:45:03Z] <urandom> T144826: Removing compaction rate limit, increasing compactor threads from 10 to 20, and beginning scrub of local_group_globaldomain_T_mathoid_png.data

Mentioned in SAL [2016-09-07T14:45:18Z] <urandom> T144826: Removing compaction rate limit, increasing compactor threads from 10 to 20, and beginning scrub of local_group_globaldomain_T_mathoid_png.data (restbase2004-c.codfw.wmnet)

Mentioned in SAL [2016-09-07T14:49:04Z] <urandom> T144826: Restarting Cassandra on restbase2004-c.codfw.wmnet (scrub complete, re-joining cluster)

Mentioned in SAL [2016-09-10T12:36:42Z] <urandom> T144826: Removing compaction rate limit, increasing compactor threads (from 10 to 20), and beginning scrub of local_group_wikipedia_T_parsoid_html.data (restbase2004-b.codfw.wmnet)

Mentioned in SAL [2016-09-10T22:30:22Z] <urandom> T144826: Restarting Cassandra on restbase2004-b.codfw.wmnet (scrub complete, re-joining cluster)

Mentioned in SAL [2016-09-12T14:36:58Z] <urandom> T144826: Removing compaction rate limit, increasing compactor threads (from 10 to 20), and beginning scrub of local_group_wikipedia_T_parsoid_html.data (restbase2004-b.codfw.wmnet)

I've checked via the syslog server for reoccurence of this error in dmesg and only restbase2004 has a non-inquiry CDB reported

$ fgrep restbase /tmp/hpsa_syslog 
syslog.log-20160907.gz:Sep  6 06:46:00 restbase2004 kernel: [10923014.250916] hpsa 0000:03:00.0: scsi 0:1:0:1 Aborting command ffff880cd6f1c9c0Tag:0x00000000:00000120 CDBLen: 10 CDB: 0x2a00... SN: 0x0  BEING SENT
syslog.log-20160907.gz:Sep  6 06:46:00 restbase2004 kernel: [10923014.250922] hpsa 0000:03:00.0: scsi 0:1:0:1: Aborting command Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
syslog.log-20160907.gz:Sep  6 06:46:00 restbase2004 kernel: [10923014.250960] hpsa 0000:03:00.0: scsi 0:1:0:1 Aborting command ffff880cd6f1c9c0Tag:0x00000000:00000120 CDBLen: 10 CDB: 0x2a00... SN: 0x0  SENT, FAILED
syslog.log-20160907.gz:Sep  6 06:46:00 restbase2004 kernel: [10923014.250968] hpsa 0000:03:00.0: scsi 0:1:0:1: FAILED to abort command Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
syslog.log-20160907.gz:Sep  6 06:46:17 restbase2004 kernel: [10923031.235036] hpsa 0000:03:00.0: scsi 0:1:0:1: resetting logical  Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
syslog.log-20160907.gz:Sep  6 06:46:32 restbase2004 kernel: [10923046.114183] hpsa 0000:03:00.0: aborted: LUN:000000c000000101 CDB:12000000310000000000000000000000
syslog.log-20160907.gz:Sep  6 06:46:32 restbase2004 kernel: [10923046.114189] hpsa 0000:03:00.0: hpsa_update_device_info: inquiry failed
syslog.log-20160907.gz:Sep  6 06:46:32 restbase2004 kernel: [10923046.146611] hpsa 0000:03:00.0: Inquiry failed, skipping device.
syslog.log-20160907.gz:Sep  6 06:46:32 restbase2004 kernel: [10923046.163938] hpsa 0000:03:00.0: scsi 0:1:0:1: reset logical  completed successfully Direct-Access     HP       LOGICAL VOLUME   RAID-0 SSDSmartPathCap+ En+ Exp=1
syslog.log-20160907.gz:Sep  6 06:46:32 restbase2004 kernel: [10923046.173235] hpsa 0000:03:00.0: scsi 0:0:1:0: removed Direct-Access     ATA      Samsung SSD 850  PHYS DRV SSDSmartPathCap- En- Exp=0
syslog.log-20160907.gz:Sep  6 06:48:02 restbase2004 kernel: [10923136.937859] hpsa 0000:03:00.0: scsi 0:0:1:0: masked Direct-Access     ATA      Samsung SSD 850  PHYS DRV SSDSmartPathCap- En- Exp=0

so on Sep 6 06:46:00 a scsi write failed (CDB 0x2a00), on Sep 6 06:46:17 the controller reset the drive.
The later messages on Sep 6 06:46:32 refer to inquiry (CDB 0x1200) and on Sep 6 06:48:02 the controller masked the direct access device scsi 0:0:1:0 which we're not using, as opposed to the "raid device" scsi 0:1:0:1 from earlier. The disk in question is sdb: lrwxrwxrwx 1 root root 9 Sep 12 10:23 pci-0000:03:00.0-scsi-0:1:0:1 -> ../../sdb

To summarize an IRC convo with @fgiunchedi, I will wait for the current scrub to complete, then online the node. Tomorrow we will try rebooting the host (an act of desperation), when he is around (in case things go poorly).

Mentioned in SAL [2016-09-12T23:33:31Z] <urandom> T144826: Restarting Cassandra on restbase2004-b.codfw.wmnet (scrub complete, re-joining cluster)

restbase2004 rebooted now, back up and all instances started, though no errors reported

Mentioned in SAL (#wikimedia-operations) [2016-09-13T16:17:05Z] <urandom> T144826: Removing compaction rate limit, increasing compactor threads (from 10 to 20), and beginning scrub of local_group_wikipedia_T_parsoid_html.data (restbase2004-b.codfw.wmnet)

Eevans claimed this task.

Given the rate at which corruption was occurring, it would seem that at this time, it has stopped.

That a reboot would "fix" this doesn't inspire confidence, but if the issue doesn't recur, I'm not sure what else there is to do, so I will close this issue (and go knock on some wood).