
Disk errors: restbase1010.eqiad.wmnet
Closed, Resolved · Public

Description

restbase1010-c.eqiad.wmnet has self-terminated due to an IO error:

ERROR [CompactionExecutor:29856] 2017-08-28 19:51:11,714 CassandraDaemon.java:185 - Exception in thread Thread[CompactionExecutor:29856,1,main]
org.apache.cassandra.io.FSWriteError: java.io.IOException: Input/output error
    at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:260) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.io.util.SequentialWriter.syncInternal(SequentialWriter.java:269) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.io.util.SequentialWriter.sync(SequentialWriter.java:249) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:331) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:302) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:350) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:169) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.doPrepare(CompactionAwareWriter.java:79) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:169) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:179) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.finish(CompactionAwareWriter.java:89) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:196) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:74) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:256) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_141]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_141]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_141]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_141]
Caused by: java.io.IOException: Input/output error
    at sun.nio.ch.FileDispatcherImpl.force0(Native Method) ~[na:1.8.0_141]
    at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76) ~[na:1.8.0_141]
    at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388) ~[na:1.8.0_141]
    at org.apache.cassandra.utils.SyncUtil.force(SyncUtil.java:142) ~[apache-cassandra-2.2.6.jar:2.2.6]
    at org.apache.cassandra.io.util.SequentialWriter.syncDataOnlyInternal(SequentialWriter.java:256) ~[apache-cassandra-2.2.6.jar:2.2.6]
    ... 20 common frames omitted
ERROR [CompactionExecutor:29856] 2017-08-28 19:51:11,722 StorageService.java:467 - Stopping gossiper
WARN [CompactionExecutor:29856] 2017-08-28 19:51:11,723 StorageService.java:373 - Stopping gossip by operator request
INFO [CompactionExecutor:29856] 2017-08-28 19:51:11,723 Gossiper.java:1448 - Announcing shutdown
INFO [CompactionExecutor:29856] 2017-08-28 19:51:11,724 StorageService.java:1937 - Node /10.64.0.116 state jump to shutdown
ERROR [CompactionExecutor:29856] 2017-08-28 19:51:13,729 StorageService.java:477 - Stopping native transport
INFO [CompactionExecutor:29856] 2017-08-28 19:51:13,865 Server.java:218 - Stop listening for CQL clients
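
The "Stopping gossiper" / "Stopping native transport" sequence above is what Cassandra does when an FSWriteError trips its disk failure policy: with the default "stop" policy, gossip and the client transports are shut down but the JVM keeps running. A minimal check of how this instance is configured, assuming the stock config path; the per-instance layout on these multi-instance hosts (the "-c" instance) may place cassandra.yaml elsewhere, so treat the path as an assumption:

    # Show how this instance reacts to filesystem/commitlog errors; "stop"
    # (the default) matches the shutdown sequence in the log above.
    grep -E '^(disk_failure_policy|commit_failure_policy)' /etc/cassandra/cassandra.yaml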

dmesg says:

[4689412.160666] sd 0:1:0:2: [sdc] tag#48 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4689412.160684] sd 0:1:0:2: [sdc] tag#48 Sense Key : Medium Error [current]
[4689412.160690] sd 0:1:0:2: [sdc] tag#48 Add. Sense: Unrecovered read error
[4689412.160695] sd 0:1:0:2: [sdc] tag#48 CDB: Write(10) 2a 00 12 88 49 50 00 02 00 00
[4689412.160700] blk_update_request: critical medium error, dev sdc, sector 310921552
[4689412.195726] EXT4-fs warning (device dm-0): ext4_end_bio:314: I/O error -61 writing to inode 82116763 (offset 136749056 size 5857280 starting block 156319594)
[4689412.195731] Buffer I/O error on device dm-0, logical block 156319594
[4689412.226440] Buffer I/O error on device dm-0, logical block 156319595
[4689412.256113] Buffer I/O error on device dm-0, logical block 156319596
[4689412.285882] Buffer I/O error on device dm-0, logical block 156319597
[4689412.315504] Buffer I/O error on device dm-0, logical block 156319598
[4689412.344913] Buffer I/O error on device dm-0, logical block 156319599
[4689412.374295] Buffer I/O error on device dm-0, logical block 156319600
[4689412.404149] Buffer I/O error on device dm-0, logical block 156319601
[4689412.433655] Buffer I/O error on device dm-0, logical block 156319602
[4689412.463053] Buffer I/O error on device dm-0, logical block 156319603
[4689412.602380] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689417.870956] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689427.742490] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689432.631809] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689437.746715] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689442.639427] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689447.753478] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689452.632068] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689457.732869] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689461.383128] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689467.662513] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689469.197757] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689471.314641] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689473.588520] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689478.755567] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689483.454820] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689488.835768] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689491.350983] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689495.504508] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689497.040346] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689501.323144] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689501.780883] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689503.454151] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689507.017450] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689511.157825] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689513.462279] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689518.475612] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689523.502486] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689526.479120] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689526.620319] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689530.101278] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689533.466443] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689535.687951] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689540.121772] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689545.126533] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689548.339910] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689555.130860] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689555.677388] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689560.151387] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689565.111847] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689565.678216] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689571.472751] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689575.677467] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689576.465652] JBD2: Detected IO errors while flushing file data on dm-0-8
[4689581.467940] JBD2: Detected IO errors while flushing file data on dm-0-8
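
For what it's worth, a quick way to confirm the failing medium from the OS side, assuming the hpssacli CLI used later in this task and guessing that sdc's physical drive sits at index 2 on the controller (both to be verified on the host itself):

    # Per-drive status as seen by the Smart Array controller
    hpssacli ctrl slot=0 pd all show status

    # SMART data for the physical disk behind /dev/sdc; the cciss index (",2")
    # is a guess and must match the drive's position on the controller
    smartctl -a -d cciss,2 /dev/sdc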

Event Timeline

Eevans created this task. Aug 28 2017, 8:23 PM
Restricted Application added a subscriber: Aklapper. Aug 28 2017, 8:23 PM
Eevans triaged this task as High priority. Aug 28 2017, 8:23 PM
Volans added a subscriber: Volans. Aug 28 2017, 9:02 PM

Adding ops-eqiad; it looks like we'll probably end up replacing the disk.

We needed to decommission a node in rack 'a' as part of T169939: End of August milestone: Cassandra 3 cluster in production. That was going to be restbase1007 (for consistency's sake), but restbase1010 has been decommissioned instead. It can be taken down to troubleshoot or replace sdc at any time.
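
For reference, a rough sketch of the Cassandra side of taking an instance out of the ring before the disk is pulled, using stock nodetool; the per-instance wrapper used on these multi-instance hosts may differ, so treat the exact invocation as an assumption:

    # Check ring state and confirm which node/instance is about to be removed
    nodetool status

    # Stream this instance's data to the rest of the ring and leave it;
    # run on the instance being decommissioned
    nodetool decommission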

@Cmjohnson Can you confirm whether we have spare Samsung drives for this in inventory? Do you have an ETA on replacement, so that we know how to plan on our end?

@Eevans no we do not. Do you want it fixed or a disk ordered? I may have misunderstood what you meant by decommission.

Cmjohnson added a subscriber: RobH. Aug 31 2017, 6:24 PM

Looping @RobH in to order a new disk.

No Samsung spares would be surprising, given our last conversation on the topic in April, and from what I remember about the stock back then.

> @Eevans no we do not. Do you want it fixed or a disk ordered? I may have misunderstood what you meant by decommission.

It was decommissioned primarily because it had a bad drive, but when we recommission it, it will be into a different cluster (on which T169939 depends).

RobH added a comment. Sep 5 2017, 5:08 PM

Background:

Restbase1010 was ordered on T126049, which did NOT include SSDs. Instead, the non-standard Samsung SSDs were pulled from existing restbase hosts restbase100[1-6] for these 6 new hosts, restbase10(0[7-9]|1[0-2]).

@Cmjohnson: The spares tracking sheet shows the following Samsung SSDs, with 3 of the 5 assigned. Is this sheet incorrect? If so, please update it and we can order more disks.

SSD Samsung Samsung SSD 850 PRO MZ-7KE1T0BW 1024 5 5 New 3 assigned to T126638

These show as 1 TB SSDs, which is what is needed for this replacement, and the sheet shows 2 spare.

Cmjohnson added a comment (edited). Sep 5 2017, 6:54 PM

I do have spares... sorry, I forgot about those. Samsungs are not standard disks for us.

RobH added a comment. Sep 5 2017, 7:07 PM

So here is the thing with this particular error: while there are I/O errors in the output above, the RAID controller actually reports the disk as fine.

> ctrl slot=0 ld 3 show

Smart Array P440ar in Slot 0 (Embedded)

array C

   Logical Drive: 3
      Size: 953.8 GB
      Fault Tolerance: 0
      Heads: 255
      Sectors Per Track: 32
      Cylinders: 65535
      Strip Size: 256 KB
      Full Stripe Size: 256 KB
      Status: OK
      MultiDomain Status: OK
      Caching:  Disabled
      Unique Identifier: 600508B1001CEA7F70B3BDF0F735E251
      Disk Name: /dev/sdc 
      Mount Points: None
      Logical Drive Label: 087B0BBAPDNLH0BRH9Y3L2D4FC
      Drive Type: Data
      LD Acceleration Method: HP SSD Smart Path

These don't use a typical RAID setup; instead, each single SSD seems to have been placed into its own RAID 0 logical drive. Will RESTBase handle the loss of a disk just fine, or do we need to have this server depooled before replacement?

Additionally, the errors in the original task description list sdc, and this logical drive shows up on the RAID controller as /dev/sdc. None of the disks on the RAID controller report errors, though.
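
A possible cross-check, assuming the same controller slot as in the output above, is to ask the controller which physical drive backs logical drive 3 and how it reports each drive's health:

    # Show logical drive 3 in detail, including the physical drive behind it
    hpssacli ctrl slot=0 ld 3 show detail

    # Controller's view of each physical drive
    hpssacli ctrl slot=0 pd all show detail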

Cmjohnson removed a project: ops-eqiad.

Replaced the disk with an on-site spare. Verified that the disk in slot 2 was /dev/sdc; once I pulled it out of the server, the RAID config showed /dev/sdc1 as failed. Assigning to @fgiunchedi to re-image, and removing the ops-eqiad tag.
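
Before the re-image, the replacement drive presumably needs to be wrapped in a single-disk RAID 0 logical drive again to match the existing layout. A sketch, where the port:box:bay address (1I:1:3) is a placeholder that has to be read off the controller first:

    # Find the replacement drive's port:box:bay address
    hpssacli ctrl slot=0 pd all show

    # Recreate the single-disk RAID 0 logical drive on it
    hpssacli ctrl slot=0 create type=ld drives=1I:1:3 raid=0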

Eevans closed this task as Resolved. Sep 6 2017, 4:08 PM