
sinistra - RAID failure
Closed, Resolved (Public)

Description

Service: RAID
On Host: sinistra

CRITICAL: Active: 6, Working: 6, Failed: 2, Spare: 0

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=sinistra&service=RAID

Event Timeline

sinistra is not in service yet, so this isn't affecting anything right now, but it is going to be the new MW logging host for codfw (see T128796).

Dzahn triaged this task as Medium priority. May 3 2016, 4:33 AM
[4610067.148387] md: using 128k window, over a total of 7716112384k.
[4629270.170823] Process accounting resumed
[4649611.578176] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649611.585573] ata4.00: irq_stat 0x40000001
[4649611.590158] ata4.00: failed command: READ DMA EXT
[4649611.595616] ata4.00: cmd 25/00:00:00:64:7a/00:0c:cf:01:00/e0 tag 11 dma 1572864 in
         res 51/40:af:4c:6a:7a/00:05:cf:01:00/e0 Emask 0x9 (media error)
[4649611.613095] ata4.00: status: { DRDY ERR }
[4649611.617769] ata4.00: error: { UNC }
[4649611.716012] ata4.00: configured for UDMA/133
[4649611.716054] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4649611.716058] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[4649611.716061] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[4649611.716062] sd 3:0:0:0: [sdd] CDB: 
[4649611.716064] Read(16): 88 00 00 00 00 01 cf 7a 64 00 00 00 0c 00 00 00
[4649611.716074] blk_update_request: I/O error, dev sdd, sector 7775873612
[4649611.723498] ata4: EH complete
[4649614.048131] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649614.055538] ata4.00: irq_stat 0x40000001
[4649614.060123] ata4.00: failed command: READ DMA EXT
[4649614.065577] ata4.00: cmd 25/00:80:00:70:7a/00:0c:cf:01:00/e0 tag 13 dma 1638400 in
         res 51/40:80:00:70:7a/00:0c:cf:01:00/e0 Emask 0x9 (media error)
[4649614.083067] ata4.00: status: { DRDY ERR }
[4649614.087741] ata4.00: error: { UNC }
[4649614.171604] ata4.00: configured for UDMA/133
[4649614.171647] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4649614.171651] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[4649614.171654] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[4649614.171656] sd 3:0:0:0: [sdd] CDB: 
[4649614.171658] Read(16): 88 00 00 00 00 01 cf 7a 70 00 00 00 0c 80 00 00
[4649614.171671] blk_update_request: I/O error, dev sdd, sector 7775875072
[4649614.179115] ata4: EH complete
[4649616.530110] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649616.537516] ata4.00: irq_stat 0x40000001
[4649616.542099] ata4.00: failed command: READ DMA EXT
[4649616.547558] ata4.00: cmd 25/00:80:80:7c:7a/00:0b:cf:01:00/e0 tag 16 dma 1507328 in
         res 51/40:ff:fb:82:7a/00:04:cf:01:00/e0 Emask 0x9 (media error)
[4649616.565046] ata4.00: status: { DRDY ERR }
[4649616.569719] ata4.00: error: { UNC }
[4649616.648232] ata4.00: configured for UDMA/133
[4649616.648274] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4649616.648280] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[4649616.648284] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[4649616.648287] sd 3:0:0:0: [sdd] CDB: 
[4649616.648290] Read(16): 88 00 00 00 00 01 cf 7a 7c 80 00 00 0b 80 00 00
[4649616.648308] blk_update_request: I/O error, dev sdd, sector 7775879931
[4649616.655747] ata4: EH complete
[4649618.980075] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649618.987477] ata4.00: irq_stat 0x40000001
[4649618.992059] ata4.00: failed command: READ DMA EXT
[4649618.997515] ata4.00: cmd 25/00:00:00:88:7a/00:07:cf:01:00/e0 tag 19 dma 917504 in
         res 51/40:00:00:88:7a/00:07:cf:01:00/e0 Emask 0x9 (media error)
[4649619.014904] ata4.00: status: { DRDY ERR }
[4649619.019577] ata4.00: error: { UNC }
[4649619.100129] ata4.00: configured for UDMA/133
[4649619.100158] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4649619.100162] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[4649619.100166] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[4649619.100168] sd 3:0:0:0: [sdd] CDB: 
[4649619.100170] Read(16): 88 00 00 00 00 01 cf 7a 88 00 00 00 07 00 00 00
[4649619.100183] blk_update_request: I/O error, dev sdd, sector 7775881216
[4649619.107607] ata4: EH complete
[4649622.819117] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649622.826519] ata4.00: irq_stat 0x40000001
[4649622.831095] ata4.00: failed command: READ DMA EXT
[4649622.836553] ata4.00: cmd 25/00:80:00:8f:7a/00:0c:cf:01:00/e0 tag 22 dma 1638400 in
         res 51/40:5f:1c:8f:7a/00:0c:cf:01:00/e0 Emask 0x9 (media error)
[4649622.854042] ata4.00: status: { DRDY ERR }
[4649622.858714] ata4.00: error: { UNC }
[4649622.936432] ata4.00: configured for UDMA/133
[4649622.936471] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4649622.936475] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[4649622.936479] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[4649622.936481] sd 3:0:0:0: [sdd] CDB: 
[4649622.936483] Read(16): 88 00 00 00 00 01 cf 7a 8f 00 00 00 0c 80 00 00
[4649622.936496] blk_update_request: I/O error, dev sdd, sector 7775883036
[4649622.943934] ata4: EH complete
[4649627.899184] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649627.906588] ata4.00: irq_stat 0x40000001
[4649627.911174] ata4.00: failed command: READ DMA EXT
[4649627.916625] ata4.00: cmd 25/00:00:80:9b:7a/00:0e:cf:01:00/e0 tag 25 dma 1835008 in
         res 51/40:9f:d6:9e:7a/00:0a:cf:01:00/e0 Emask 0x9 (media error)
[4649627.934110] ata4.00: status: { DRDY ERR }
[4649627.938784] ata4.00: error: { UNC }
[4649628.023511] ata4.00: configured for UDMA/133
[4649628.023562] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[4649628.023568] sd 3:0:0:0: [sdd] Sense Key : Medium Error [current] [descriptor]
[4649628.023573] sd 3:0:0:0: [sdd] Add. Sense: Unrecovered read error - auto reallocate failed
[4649628.023577] sd 3:0:0:0: [sdd] CDB: 
[4649628.023579] Read(16): 88 00 00 00 00 01 cf 7a 9b 80 00 00 0e 00 00 00
[4649628.023598] blk_update_request: I/O error, dev sdd, sector 7775887062
[4649628.031045] ata4: EH complete
[4649636.654149] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649636.661557] ata4.00: irq_stat 0x40000001
[4649636.666141] ata4.00: failed command: WRITE DMA EXT
[4649636.671695] ata4.00: cmd 35/00:00:00:6a:7a/00:0c:cf:01:00/e0 tag 19 dma 1572864 out
         res 51/04:01:00:6a:7a/00:00:cf:01:00/e0 Emask 0x1 (device error)
[4649636.689367] ata4.00: status: { DRDY ERR }
[4649636.694040] ata4.00: error: { ABRT }
[4649636.700207] ata4.00: configured for UDMA/133
[4649636.700222] ata4: EH complete
[4649644.712594] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649644.720003] ata4.00: irq_stat 0x40000001
[4649644.724587] ata4.00: failed command: WRITE DMA EXT
[4649644.730142] ata4.00: cmd 35/00:00:00:6a:7a/00:0c:cf:01:00/e0 tag 21 dma 1572864 out
         res 51/04:00:00:6a:7a/00:0c:cf:01:00/e0 Emask 0x1 (device error)
[4649644.747822] ata4.00: status: { DRDY ERR }
[4649644.752497] ata4.00: error: { ABRT }
[4649644.758570] ata4.00: configured for UDMA/133
[4649644.758585] ata4: EH complete
[4649658.399518] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649658.406925] ata4.00: irq_stat 0x40000001
[4649658.411510] ata4.00: failed command: WRITE DMA EXT
[4649658.417065] ata4.00: cmd 35/00:80:80:82:7a/00:08:cf:01:00/e0 tag 30 dma 1114112 out
         res 51/04:01:80:82:7a/00:00:cf:01:00/e0 Emask 0x1 (device error)
[4649658.434746] ata4.00: status: { DRDY ERR }
[4649658.439419] ata4.00: error: { ABRT }
[4649658.445481] ata4.00: configured for UDMA/133
[4649658.445498] ata4: EH complete
[4649666.457927] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649666.465335] ata4.00: irq_stat 0x40000001
[4649666.469919] ata4.00: failed command: WRITE DMA EXT
[4649666.475475] ata4.00: cmd 35/00:80:80:82:7a/00:08:cf:01:00/e0 tag 2 dma 1114112 out
         res 51/04:80:80:82:7a/00:08:cf:01:00/e0 Emask 0x1 (device error)
[4649666.493058] ata4.00: status: { DRDY ERR }
[4649666.497731] ata4.00: error: { ABRT }
[4649666.503794] ata4.00: configured for UDMA/133
[4649666.503810] ata4: EH complete
[4649674.516369] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649674.523765] ata4.00: irq_stat 0x40000001
[4649674.528346] ata4.00: failed command: WRITE DMA EXT
[4649674.533902] ata4.00: cmd 35/00:80:80:82:7a/00:08:cf:01:00/e0 tag 5 dma 1114112 out
         res 51/04:80:80:82:7a/00:08:cf:01:00/e0 Emask 0x1 (device error)
[4649674.551486] ata4.00: status: { DRDY ERR }
[4649674.556158] ata4.00: error: { ABRT }
[4649674.562205] ata4.00: configured for UDMA/133
[4649674.562219] ata4: EH complete
[4649686.594005] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[4649686.601412] ata4.00: irq_stat 0x40000001
[4649686.605996] ata4.00: failed command: WRITE DMA EXT
[4649686.611552] ata4.00: cmd 35/00:00:00:8b:7a/00:0b:cf:01:00/e0 tag 11 dma 1441792 out
         res 51/04:01:00:8b:7a/00:00:cf:01:00/e0 Emask 0x1 (device error)
[4649686.629234] ata4.00: status: { DRDY ERR }
[4649686.633907] ata4.00: error: { ABRT }
[4649686.639963] ata4.00: configured for UDMA/133
[4649686.639978] ata4: EH complete
[4649694.652434] ata4: limiting SATA link speed to 3.0 Gbps
[4649694.652441] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
[4649694.659848] ata4.00: irq_stat 0x40000001
[4649694.664427] ata4.00: failed command: WRITE DMA EXT
[4649694.669980] ata4.00: cmd 35/00:00:00:8b:7a/00:0b:cf:01:00/e0 tag 14 dma 1441792 out
         res 51/04:00:00:8b:7a/00:0b:cf:01:00/e0 Emask 0x1 (device error)
[4649694.687661] ata4.00: status: { DRDY ERR }
[4649694.692333] ata4.00: error: { ABRT }
[4649694.696525] ata4: hard resetting link
[4649700.060702] ata4: link is slow to respond, please be patient (ready=0)
[4649704.712390] ata4: COMRESET failed (errno=-16)
[4649704.717459] ata4: hard resetting link
[4649710.080674] ata4: link is slow to respond, please be patient (ready=0)
[4649714.732401] ata4: COMRESET failed (errno=-16)
[4649714.737467] ata4: hard resetting link
[4649720.100683] ata4: link is slow to respond, please be patient (ready=0)
[4649749.804363] ata4: COMRESET failed (errno=-16)
[4649749.809434] ata4: limiting SATA link speed to 1.5 Gbps
[4649749.809437] ata4: hard resetting link
[4649754.836394] ata4: COMRESET failed (errno=-16)
[4649754.841464] ata4: reset failed, giving up
[4649754.846144] ata4.00: disabled
[4649754.846159] ata4: EH complete
[4649754.846258] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.846261] sd 3:0:0:0: [sdd] CDB: 
[4649754.846264] Write(16): 8a 00 00 00 00 01 cf 7a 8b 00 00 00 0b 00 00 00
[4649754.846277] blk_update_request: I/O error, dev sdd, sector 7775881984
[4649754.853715] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.853717] sd 3:0:0:0: [sdd] CDB: 
[4649754.853718] Write(16): 8a 00 00 00 00 01 cf 7a 96 00 00 00 05 80 00 00
[4649754.853734] blk_update_request: I/O error, dev sdd, sector 7775884800
[4649754.853815] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.853817] sd 3:0:0:0: [sdd] CDB: 
[4649754.853824] Write(16): 8a 00 00 00 00 00 05 d2 30 10 00 00 00 08 00 00
[4649754.853826] blk_update_request: I/O error, dev sdd, sector 97660944
[4649754.853828] md: super_written gets error=-5, uptodate=0
[4649754.853831] md/raid10:md1: Disk failure on sdd3, disabling device.
md/raid10:md1: Operation continuing on 3 devices.
[4649754.853862] sd 3:0:0:0: [sdd] Read Capacity(16) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.853864] sd 3:0:0:0: [sdd] Sense not available.
[4649754.853883] sd 3:0:0:0: [sdd] Read Capacity(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.853884] sd 3:0:0:0: [sdd] Sense not available.
[4649754.853914] sdd: detected capacity change from 4000787030016 to 0
[4649754.882051] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.882053] sd 3:0:0:0: [sdd] CDB: 
[4649754.882054] Write(16): 8a 00 00 00 00 01 cf 7a 9e 80 00 00 0b 00 00 00
[4649754.882065] blk_update_request: I/O error, dev sdd, sector 7775886976
[4649754.889479] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.889481] sd 3:0:0:0: [sdd] CDB: 
[4649754.889482] Write(16): 8a 00 00 00 00 00 01 d7 9c 88 00 00 00 10 00 00
[4649754.889509] blk_update_request: I/O error, dev sdd, sector 30907528
[4649754.889564] md: md1: data-check interrupted.
[4649754.896713] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.896715] sd 3:0:0:0: [sdd] CDB: 
[4649754.896716] Write(16): 8a 00 00 00 00 00 02 e3 98 00 00 00 00 10 00 00
[4649754.896727] blk_update_request: I/O error, dev sdd, sector 48470016
[4649754.896946] md: super_written gets error=-5, uptodate=0
[4649754.896950] md/raid10:md0: Disk failure on sdd2, disabling device.
md/raid10:md0: Operation continuing on 3 devices.
[4649754.896954] md: super_written gets error=-5, uptodate=0
[4649754.917626] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.917628] sd 3:0:0:0: [sdd] CDB: 
[4649754.917629] Write(16): 8a 00 00 00 00 00 01 c8 c6 40 00 00 00 08 00 00
[4649754.917639] blk_update_request: I/O error, dev sdd, sector 29935168
[4649754.924841] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.924843] sd 3:0:0:0: [sdd] CDB: 
[4649754.924844] Write(16): 8a 00 00 00 00 00 01 ca 45 48 00 00 00 08 00 00
[4649754.924855] blk_update_request: I/O error, dev sdd, sector 30033224
[4649754.932056] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.932059] sd 3:0:0:0: [sdd] CDB: 
[4649754.932060] Write(16): 8a 00 00 00 00 00 01 ca 47 d0 00 00 00 08 00 00
[4649754.932070] blk_update_request: I/O error, dev sdd, sector 30033872
[4649754.939273] sd 3:0:0:0: [sdd] FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[4649754.939275] sd 3:0:0:0: [sdd] CDB: 
[4649754.939276] Write(16): 8a 00 00 00 00 00 01 cf 40 f0 00 00 00 08 00 00
[4649754.939287] blk_update_request: I/O error, dev sdd, sector 30359792
[4649754.939315] ata4: exception Emask 0x1 SAct 0x0 SErr 0x0 action 0x0
[4649754.939320] ata4: irq_stat 0x40000001
[4649755.077858] RAID10 conf printout:
[4649755.077862]  --- wd:3 rd:4
[4649755.077866]  disk 0, wo:0, o:1, dev:sda2
[4649755.077869]  disk 1, wo:0, o:1, dev:sdb2
[4649755.077871]  disk 2, wo:0, o:1, dev:sdc2
[4649755.077872]  disk 3, wo:1, o:0, dev:sdd2
[4649755.085190] RAID10 conf printout:
[4649755.085195]  --- wd:3 rd:4
[4649755.085198]  disk 0, wo:0, o:1, dev:sda2
[4649755.085201]  disk 1, wo:0, o:1, dev:sdb2
[4649755.085203]  disk 2, wo:0, o:1, dev:sdc2
[4649755.927829] RAID10 conf printout:
[4649755.927834]  --- wd:3 rd:4
[4649755.927838]  disk 0, wo:0, o:1, dev:sda3
[4649755.927840]  disk 1, wo:0, o:1, dev:sdb3
[4649755.927842]  disk 2, wo:0, o:1, dev:sdc3
[4649755.927844]  disk 3, wo:1, o:0, dev:sdd3
[4649755.989321] RAID10 conf printout:
[4649755.989326]  --- wd:3 rd:4
[4649755.989330]  disk 0, wo:0, o:1, dev:sda3
[4649755.989332]  disk 1, wo:0, o:1, dev:sdb3
[4649755.989334]  disk 2, wo:0, o:1, dev:sdc3
[4715739.054678] Process accounting resumed
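
The sector numbers in the log above can be cross-checked against the SCSI command bytes the kernel prints. As a minimal sketch (the helper name `decode_rw16` is hypothetical and not something used in this task), the Python below decodes a Read(16) or Write(16) CDB into its starting LBA and transfer length, assuming the standard 16-byte layout: opcode, flags, 8-byte big-endian LBA, 4-byte big-endian sector count.

```python
def decode_rw16(cdb_hex):
    """Decode a SCSI READ(16)/WRITE(16) CDB given as space-separated hex bytes.

    Returns (operation, starting LBA, number of sectors).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")
    opcodes = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    if cdb[0] not in opcodes:
        raise ValueError("not a READ(16)/WRITE(16) opcode: 0x%02x" % cdb[0])
    lba = int.from_bytes(cdb[2:10], "big")       # bytes 2..9: starting LBA
    sectors = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: transfer length
    return opcodes[cdb[0]], lba, sectors


# First Read(16) in the log above.
op, lba, sectors = decode_rw16("88 00 00 00 00 01 cf 7a 64 00 00 00 0c 00 00 00")
print(op, lba, sectors)
# The sector reported by blk_update_request should lie inside the request.
print(lba <= 7775873612 < lba + sectors)
```

For that first Read(16) the request starts at sector 7775872000 and covers 3072 sectors, so the reported bad sector 7775873612 falls inside it, and 3072 sectors of 512 bytes matches the `dma 1572864` figure on the same command.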

"..Disk failure on sdd3".. & ".. Disk failure on sdd2"

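To pull those two lines out of a long log without reading it by eye, something like the following could be used. It is only a sketch: the function name `failed_members` is made up for illustration, and the pattern is modeled on the exact wording of the two md messages in this log, which other kernel versions may phrase differently.

```python
import re

# Matches lines such as:
#   md/raid10:md1: Disk failure on sdd3, disabling device.
MD_FAILURE = re.compile(
    r"md/raid\d+:(?P<array>md\d+): Disk failure on (?P<member>\w+), disabling device\."
)


def failed_members(log_text):
    """Return (array, member) pairs for every md disk-failure line, in log order."""
    return [(m["array"], m["member"]) for m in MD_FAILURE.finditer(log_text)]


# Two lines copied from the log above.
sample = """\
[4649754.853831] md/raid10:md1: Disk failure on sdd3, disabling device.
[4649754.896950] md/raid10:md0: Disk failure on sdd2, disabling device.
"""
print(failed_members(sample))
```

Both failed members are partitions of the same physical disk, sdd, which is why a single drive replacement covers both arrays.
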
Sinistra is under warranty until 2016-03-02. @Papaul can get Dell to dispatch a replacement disk.

Dell Customer Communication

Hi Papaul,

I’ve just submitted dispatch # 318524567 for this hard drive to arrive tomorrow. Let me know when you get it and can confirm the issue is resolved.

Thanks,

Papaul reassigned this task from Papaul to Dzahn.
Papaul added a subscriber: Papaul.

@Dzahn Drive replacement complete.

faidon added a subscriber: faidon.

The disk may have been replaced, but it wasn't partitioned/re-added to RAID, so the original request (RAID failure) is still not resolved. @Dzahn, will you handle?

@RobH will handle, he already started with the RAID rebuild.

The new disk isn't showing in the software. @Papaul: Is the new disk showing a green LED for power, and if not can it be reseated?

Please advise, and don't close this task until we have the disk back online. Once the disk is detected, I'll be able to ensure the RAID rebuilds.

Thanks!

@RobH the disk is inserted and showing a green light.

OK, I'll try rebooting and looking at it in the BIOS.

So I had to install gdisk tools, as I needed sgdisk to copy the GPT partitions. Then after the clone, I attempted to randomize the GUID of the NEW disk, but somehow did it for ALL the disks and messed up grub.

Since this is a new system that had not yet been pushed into service, it's just faster to reinstall it at this point.

So my attempt to install sgdisk and copy the partitions worked, but then my command to randomize the GUIDs of the new disk (needed since its partition table was copied from sdc) went wrong and randomized the GUIDs of every disk.

At that point, since there is no data to keep intact, it was simply easier to reinstall.
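
For reference, one way to make this kind of mix-up harder is to build the command list first and check the target before anything runs. The sketch below is not the procedure used on sinistra (the task does not record the actual commands). The function name `plan_disk_replacement` is hypothetical, the source disk sdc comes from the comment above, and the target sdd and the md0/sdd2, md1/sdd3 pairing are assumptions taken from the kernel log; a replacement disk may enumerate under a different name. It relies on `sgdisk --replicate` and `--randomize-guids` and on `mdadm --manage ... --add`, and it only prints the commands.

```python
def plan_disk_replacement(source, target, members):
    """Build (but do not run) the commands to clone a GPT and re-add md members.

    source  -- healthy disk whose partition table is copied, e.g. /dev/sdc
    target  -- new, empty disk, e.g. /dev/sdd (assumed name)
    members -- mapping of md array -> partition on the target to add back
    """
    for dev in (source, target):
        if not dev.startswith("/dev/"):
            raise ValueError("not a device path: %r" % dev)
    if source == target:
        raise ValueError("source and target must be different disks")
    for array, part in members.items():
        if not part.startswith(target):
            raise ValueError("%s is not a partition of %s" % (part, target))

    commands = [
        # Copy the source disk's partition table onto the target.
        ["sgdisk", "--replicate=" + target, source],
        # Give the target new disk and partition GUIDs; only the target is named.
        ["sgdisk", "--randomize-guids", target],
    ]
    for array, part in members.items():
        commands.append(["mdadm", "--manage", array, "--add", part])
    return commands


plan = plan_disk_replacement(
    "/dev/sdc", "/dev/sdd",
    {"/dev/md0": "/dev/sdd2", "/dev/md1": "/dev/sdd3"},
)
for cmd in plan:
    print(" ".join(cmd))
```

Printing the plan and reviewing it before running anything keeps the destructive step, the GUID randomization, visibly scoped to a single device.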

System is reinstalled and back online.