
Broken disk on thanos-be1003 but not reported / task not opened
Closed, Resolved, Public

Description

Noticed this via puppet failing on thanos-be1003: sdf apparently failed, but the controller removed the VD (?), so no alert or task was issued.

Specifically for megaraid_sas driver:

Jun 27 01:50:00 thanos-be1003 kernel: [11723732.966123] megaraid_sas 0000:3b:00.0: 2214 (678073783s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 05/5
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.966512] megaraid_sas 0000:3b:00.0: 2215 (678073783s/0x0001/FATAL) - VD 05/5 is now OFFLINE
Jun 27 02:09:38 thanos-be1003 kernel: [11724910.615545] megaraid_sas 0000:3b:00.0: 2243 (678074963s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 3

And I can't find traces of VD 5 from megacli

# megacli -LdPdInfo -aAll | grep Virtual
Number of Virtual Disks: 13
Virtual Drive: 0 (Target Id: 0)
Virtual Drive: 1 (Target Id: 1)
Virtual Drive: 2 (Target Id: 2)
Virtual Drive: 3 (Target Id: 3)
Virtual Drive: 4 (Target Id: 4)
Virtual Drive: 6 (Target Id: 6)
Virtual Drive: 7 (Target Id: 7)
Virtual Drive: 8 (Target Id: 8)
Virtual Drive: 9 (Target Id: 9)
Virtual Drive: 10 (Target Id: 10)
Virtual Drive: 11 (Target Id: 11)
Virtual Drive: 12 (Target Id: 12)
Virtual Drive: 13 (Target Id: 13)

And the PD is seemingly gone:

# megacli -PDList -aALL | grep -i 'firmware state'
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
# megacli -PDList -aALL | grep -i 'firmware state' | wc -l
13
# should be 14 PDs

In other words, both the PD and the VD are gone: the controller has fully lost track of them. I don't remember offhand whether we've seen this type of failure before, but it is obviously worrying when disks can disappear from the controller's reporting.
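
The gap in the VD listing above can also be spotted mechanically. The following is a minimal sketch, not an existing check: it assumes VD numbers are normally contiguous starting at 0, and the function name `missing_virtual_drives` is made up for illustration. It parses the same `Virtual Drive: N (Target Id: N)` lines that the grep above selects.

```python
import re
from typing import List


def missing_virtual_drives(megacli_output: str) -> List[int]:
    """Return VD numbers absent from the range 0..max seen in the output.

    Assumption: VDs are numbered contiguously from 0, so a hole in the
    sequence means the controller dropped a VD.
    """
    seen = {
        int(match.group(1))
        for match in re.finditer(
            r"^Virtual Drive: (\d+)", megacli_output, re.MULTILINE
        )
    }
    if not seen:
        return []
    return [n for n in range(max(seen) + 1) if n not in seen]


# Sample input rebuilt from the listing above (VD 5 is absent).
sample = "\n".join(
    f"Virtual Drive: {n} (Target Id: {n})" for n in range(14) if n != 5
)
print(missing_virtual_drives(sample))
```

Note that this only catches holes in the middle of the sequence; if the highest-numbered VD disappeared, the listing would still look contiguous.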

Full dmesg:
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.960903] sd 0:2:5:0: [sdf] tag#19 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.960911] sd 0:2:5:0: [sdf] tag#19 CDB: Write(16) 8a 00 00 00 00 00 00 00 5c 78 00 00 00 08 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.960915] print_req_error: I/O error, dev sdf, sector 23672
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961293] sd 0:2:5:0: [sdf] tag#95 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961628] sd 0:2:5:0: [sdf] tag#478 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961635] sd 0:2:5:0: [sdf] tag#478 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 02 00 00 00 01 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961639] print_req_error: I/O error, dev sdf, sector 2050
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961684] sd 0:2:5:0: [sdf] tag#479 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961687] sd 0:2:5:0: [sdf] tag#479 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 18 00 00 00 08 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961689] print_req_error: I/O error, dev sdf, sector 2072
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961698] sd 0:2:5:0: [sdf] tag#480 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961700] sd 0:2:5:0: [sdf] tag#480 CDB: Write(16) 8a 00 00 00 00 00 00 00 09 40 00 00 00 08 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961702] print_req_error: I/O error, dev sdf, sector 2368
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961709] sd 0:2:5:0: [sdf] tag#481 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961711] sd 0:2:5:0: [sdf] tag#481 CDB: Write(16) 8a 00 00 00 00 00 00 00 26 d0 00 00 00 08 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961713] print_req_error: I/O error, dev sdf, sector 9936
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961721] sd 0:2:5:0: [sdf] tag#482 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961723] sd 0:2:5:0: [sdf] tag#482 CDB: Write(16) 8a 00 00 00 00 00 00 00 2a 00 00 00 00 20 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961725] print_req_error: I/O error, dev sdf, sector 10752
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961733] sd 0:2:5:0: [sdf] tag#483 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961735] sd 0:2:5:0: [sdf] tag#483 CDB: Write(16) 8a 00 00 00 00 00 00 00 45 e0 00 00 00 20 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961737] print_req_error: I/O error, dev sdf, sector 17888
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961743] sd 0:2:5:0: [sdf] tag#484 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961746] sd 0:2:5:0: [sdf] tag#484 CDB: Write(16) 8a 00 00 00 00 00 00 00 e2 f8 00 00 00 08 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961747] print_req_error: I/O error, dev sdf, sector 58104
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961755] sd 0:2:5:0: [sdf] tag#485 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961758] sd 0:2:5:0: [sdf] tag#485 CDB: Write(16) 8a 00 00 00 00 00 00 01 ae c0 00 00 00 20 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961759] print_req_error: I/O error, dev sdf, sector 110272
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961765] print_req_error: I/O error, dev sdf, sector 192352
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.961819] XFS (sdf1): metadata I/O error in "xfs_buf_iodone_callback_error" at daddr 0x2 len 1 error 5
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.965967] megaraid_sas 0000:3b:00.0: scanning for scsi0...
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.966123] megaraid_sas 0000:3b:00.0: 2214 (678073783s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 05/5
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.966512] megaraid_sas 0000:3b:00.0: 2215 (678073783s/0x0001/FATAL) - VD 05/5 is now OFFLINE
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.967038] XFS (sdf1): writeback error on sector 23680
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.972859] sd 0:2:5:0: [sdf] tag#95 CDB: Write(16) 8a 00 00 00 00 00 e8 fe 66 8a 00 00 00 0c 00 00
Jun 27 01:50:00 thanos-be1003 kernel: [11723732.973084] XFS (sdf1): metadata I/O error in "xlog_iodone" at daddr 0xe8fe5e8a len 64 error 5
Jun 27 01:50:00 thanos-be1003 kernel: [11723733.044996] XFS (sdf1): xfs_do_force_shutdown(0x2) called from line 1271 of file fs/xfs/xfs_log.c.  Return address = 0000000030796933
Jun 27 01:50:00 thanos-be1003 kernel: [11723733.045016] XFS (sdf1): Log I/O Error Detected.  Shutting down filesystem
Jun 27 01:50:00 thanos-be1003 kernel: [11723733.052115] XFS (sdf1): Please umount the filesystem and rectify the problem(s)
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.344729] scsi_io_completion_action: 84 callbacks suppressed
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.344734] sd 0:2:5:0: [sdf] tag#374 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.344738] sd 0:2:5:0: [sdf] tag#374 CDB: Read(16) 88 00 00 00 00 01 d1 af f7 80 00 00 00 08 00 00
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.344739] print_req_error: 84 callbacks suppressed
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.344740] print_req_error: I/O error, dev sdf, sector 7812937600
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.351221] sd 0:2:5:0: [sdf] tag#374 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.351222] sd 0:2:5:0: [sdf] tag#374 CDB: Read(16) 88 00 00 00 00 01 d1 af f7 80 00 00 00 01 00 00
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.351223] print_req_error: I/O error, dev sdf, sector 7812937600
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.357678] Buffer I/O error on dev sdf1, logical block 7812935552, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.365606] sd 0:2:5:0: [sdf] tag#375 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.365607] sd 0:2:5:0: [sdf] tag#375 CDB: Read(16) 88 00 00 00 00 01 d1 af f7 81 00 00 00 01 00 00
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.365608] print_req_error: I/O error, dev sdf, sector 7812937601
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.372071] Buffer I/O error on dev sdf1, logical block 7812935553, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.380004] sd 0:2:5:0: [sdf] tag#376 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.380005] sd 0:2:5:0: [sdf] tag#376 CDB: Read(16) 88 00 00 00 00 01 d1 af f7 82 00 00 00 01 00 00
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.380006] print_req_error: I/O error, dev sdf, sector 7812937602
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.386444] Buffer I/O error on dev sdf1, logical block 7812935554, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.394393] sd 0:2:5:0: [sdf] tag#375 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.394396] sd 0:2:5:0: [sdf] tag#375 CDB: Read(16) 88 00 00 00 00 01 d1 af f7 83 00 00 00 05 00 00
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.394398] print_req_error: I/O error, dev sdf, sector 7812937603
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.400854] Buffer I/O error on dev sdf1, logical block 7812935555, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.408776] Buffer I/O error on dev sdf1, logical block 7812935556, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.416716] Buffer I/O error on dev sdf1, logical block 7812935557, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.424626] Buffer I/O error on dev sdf1, logical block 7812935558, async page read
Jun 27 01:54:20 thanos-be1003 kernel: [11723993.432549] Buffer I/O error on dev sdf1, logical block 7812935559, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.221404] sd 0:2:5:0: [sdf] tag#624 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.221411] sd 0:2:5:0: [sdf] tag#624 CDB: Read(16) 88 00 00 00 00 00 00 00 08 00 00 00 00 80 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.221414] print_req_error: I/O error, dev sdf, sector 2048
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.227444] sd 0:2:5:0: [sdf] tag#40 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.227450] sd 0:2:5:0: [sdf] tag#40 CDB: Read(16) 88 00 00 00 00 00 00 00 08 00 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.227453] print_req_error: I/O error, dev sdf, sector 2048
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.233383] Buffer I/O error on dev sdf1, logical block 0, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.240596] sd 0:2:5:0: [sdf] tag#40 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.240607] sd 0:2:5:0: [sdf] tag#40 CDB: Read(16) 88 00 00 00 00 00 00 00 08 01 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.240610] print_req_error: I/O error, dev sdf, sector 2049
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.246544] Buffer I/O error on dev sdf1, logical block 1, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.253704] sd 0:2:5:0: [sdf] tag#40 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.253707] sd 0:2:5:0: [sdf] tag#40 CDB: Read(16) 88 00 00 00 00 00 00 00 08 02 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.253708] print_req_error: I/O error, dev sdf, sector 2050
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.255184] sd 0:2:5:0: [sdf] tag#343 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.259638] Buffer I/O error on dev sdf1, logical block 2, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.259646] sd 0:2:5:0: [sdf] tag#343 CDB: Read(16) 88 00 00 00 00 00 00 00 08 04 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.266779] print_req_error: I/O error, dev sdf, sector 2052
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.266784] sd 0:2:5:0: [sdf] tag#41 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.272707] sd 0:2:5:0: [sdf] tag#41 CDB: Read(16) 88 00 00 00 00 00 00 00 08 03 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.272710] Buffer I/O error on dev sdf1, logical block 4, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.272755] sd 0:2:5:0: [sdf] tag#343 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.279849] print_req_error: I/O error, dev sdf, sector 2051
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.279851] sd 0:2:5:0: [sdf] tag#343 CDB: Read(16) 88 00 00 00 00 00 00 00 08 05 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.285784] print_req_error: I/O error, dev sdf, sector 2053
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.285785] Buffer I/O error on dev sdf1, logical block 3, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.298834] Buffer I/O error on dev sdf1, logical block 5, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.305991] sd 0:2:5:0: [sdf] tag#344 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.305992] sd 0:2:5:0: [sdf] tag#344 CDB: Read(16) 88 00 00 00 00 00 00 00 08 06 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.305993] print_req_error: I/O error, dev sdf, sector 2054
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.311946] Buffer I/O error on dev sdf1, logical block 6, async page read
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.319084] sd 0:2:5:0: [sdf] tag#343 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.319085] sd 0:2:5:0: [sdf] tag#343 CDB: Read(16) 88 00 00 00 00 00 00 00 08 07 00 00 00 01 00 00
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.319086] print_req_error: I/O error, dev sdf, sector 2055
Jun 27 01:54:38 thanos-be1003 kernel: [11724011.324999] Buffer I/O error on dev sdf1, logical block 7, async page read
Jun 27 02:01:02 thanos-be1003 kernel: [11724395.006235] XFS (sdf1): Unmounting Filesystem
Jun 27 02:08:17 thanos-be1003 kernel: [11724829.949583] sd 0:2:5:0: SCSI device is removed
Jun 27 02:09:38 thanos-be1003 kernel: [11724910.615545] megaraid_sas 0000:3b:00.0: 2243 (678074963s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 3
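
For reading the `CDB:` lines above: in a 16-byte READ(16)/WRITE(16) command, bytes 2-9 hold the big-endian logical block address and bytes 10-13 the transfer length in blocks. A minimal decoding sketch follows (the helper name `decode_rw16_cdb` is made up for illustration). The decoded LBA lines up with the `sector` value in the matching `print_req_error` line, which is what one would expect if the device uses 512-byte logical blocks.

```python
from typing import Tuple


def decode_rw16_cdb(cdb_hex: str) -> Tuple[str, int, int]:
    """Decode a READ(16)/WRITE(16) CDB given as space-separated hex bytes.

    Returns (operation, starting LBA, transfer length in blocks).
    """
    raw = bytes.fromhex(cdb_hex)
    if len(raw) != 16 or raw[0] not in (0x88, 0x8A):
        raise ValueError("not a 16-byte READ(16)/WRITE(16) CDB")
    operation = "Read(16)" if raw[0] == 0x88 else "Write(16)"
    lba = int.from_bytes(raw[2:10], "big")      # bytes 2..9
    blocks = int.from_bytes(raw[10:14], "big")  # bytes 10..13
    return operation, lba, blocks


# First failed write in the log (tag#19), reported as sector 23672.
print(decode_rw16_cdb("8a 00 00 00 00 00 00 00 5c 78 00 00 00 08 00 00"))
# First failed read at 01:54:20 (tag#374), reported as sector 7812937600.
print(decode_rw16_cdb("88 00 00 00 00 01 d1 af f7 80 00 00 00 08 00 00"))
```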

Event Timeline

Restricted Application added a subscriber: Aklapper.

IIRC this already happened at least once. I failed to find the phab task right now, but IIRC one of the suggestions (might have been mine or not) was to add an additional check that counts the PVs and alerts if the count differs from what is expected.
The problem we had is that, AFAIK, we don't have the expected number of PVs stored anywhere. Going forward, with the work that dcops is doing towards standard builds, we will most likely have that information in Netbox, and it could be added to the info we plan to pass down to Puppet, allowing us to piece all the parts together.

In the meantime, I'm not sure there is a quick workaround we could apply (like checking for an even number of disks, which might not hold in the general case, or hardcoding the expected value somewhere).
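
To make the idea concrete, here is a rough sketch of what such a count check could look like. It is not an existing check: the function names, the Nagios-style return codes (0 = OK, 2 = CRITICAL), and the `expected` argument are all assumptions, and `expected` would have to come from a source of truth we don't have yet (e.g. Netbox, or a hardcoded per-host value). It counts the same `Firmware state` lines as `megacli -PDList -aALL | grep -i 'firmware state'` in the task description.

```python
from typing import Tuple


def count_physical_disks(pdlist_output: str) -> int:
    """Count PDs by counting 'Firmware state' lines, case-insensitively."""
    return sum(
        1
        for line in pdlist_output.splitlines()
        if "firmware state" in line.lower()
    )


def check_pd_count(pdlist_output: str, expected: int) -> Tuple[int, str]:
    """Compare the reported PD count with the expected one.

    Returns a (status, message) pair using Nagios-style codes, which is
    an assumption about how the alerting would consume the result.
    """
    found = count_physical_disks(pdlist_output)
    if found == expected:
        return 0, f"OK: {found} physical disks reported"
    return 2, f"CRITICAL: {found} physical disks reported, expected {expected}"


# Sample mirroring this task: 13 PDs reported where 14 were expected.
sample = "\n".join(["Firmware state: Online, Spun Up"] * 13)
print(check_pd_count(sample, expected=14))
```

A check like this would only flag a count mismatch; it would not say which disk went missing.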

Thank you for the context, now I also recall a similar failure mode where we wished we had the number of expected disks! Indeed, I'm not sure either that there's a quick generic workaround. For swift specifically we do know the number of disks, since puppet formats additional partitions post d-i. This also makes puppet fail on this particular issue (and on broken disks), which means we won't miss further occurrences. I'll leave it up to you whether to use this task for tracking the "expected number of disks" issue if needed.

Volans triaged this task as Medium priority. · Jul 1 2021, 10:35 AM

Ack, let's keep it around for now to explore what options we have.

@Marostegui Good question, I'm not aware of other occurrences of the same issue, so it can probably be closed. @fgiunchedi any thoughts?

fgiunchedi claimed this task.

Agreed, I'm not aware of further occurrences. I'll be BOLD and resolve the task, thank you!