
ms-be1043 sdk failed
Closed, Resolved (Public)

Description

The disk sdk failed on ms-be1043 earlier today. While investigating why an automatic task wasn't opened, I found that the PD isn't there at all (see below), and neither is the LD.

I believe this is a new failure mode; at least I don't remember seeing anything like it before. cc @Volans

Mar 18 01:49:08 ms-be1043 kernel: [12660732.334403] sd 0:2:10:0: [sdk] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:08 ms-be1043 kernel: [12660732.334425] sd 0:2:10:0: [sdk] tag#0 CDB: Read(16) 88 00 00 00 00 01 b0 56 7c a0 00 00 00 08 00 00
Mar 18 01:49:08 ms-be1043 kernel: [12660732.334432] blk_update_request: I/O error, dev sdk, sector 7253425312
Mar 18 01:49:08 ms-be1043 kernel: [12660732.341990] XFS (sdk1): metadata I/O error: block 0x1b05674a0 ("xfs_trans_read_buf_map") error 5 numblks 8
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450147] sd 0:2:10:0: [sdk] tag#56 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450154] sd 0:2:10:0: [sdk] tag#56 CDB: Read(16) 88 00 00 00 00 01 37 ea 7a c0 00 00 00 08 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450158] blk_update_request: I/O error, dev sdk, sector 5233081024
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450171] sd 0:2:10:0: [sdk] tag#53 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450176] sd 0:2:10:0: [sdk] tag#53 CDB: Read(16) 88 00 00 00 00 01 00 b0 e8 40 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450179] blk_update_request: I/O error, dev sdk, sector 4306561088
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450189] sd 0:2:10:0: [sdk] tag#51 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450192] sd 0:2:10:0: [sdk] tag#47 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450195] sd 0:2:10:0: [sdk] tag#46 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450198] sd 0:2:10:0: [sdk] tag#43 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450200] sd 0:2:10:0: [sdk] tag#44 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450203] sd 0:2:10:0: [sdk] tag#51 CDB: Read(16) 88 00 00 00 00 01 72 0d 09 a0 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450204] sd 0:2:10:0: [sdk] tag#47 CDB: Read(16) 88 00 00 00 00 01 7d f3 9e e0 00 00 00 08 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450207] sd 0:2:10:0: [sdk] tag#46 CDB: Read(16) 88 00 00 00 00 01 01 96 40 20 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450209] sd 0:2:10:0: [sdk] tag#43 CDB: Read(16) 88 00 00 00 00 00 9a 24 16 a0 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450212] blk_update_request: I/O error, dev sdk, sector 6208424352
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450214] sd 0:2:10:0: [sdk] tag#44 CDB: Read(16) 88 00 00 00 00 00 c8 f4 4f c0 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450218] blk_update_request: I/O error, dev sdk, sector 6408085216
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450222] blk_update_request: I/O error, dev sdk, sector 4321591328
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450225] blk_update_request: I/O error, dev sdk, sector 2586056352
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450227] blk_update_request: I/O error, dev sdk, sector 3371454400
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450273] XFS (sdk1): metadata I/O error: block 0x100b0e040 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450279] XFS (sdk1): metadata I/O error: block 0x17df396e0 ("xfs_trans_read_buf_map") error 5 numblks 8
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450285] XFS (sdk1): metadata I/O error: block 0xc8f447c0 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450290] XFS (sdk1): metadata I/O error: block 0x101963820 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450294] XFS (sdk1): metadata I/O error: block 0x9a240ea0 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450298] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450303] XFS (sdk1): metadata I/O error: block 0x1720d01a0 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450306] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450309] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450312] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.450315] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.451937] sd 0:2:10:0: [sdk] tag#68 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.451941] sd 0:2:10:0: [sdk] tag#68 CDB: Read(16) 88 00 00 00 00 00 bf ab d3 e0 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.451944] blk_update_request: I/O error, dev sdk, sector 3215709152
Mar 18 01:49:09 ms-be1043 kernel: [12660733.452007] XFS (sdk1): metadata I/O error: block 0xbfabcbe0 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.452014] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.452939] megaraid_sas 0000:02:00.0: scanning for scsi0...
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453297] megaraid_sas 0000:02:00.0: 1450 (606188930s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 0a/a
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453510] megaraid_sas 0000:02:00.0: 1451 (606188930s/0x0001/FATAL) - VD 0a/a is now OFFLINE
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453682] sd 0:2:10:0: [sdk] tag#191 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453685] blk_update_request: I/O error, dev sdk, sector 5301335056
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453687] sd 0:2:10:0: [sdk] tag#191 CDB: Read(16) 88 00 00 00 00 01 35 30 f9 e0 00 00 00 20 00 00
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453742] XFS (sdk1): metadata I/O error: block 0x13530f1e0 ("xfs_trans_read_buf_map") error 5 numblks 32
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453749] XFS (sdk1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453757] XFS (sdk1): metadata I/O error: block 0x13bfbec10 ("xfs_trans_read_buf_map") error 5 numblks 8
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453955] XFS (sdk1): metadata I/O error: block 0xe8ea413c ("xlog_iodone") error 5 numblks 64
Mar 18 01:49:09 ms-be1043 kernel: [12660733.453960] XFS (sdk1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-IWeKxA/linux-4.9.110/fs/xfs/xfs_log.c.  Return address = 0xffffffffc0aca882
Mar 18 01:49:09 ms-be1043 kernel: [12660733.454107] XFS (sdk1): Log I/O Error Detected.  Shutting down filesystem
Mar 18 01:49:09 ms-be1043 kernel: [12660733.454108] XFS (sdk1): Please umount the filesystem and rectify the problem(s)
Mar 18 01:49:19 ms-be1043 kernel: [12660743.342773] XFS (sdk1): Unmounting Filesystem
Mar 18 01:53:09 ms-be1043 kernel: [12660973.025021] megaraid_sas 0000:02:00.0: 1470 (606189171s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 8
root@ms-be1043:~# megacli -PDList -aALL | grep -e 'Slot Number' -e state:
Slot Number: 0
Firmware state: Online, Spun Up
Slot Number: 1
Firmware state: Online, Spun Up
Slot Number: 2
Firmware state: Online, Spun Up
Slot Number: 3
Firmware state: Online, Spun Up
Slot Number: 4
Firmware state: Online, Spun Up
Slot Number: 5
Firmware state: Online, Spun Up
Slot Number: 6
Firmware state: Online, Spun Up
Slot Number: 7
Firmware state: Online, Spun Up
Slot Number: 9
Firmware state: Online, Spun Up
Slot Number: 10
Firmware state: Online, Spun Up
Slot Number: 11
Firmware state: Online, Spun Up
Slot Number: 12
Firmware state: Online, Spun Up
Slot Number: 13
Firmware state: Online, Spun Up
root@ms-be1043:~# /usr/local/lib/nagios/plugins/
check_dpkg                check_ipmi_sensor         check_newest_file_age     check_raid                check_systemd_unit_state  
check_eth                 check_long_procs          check_puppetrun           check_systemd_state       get-raid-status-megacli   
root@ms-be1043:~# /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
=== RaidStatus completed
root@ms-be1043:~# grep sdk /proc/partitions 
root@ms-be1043:~#
root@ms-be1043:~# megacli -AdpAllInfo -aALL
...
                Device Present
                ================
Virtual Drives    : 13 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 13 
  Critical Disks  : 0 
  Failed Disks    : 0

Event Timeline

@fgiunchedi agree that this is a new issue, and we need to fix two different scripts to have an automatic task created for this:

  1. The check_raid script

The current check_raid script is not alarming in this case, and without an alarm the raid handler will never be called:

$ /usr/bin/sudo /usr/local/lib/nagios/plugins/check_raid megacli
OK: optimal, 13 logical, 13 physical
OK

Currently the only alarm we have on that host is from the failed puppet runs.

  2. The get-raid-status-megacli script

This script needs to keep track of the IDs. I recently added handling for the case in which the ID is present but no info for the drive is available.

Is there anything in the hardware logs that might have completely "disabled" the slot?

Also worth mentioning that tracking IDs to detect missing drives is not enough: if the last one fails, we need to know in advance how many drives are supposed to be there.
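The slot-tracking idea can be sketched as follows. This is a minimal, hypothetical helper (not the actual get-raid-status-megacli code) that parses `megacli -PDList` output and reports any slot missing from an expected set, which has to be known in advance for exactly the reason above:

```python
import re

# ms-be1043 has 14 drive bays (slots 0-13); this would need to be
# configured per host class rather than hardcoded.
EXPECTED_SLOTS = set(range(14))

def missing_slots(pdlist_output: str) -> set:
    """Return expected slots that have no PD entry in `megacli -PDList` output."""
    present = {int(m) for m in re.findall(r"Slot Number: (\d+)", pdlist_output)}
    return EXPECTED_SLOTS - present

# Simulate the PDList output shown in this task: slot 8 has vanished entirely.
sample = "\n".join(
    f"Slot Number: {n}\nFirmware state: Online, Spun Up"
    for n in range(14) if n != 8
)
print(missing_slots(sample))  # -> {8}
```

Comparing against a known-good baseline catches this failure mode, where the controller silently drops the PD instead of marking it failed.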

CDanis added a subscriber: CDanis. Mar 18 2019, 3:04 PM

@fgiunchedi we should set this device to 0 weight in the rings, yes? Happy to do the change if you'll review

Change 497557 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/software/swift-ring@master] eqiad-prod: 0 weight to ms-be1043/sdk1

https://gerrit.wikimedia.org/r/497557

jijiki triaged this task as Normal priority. Mar 19 2019, 6:08 PM

> @fgiunchedi we should set this device to 0 weight in the rings, yes? Happy to do the change if you'll review

For short / known disk failures I've generally avoided pulling disks from the ring; however, in this case I won't have time to diagnose this further and there's no ETR now. So yes, it's safer to pull the disk from the ring!

Change 497557 merged by CDanis:
[operations/software/swift-ring@master] eqiad-prod: 0 weight to ms-be1043/sdk1

https://gerrit.wikimedia.org/r/497557

You have successfully submitted request SR988320478.

A disk has been ordered

Dzahn removed a subscriber: Dzahn. Mar 27 2019, 12:10 PM

> You have successfully submitted request SR988320478.
> A disk has been ordered

Thanks! Are there any signs on the hardware itself that the disk has failed, like blinking LEDs? From the OS/controller side it looks like the physical disk simply disappeared.

Any updates on this @Cmjohnson ?

Change 508855 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] raid: update check_raid to detect missing disk

https://gerrit.wikimedia.org/r/508855

jbond added a subscriber: jbond. May 8 2019, 5:49 PM

> @fgiunchedi agree that this is a new issue, and we need to fix two different scripts to have an automatic task created for this:
>
>   1. The check_raid script
>
> The current check_raid script is not alarming in this case, and without an alarm the raid handler will never be called.

The change above has a first pass at this:

ms-be1043 ~ % sudo ./test.py                                                                 [15:59:38]
CRITICAL: Devices (14) not equal to  PDs (13)
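The comparison in that first pass can be illustrated like this (a sketch of the approach, not the actual check_raid or test.py code): pull the counters out of the "Device Present" section of `megacli -AdpAllInfo` and flag any mismatch.

```python
import re

def parse_device_present(adpallinfo: str) -> dict:
    """Extract counters from the 'Device Present' section of
    `megacli -AdpAllInfo` output."""
    counters = {}
    for key in ("Virtual Drives", "Physical Devices", "Disks", "Failed Disks"):
        m = re.search(rf"{key}\s*:\s*(\d+)", adpallinfo)
        if m:
            counters[key] = int(m.group(1))
    return counters

# The exact output captured on ms-be1043 in this task.
sample = """\
                Device Present
                ================
Virtual Drives    : 13
  Degraded        : 0
  Offline         : 0
Physical Devices  : 14
  Disks           : 13
  Critical Disks  : 0
  Failed Disks    : 0
"""
c = parse_device_present(sample)
print(c["Physical Devices"], c["Disks"])  # -> 14 13
```

Note that a bare `Physical Devices != Disks` check is fragile: the adapter counts the enclosure/backplane as a physical device, so a fixed offset between the two counters is the healthy baseline (the db2096 output later in this task, 11 devices vs 10 disks on a chassis that physically holds 10, shows exactly that false positive).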
fgiunchedi moved this task from Backlog to In progress on the observability board. May 13 2019, 3:14 PM

Change 508855 merged by Jbond:
[operations/puppet@production] raid: update check_raid to detect missing disk

https://gerrit.wikimedia.org/r/508855

jbond added a subscriber: jcrespo. Edited May 14 2019, 11:54 AM

The change above failed because the physicalDrive regex does not take into account the number of drives per span. However, further investigation shows we have the same discrepancy on other servers.

db2096:

                Device Present
                ================
Virtual Drives    : 1 
  Degraded        : 0 
  Offline         : 0 
Physical Devices  : 11 
  Disks           : 10 
  Critical Disks  : 0 
  Failed Disks    : 0

@jcrespo confirmed this machine only has 10 disks:

12:03 < jynus> jbond42 just to double check, db2096 is a
                          https://www.dell.com/en-us/work/shop/povw/poweredge-r440 where
                          physically no more than 10 drives fit there
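A more robust check derives the expected disk count from the LD configuration itself, summing span depth times drives-per-span across `megacli -LDInfo -LAll -aAll` output. The field names below ("Span Depth", "Number Of Drives per span") follow common MegaCLI output but should be verified against the deployed controller; the sample text here is a hypothetical excerpt, not captured output:

```python
import re

def expected_disks(ldinfo: str) -> int:
    """Sum span_depth * drives_per_span over every LD stanza in
    `megacli -LDInfo` output. Assumes both fields appear once per LD."""
    spans = [int(x) for x in re.findall(r"Span Depth\s*:\s*(\d+)", ldinfo)]
    # Non-spanned LDs may print "Number Of Drives" without "per span",
    # hence the optional group.
    drives = [int(x) for x in
              re.findall(r"Number Of Drives(?: per span)?\s*:\s*(\d+)", ldinfo)]
    return sum(s * d for s, d in zip(spans, drives))

# Hypothetical excerpt: 13 single-disk RAID-0 LDs, as on a swift backend.
sample = "Span Depth          : 1\nNumber Of Drives per span:1\n" * 13
print(expected_disks(sample))  # -> 13
```

Comparing this expected count against the "Disks" counter sidesteps both the missing-PD failure mode seen here and the enclosure-offset false positive seen on db2096.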
This comment was removed by Cmjohnson.
Volans moved this task from In progress to Radar on the observability board. Jun 24 2019, 3:04 PM

This is a Dell server. I will try to put in a ticket with Dell, but all the hardware reporting shows there isn't a problem, so I may have trouble getting Dell to give me a disk for an issue they don't think exists.

cmjohnson@ms-be1043:~$ sudo megacli -PDList -aALL |grep "Firmware state: Online"
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
cmjohnson@ms-be1043:~$ sudo megacli -PDList -aALL |grep "Predictive Failure Count"
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0
Predictive Failure Count: 0

Here is the Dell task: "You have successfully submitted request SR994463101."

jcrespo removed a subscriber: jcrespo. Jul 15 2019, 6:51 AM

> This is a dell server, I will try and put in a ticket with Dell but all h/w is showing that there isn't a problem so I may have trouble with Dell giving me a disk for an issue that they do not think exists.

Indeed, it seems sdk is now back together with its PD/LD and is mounted, presumably following a reboot for kernel upgrades 40 days ago. The disk currently has no weight in swift, so we can diagnose as needed.

Thanks @godog, is there any way you can put some stress on that disk? It's hard for me to justify a disk replacement to Dell when it shows that it is working.

@godog, no worries about the earlier comment. Dell approved the disk replacement. I will update the task once it's been replaced.

@godog I did get the new disk, but since it hasn't failed outright I am not sure which disk is actually bad on my end. Do you know which slot the disk is in? Or let's coordinate and see if you can make the disk blink.

@Cmjohnson should be blinking now

root@ms-be1043:~# ls -la /dev/disk/by-path/ | grep -i sdk$
lrwxrwxrwx 1 root root   9 May 30 11:06 pci-0000:02:00.0-scsi-0:2:10:0 -> ../../sdk
# LD 10
root@ms-be1043:~# megacli -LdPdInfo -aall | grep -e 'Virtual Drive' -e Slot
Virtual Drive: 0 (Target Id: 0)
Slot Number: 12
Virtual Drive: 1 (Target Id: 1)
Slot Number: 13
Virtual Drive: 2 (Target Id: 2)
Slot Number: 0
Virtual Drive: 3 (Target Id: 3)
Slot Number: 1
Virtual Drive: 4 (Target Id: 4)
Slot Number: 2
Virtual Drive: 5 (Target Id: 5)
Slot Number: 3
Virtual Drive: 6 (Target Id: 6)
Slot Number: 4
Virtual Drive: 7 (Target Id: 7)
Slot Number: 5
Virtual Drive: 8 (Target Id: 8)
Slot Number: 6
Virtual Drive: 9 (Target Id: 9)
Slot Number: 7
Virtual Drive: 10 (Target Id: 10)
Slot Number: 8
Virtual Drive: 11 (Target Id: 11)
Slot Number: 9
Virtual Drive: 12 (Target Id: 12)
Slot Number: 10
Virtual Drive: 13 (Target Id: 13)
Slot Number: 11
root@ms-be1043:~# megacli -PdLocate -start -physdrv \[32:8\] -aALL
                                     
Adapter: 0: Device at EnclId-32 SlotId-8  -- PD Locate Start Command was successfully sent to Firmware
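The mapping used above (block device → SCSI target id via /dev/disk/by-path → LD Target Id → PD slot) can be scripted. A sketch using the data from this host; the parsing assumes the exact output shapes shown in this session, so treat it as illustrative rather than general:

```python
import re

def target_from_by_path(link_name: str) -> int:
    """Extract the SCSI target id from a /dev/disk/by-path name like
    'pci-0000:02:00.0-scsi-0:2:10:0' (host:channel:target:lun)."""
    m = re.search(r"scsi-\d+:\d+:(\d+):\d+$", link_name)
    return int(m.group(1))

def slot_for_target(ldpdinfo: str, target: int) -> int:
    """Map an LD Target Id to its PD slot using `megacli -LdPdInfo` output,
    which pairs 'Virtual Drive: N (Target Id: N)' with 'Slot Number: M'."""
    pairs = re.findall(
        r"Virtual Drive: \d+ \(Target Id: (\d+)\)\s*\nSlot Number: (\d+)",
        ldpdinfo,
    )
    return {int(t): int(s) for t, s in pairs}[target]

# The relevant pair from the megacli output above: LD 10 lives in slot 8.
sample = "Virtual Drive: 10 (Target Id: 10)\nSlot Number: 8\n"
t = target_from_by_path("pci-0000:02:00.0-scsi-0:2:10:0")
print(t, slot_for_target(sample, t))  # -> 10 8
```

With the slot in hand, `megacli -PdLocate -start -physdrv [enclosure:slot] -aALL` blinks the right drive, as done above.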
Cmjohnson closed this task as Resolved. Jul 16 2019, 8:07 PM

@godog, the disk has been replaced; all should be good now. If you find that it's not, please re-open the task and ping me.

Mentioned in SAL (#wikimedia-operations) [2019-07-17T07:48:22Z] <godog> swift eqiad-prod: put back ms-be1043 sdk1 - T218544

Mentioned in SAL (#wikimedia-operations) [2019-07-18T11:02:11Z] <godog> swift eqiad-prod: put back ms-be1043 sdk1 - T218544