ms-be2043 'sdd' throwing lots of errors
Closed, Resolved, Public

Description

On May 1st sdd threw one 'medium error', and it has been throwing many errors (all for the same sector) since.

syslog:May  6 19:07:07 ms-be2043 kernel: [1663397.029420] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog:May  6 19:07:07 ms-be2043 kernel: [1663397.143311] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog:May  6 19:07:07 ms-be2043 kernel: [1663397.233289] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [1299957.031634] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [1299957.071272] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [1299957.124340] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [1299957.161142] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:17:48 ms-be2043 kernel: [1300412.076706] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:17:48 ms-be2043 kernel: [1300412.122724] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:17:48 ms-be2043 kernel: [1300412.171265] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:31 ms-be2043 kernel: [1300454.365840] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:31 ms-be2043 kernel: [1300454.425160] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:31 ms-be2043 kernel: [1300454.469737] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:31 ms-be2043 kernel: [1300454.519819] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:31 ms-be2043 kernel: [1300455.129463] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:42 ms-be2043 kernel: [1300466.079200] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:42 ms-be2043 kernel: [1300466.147393] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:42 ms-be2043 kernel: [1300466.188098] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.228179] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.263569] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.301320] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.348356] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.382976] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.424650] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:18:43 ms-be2043 kernel: [1300466.465320] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:19:28 ms-be2043 kernel: [1300511.565577] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:37:52 ms-be2043 kernel: [1301616.143616] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:37:55 ms-be2043 kernel: [1301619.220298] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:37:56 ms-be2043 kernel: [1301619.323860] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:37:56 ms-be2043 kernel: [1301619.804557] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:45:17 ms-be2043 kernel: [1302060.817209] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:45:54 ms-be2043 kernel: [1302097.933325] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:45:54 ms-be2043 kernel: [1302097.972084] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:45:54 ms-be2043 kernel: [1302098.010523] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:46:02 ms-be2043 kernel: [1302106.211823] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:53:31 ms-be2043 kernel: [1302555.164794] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:53:31 ms-be2043 kernel: [1302555.263114] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:53:31 ms-be2043 kernel: [1302555.346404] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:55:06 ms-be2043 kernel: [1302649.635148] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:55:06 ms-be2043 kernel: [1302649.674508] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:55:06 ms-be2043 kernel: [1302649.721117] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 14:58:27 ms-be2043 kernel: [1302850.729533] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 15:00:23 ms-be2043 kernel: [1302967.263219] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 15:00:23 ms-be2043 kernel: [1302967.301453] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 15:00:23 ms-be2043 kernel: [1302967.339690] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.4.gz:May  2 15:01:43 ms-be2043 kernel: [1303046.686366] blk_update_request: I/O error, dev sdd, sector 4432735456
syslog.5.gz:May  1 18:07:30 ms-be2043 kernel: [1227788.892047] blk_update_request: I/O error, dev sdd, sector 4432735456

Not sure if we have a usual procedure for this. Does it make sense to write 0s over that sector (in an attempt to get the drive firmware to remap it) and then, if that succeeds, make a fresh filesystem on the disk and let replication do its thing?
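
To put numbers on the log excerpt above, the matching kernel lines can be tallied per day, device and sector. This is only a minimal sketch, not something that was run for this task: `LINE_RE` and `count_io_errors` are hypothetical names, and the regular expression assumes the syslog layout (with an optional grep-style `file:` prefix) seen in this ticket.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt in this ticket:
#   syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [...] blk_update_request: I/O error, dev sdd, sector 4432735456
LINE_RE = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:"
    r".*blk_update_request: I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)"
)


def count_io_errors(lines):
    """Count I/O error lines keyed by (day, device, sector)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            day = f"{m.group('month')} {int(m.group('day'))}"
            counts[(day, m.group("dev"), int(m.group("sector")))] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "syslog.5.gz:May  1 18:07:30 ms-be2043 kernel: [1227788.892047] blk_update_request: I/O error, dev sdd, sector 4432735456",
    "syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [1299957.031634] blk_update_request: I/O error, dev sdd, sector 4432735456",
    "syslog.4.gz:May  2 14:10:13 ms-be2043 kernel: [1299957.071272] blk_update_request: I/O error, dev sdd, sector 4432735456",
    "syslog:May  6 19:07:07 ms-be2043 kernel: [1663397.029420] blk_update_request: I/O error, dev sdd, sector 4432735456",
]

sample_counts = count_io_errors(sample)
for key in sorted(sample_counts):
    print(key, sample_counts[key])
```

A single (device, sector) pair dominating the tally, as here, points at one bad spot on the disk rather than a controller-wide problem.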

Event Timeline

Dzahn triaged this task as Normal priority. May 6 2019, 10:22 PM
Dzahn added a subscriber: Dzahn. May 6 2019, 10:32 PM

Other tickets where ms-be disks died with "blk_update_request: I/O error" or similar:

T184053, T183896, T218544, T136395, T163690, T166021

AFAICT, we have generally just replaced the disks (unless a reboot fixed a controller that went rogue).

Indeed, it is usually a disk replacement, under warranty in this case.

fgiunchedi assigned this task to Papaul. May 10 2019, 9:15 AM

We're seeing errors on the disk in slot 3 on this host. Could we get it replaced under warranty? Thanks!

The drive should be blinking:

root@ms-be2043:~# megacli -PdLocate -start -physdrv \[32:3\] -aALL

Create Dispatch: Success
You have successfully submitted request SR990663287.

Papaul reassigned this task from Papaul to fgiunchedi. May 16 2019, 2:30 PM
Papaul added a subscriber: Papaul.

@fgiunchedi disk replaced

Thanks @Papaul! Turns out I gave you the wrong instructions: sdf is in slot 3, not sdd :(
Not a huge problem on the swift side, but we'll have to figure out the right device <-> slot mappings.

@fgiunchedi so what do you want to do here? I still have the old disk with me. Do you want me to keep it with me and not ship it back to Dell for now?

Yes please keep the disk for now! Thank you

Trying to debug why slot 3 on the controller wasn't in fact mapped to sdd. In the past (on older hw generations?) the SCSI address each disk was mapped to corresponded to the slot on the RAID controller, in this case:

# ls -la /dev/disk/by-path/ | grep /sdd$
lrwxrwxrwx 1 root root   9 May 16 14:47 pci-0000:02:00.0-scsi-0:2:3:0 -> ../../sdd
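
The lookup above can be done for all disks at once by parsing the `ls -la /dev/disk/by-path/` listing. The sketch below is illustrative only: `BY_PATH_RE` and `scsi_targets` are hypothetical names, and reading the four numbers as host:channel:target:lun is an assumption about how the by-path name is composed. As the rest of this task shows, the target number did not match the physical slot on this host, so the result should not be read as a slot map.

```python
import re

# Assumed link format, taken from the listing above:
#   pci-0000:02:00.0-scsi-0:2:3:0 -> ../../sdd
# Partition links (…-part1 -> ../../sdd1) do not match and are skipped.
BY_PATH_RE = re.compile(
    r"pci-(?P<pci>[0-9a-f:.]+)-scsi-"
    r"(?P<host>\d+):(?P<channel>\d+):(?P<target>\d+):(?P<lun>\d+)"
    r" -> \.\./\.\./(?P<dev>sd[a-z]+)$"
)


def scsi_targets(listing_lines):
    """Return {target number: device name} for whole-disk by-path links."""
    mapping = {}
    for line in listing_lines:
        m = BY_PATH_RE.search(line.rstrip())
        if m:
            mapping[int(m.group("target"))] = m.group("dev")
    return mapping


listing = [
    "lrwxrwxrwx 1 root root   9 May 16 14:47 pci-0000:02:00.0-scsi-0:2:3:0 -> ../../sdd",
]
targets = scsi_targets(listing)
print(targets)
```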

Thus I thought the slot was 3. It turns out, though, that upon replacement sdf disappeared/gave errors, and indeed slot 3 is now unconfigured (good):

Enclosure Device ID: 32
Slot Number: 3
Enclosure position: 1
Device Id: 3
WWN: 5000c500b333a0a7
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: DB34
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b318b836c3
Connected Port Number: 0(path0) 
Inquiry Data:             XXX
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :31C (87.80 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Drive's NCQ setting : N/A
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Drive has flagged a S.M.A.R.T alert : No

I thought the "Drive's position" field in megacli might provide the mapping to the SCSI target, but that doesn't seem to be the case: in the current situation the drive groups have shifted (sdf, the disk we just replaced, doesn't have a group).

# megacli -PDList -aALL | grep -i -e 'firmware state:' -e 's position'
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 3, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 4, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Firmware state: Unconfigured(good), Spun Up
Drive's position: DiskGroup: 5, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 6, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 7, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 8, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 9, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 10, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 11, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 12, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 1, Span: 0, Arm: 0
Firmware state: Online, Spun Up
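
Spotting the drive without a disk group in the filtered output above is easy to get wrong by eye, so here is a minimal sketch of how it could be done programmatically. It assumes exactly the two line shapes produced by the `grep` above (a "Firmware state" line optionally preceded by a "Drive's position" line). `parse_pdlist` and `ungrouped` are hypothetical names, and `index` is only the order of appearance in the listing; whether that equals the slot number is an assumption.

```python
import re

POSITION_RE = re.compile(r"Drive's position: DiskGroup: (\d+), Span: (\d+), Arm: (\d+)")
STATE_RE = re.compile(r"Firmware state: (.+)")


def parse_pdlist(lines):
    """Pair each 'Firmware state' line with the preceding 'Drive's position', if any."""
    drives = []
    disk_group = None
    for line in lines:
        pos = POSITION_RE.search(line)
        if pos:
            disk_group = int(pos.group(1))
            continue
        state = STATE_RE.search(line)
        if state:
            drives.append(
                {
                    "index": len(drives),
                    "disk_group": disk_group,
                    "state": state.group(1).strip(),
                }
            )
            disk_group = None  # a position line applies to one drive only
    return drives


def ungrouped(drives):
    """Indexes of drives that had no 'Drive's position' line."""
    return [d["index"] for d in drives if d["disk_group"] is None]


# First entries of the output above.
sample = """\
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 3, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Drive's position: DiskGroup: 4, Span: 0, Arm: 0
Firmware state: Online, Spun Up
Firmware state: Unconfigured(good), Spun Up
Drive's position: DiskGroup: 5, Span: 0, Arm: 0
Firmware state: Online, Spun Up
""".splitlines()

drives = parse_pdlist(sample)
print(ungrouped(drives))
```
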
fgiunchedi moved this task from Backlog to Doing on the User-fgiunchedi board. May 21 2019, 9:59 AM
faidon added a subscriber: faidon. May 23 2019, 12:14 AM

I'm not at all sure, but I don't see an LD 5 at all. Is it possible that instead of remaining as a degraded LD (with a failed disk) it got removed entirely somehow and that's what's causing the renumbering of LDs > 6 to smaller sd letters?

I also don't see what disk groups would have to do with this? I'd expect the SCSI devices to map 1:1 to LDs, not PDs or disk groups, but I may be missing something here too :)

Thanks for taking a look!

It is certainly possible that LD 5 isn't there at all because of the replacement of sdf. Once I noticed the mistake (sdd vs sdf) I stopped operating on the RAID controller config so as not to muddy the waters. I suspect issuing -CfgEachDiskRaid0 will bring LD 5 back, though. Also, to clarify, in this instance there's no renumbering (yet?) because we didn't reboot the host.

In past megaraid configurations (i.e. legacy, where SSDs were last) we were able to map SCSI devices 1:1 to PDs too, although that's clearly not the case anymore! Going through LDs, the mapping is indeed there as expected.

Mentioned in SAL (#wikimedia-operations) [2019-05-28T15:53:19Z] <godog> put back wrongly-replaced sdf on ms-be2043 - T222654

Disk has been replaced, thanks @Papaul!

Return information below

Mentioned in SAL (#wikimedia-operations) [2019-05-29T07:40:54Z] <godog> ms-be2043 start sdd rebuild - T222654

fgiunchedi closed this task as Resolved. Jun 13 2019, 1:33 PM

All done, resolving.