
Broken disk on analytics1072
Closed, Resolved | Public | 0 Estimated Story Points

Description

Hi Chris!

It seems that a disk failed on analytics1072:

[Mon Jun 24 17:05:47 2019] EXT4-fs warning (device sdb1): ext4_end_bio:314: I/O error -5 writing to inode 13540906 (offset 0 size 53248 starting block 54383723)
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383454
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383455
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383456
[Mon Jun 24 17:05:47 2019] EXT4-fs (sdb1): previous I/O error to superblock detected
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383457
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383458
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383459
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383460
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383461
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383462
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383463
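
For anyone triaging similar logs, here is a minimal sketch that summarizes the "Buffer I/O error" lines per device. It assumes the log format matches the lines pasted above; the function name `summarize_buffer_errors` is made up for illustration and is not part of any existing tooling.

```python
import re

# Matches kernel log lines such as:
# [Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383454
BUFFER_ERR = re.compile(r"Buffer I/O error on device ([^,\s]+), logical block (\d+)")


def summarize_buffer_errors(log_text):
    """Return {device: (error_count, lowest_block, highest_block)}."""
    blocks = {}
    for line in log_text.splitlines():
        match = BUFFER_ERR.search(line)
        if match:
            blocks.setdefault(match.group(1), []).append(int(match.group(2)))
    return {dev: (len(b), min(b), max(b)) for dev, b in blocks.items()}
```

Fed the ten buffer error lines above, this reports a single device, sdb1, with 10 errors spanning logical blocks 54383454 to 54383463.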

Megacli reports the VD with status Optimal, and I don't see any media errors registered (yet), but the kernel has already flagged the partition as read-only. If possible, I'd like to ask for a new disk to replace the broken one :)

Thanks!

Event Timeline

@elukey I am not sure which disk this is. I think it's a smaller SSD? Can you confirm the disk type and size, please?

For some reason megacli doesn't show the disk as failed, but:

elukey@analytics1072:~$ ls /var/lib/hadoop/data/b
ls: reading directory '/var/lib/hadoop/data/b': Input/output error

This is one of the 4TB disks, not the smaller one:

/dev/sdb1                                  3.6T  2.5T  1.1T  70% /var/lib/hadoop/data/b

I tried to umount the disk to run fsck, since PDList doesn't show any failure, and I realized that the kernel no longer seems to expose the device. I tried to get a list of WWNs from megacli and compare them with /dev/disk/by-id (to find the only one missing and get the slot position from megacli), but the serials seem to differ.
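
The comparison described above can be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, and it assumes that PDList prints lines of the form `WWN: 5000c500afcc79ce` and that the kernel exposes entries named like `wwn-0x5000c500afcc79ce`. On this host the identifiers did not line up, so the result may simply list every disk. One possible reason, which is an assumption and not confirmed in this task, is that the kernel sees the controller's virtual drives rather than the physical disks.

```python
import re


def megacli_wwns(pdlist_text):
    """Collect WWN values from `megacli -PDList -aALL` output, lowercased."""
    pattern = re.compile(r"^WWN:\s*([0-9A-Fa-f]+)\s*$", re.MULTILINE)
    return {m.group(1).lower() for m in pattern.finditer(pdlist_text)}


def kernel_wwns(by_id_names):
    """Extract WWNs from /dev/disk/by-id names like 'wwn-0x5000c500afcc79ce'.

    Partition links such as 'wwn-0x...-part1' are skipped.
    """
    found = set()
    for name in by_id_names:
        m = re.fullmatch(r"wwn-0x([0-9A-Fa-f]+)", name)
        if m:
            found.add(m.group(1).lower())
    return found


def missing_from_kernel(pdlist_text, by_id_names):
    """WWNs known to the controller that have no matching by-id entry."""
    return sorted(megacli_wwns(pdlist_text) - kernel_wwns(by_id_names))
```

In practice `by_id_names` could come from `os.listdir("/dev/disk/by-id")` and `pdlist_text` from the saved PDList output.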

The Dell ticket has been created: "You have successfully submitted request SR994463766". I did see the disk in megacli, so I am not sure the TSR report I sent them will include the disk. I did include what you pasted in this ticket showing sdb as failed. Hopefully, that's enough to get a new disk shipped.

I received the disk on-site, but I cannot tell which disk has failed; they all have green LEDs. @elukey could you please let me know which disk slot it is, or let's coordinate to make the disk blink.

Mentioned in SAL (#wikimedia-operations) [2019-07-16T15:37:01Z] <elukey> reboot analytics1072 as attempt to force the raid controller to set a drive failed - T226467

Seems to have worked:

elukey@analytics1072:~$ sudo megacli -PDList -aALL | grep "Firmware state"
Firmware state: Unconfigured(good), Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

Enclosure Device ID: 32
Slot Number: 0
Enclosure position: 1
Device Id: 0
WWN: 5000c500afcc79ce
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: DB34
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b39ea373c0
Connected Port Number: 0(path0)
Inquiry Data:             ZC15A7CSST4000NM0265-2DC107                         DB34
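
To go from the list of firmware states to the affected slot without reading the full dump by eye, something like the sketch below could work. It assumes each drive record in the PDList output starts with an `Enclosure Device ID` line, as in the excerpt above; the function names are hypothetical.

```python
def parse_pdlist(text):
    """Split `megacli -PDList -aALL` output into one dict per physical drive."""
    drives, current = [], None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines or lines without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # Assumed to be the first field of every drive record.
            current = {}
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives


def not_online(drives):
    """Return (slot, firmware state) for every drive that is not Online."""
    return [
        (d.get("Slot Number"), d.get("Firmware state"))
        for d in drives
        if not d.get("Firmware state", "").startswith("Online")
    ]
```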

@Cmjohnson let me know if you need more info or not :)

@elukey the disk has been replaced. It is still in the Unconfigured(good) state and needs to be mapped back to Virtual Drive: 1 (Target Id: 1), Slot Number: 0.

Doing this with megacli is a little out of my wheelhouse. I may be able to do it in the RAID BIOS. I know there are several people who can probably help map the disk.

cmjohnson@analytics1072:~$ sudo megacli -LdPdInfo -aall | grep -e 'Virtual Drive' -e Slot
Virtual Drive: 0 (Target Id: 0)
Slot Number: 12
Slot Number: 13
Virtual Drive: 2 (Target Id: 2)
Slot Number: 1
Virtual Drive: 3 (Target Id: 3)
Slot Number: 2
Virtual Drive: 4 (Target Id: 4)
Slot Number: 3
Virtual Drive: 5 (Target Id: 5)
Slot Number: 4
Virtual Drive: 6 (Target Id: 6)
Slot Number: 5
Virtual Drive: 7 (Target Id: 7)
Slot Number: 6
Virtual Drive: 8 (Target Id: 8)
Slot Number: 7
Virtual Drive: 9 (Target Id: 9)
Slot Number: 8
Virtual Drive: 10 (Target Id: 10)
Slot Number: 9
Virtual Drive: 11 (Target Id: 11)
Slot Number: 10
Virtual Drive: 12 (Target Id: 12)
Slot Number: 11
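
The gap in the listing above (no Virtual Drive 1, no Slot 0) can also be found programmatically. A minimal sketch, assuming the grep output keeps the "Virtual Drive" line before its "Slot Number" lines as shown; `total_slots` is an assumption the caller supplies (14 here, going by the 14 firmware state lines earlier), and the function names are hypothetical.

```python
import re


def parse_vd_slots(text):
    """Map virtual drive id -> list of slot numbers from the filtered output."""
    mapping, current = {}, None
    for line in text.splitlines():
        vd = re.match(r"\s*Virtual Drive:\s*(\d+)", line)
        slot = re.match(r"\s*Slot Number:\s*(\d+)", line)
        if vd:
            current = int(vd.group(1))
            mapping[current] = []
        elif slot and current is not None:
            mapping[current].append(int(slot.group(1)))
    return mapping


def find_gaps(mapping, total_slots):
    """Return (missing virtual drive ids, slots not used by any virtual drive)."""
    used = {s for slots in mapping.values() for s in slots}
    highest = max(mapping) if mapping else -1
    missing_vds = [v for v in range(highest + 1) if v not in mapping]
    free_slots = [s for s in range(total_slots) if s not in used]
    return missing_vds, free_slots
```

For the output above with `total_slots=14`, this reports Virtual Drive 1 as missing and Slot 0 as unused, which matches the replaced disk.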

Change 523851 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove host specific hiera settings for analytics1072

https://gerrit.wikimedia.org/r/523851

Change 523851 merged by Elukey:
[operations/puppet@production] Remove host specific hiera settings for analytics1072

https://gerrit.wikimedia.org/r/523851

@Cmjohnson thanks a lot! I had to reboot again to be able to configure the new PD, not really sure why (the megacli commands were failing before the reboot and succeeding afterwards). All good now!