
Broken disk on analytics1072
Closed, Resolved | Public | 0 Estimated Story Points

Description

Hi Chris!

It seems that a disk failed on analytics1072:

[Mon Jun 24 17:05:47 2019] EXT4-fs warning (device sdb1): ext4_end_bio:314: I/O error -5 writing to inode 13540906 (offset 0 size 53248 starting block 54383723)
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383454
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383455
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383456
[Mon Jun 24 17:05:47 2019] EXT4-fs (sdb1): previous I/O error to superblock detected
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383457
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383458
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383459
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383460
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383461
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383462
[Mon Jun 24 17:05:47 2019] Buffer I/O error on device sdb1, logical block 54383463

Megacli reports the VD with status Optimal, and I don't see any media errors registered (yet), but the kernel has already flagged the partition as read-only. If possible, I'd like to ask for a new disk to replace the broken one :)

Thanks!
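
A kernel log like the one in the description can be summarised per device before filing a hardware request. The following is only a minimal sketch: the function name `io_errors_by_device` is hypothetical, and it assumes the `Buffer I/O error on device ...` line format shown above.

```shell
# Count "Buffer I/O error" lines per block device in kernel log text read from stdin.
# Assumes the line format pasted in this task; other kernel messages are ignored.
io_errors_by_device() {
  awk '
    /Buffer I\/O error on device/ {
      for (i = 1; i < NF; i++) {
        if ($i == "device") {
          dev = $(i + 1)
          sub(/,$/, "", dev)   # strip the trailing comma after the device name
          count[dev]++
        }
      }
    }
    END { for (d in count) print d, count[d] }
  ' | sort
}

# Possible usage on the affected host (needs permission to read the kernel log):
#   dmesg -T | io_errors_by_device
```

Each output line is a device name followed by the number of buffer I/O errors logged for it.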

Event Timeline

elukey created this task. Jun 25 2019, 6:23 AM
fdans moved this task from Incoming to Radar on the Analytics board. Jul 1 2019, 3:47 PM

@elukey I am not sure which disk this is. I think it's a smaller SSD? Can you confirm the disk type and size, please?

elukey added a comment (edited). Jul 12 2019, 5:57 AM

For some reason megacli doesn't show the disk as failed, but:

elukey@analytics1072:~$ ls /var/lib/hadoop/data/b
ls: reading directory '/var/lib/hadoop/data/b': Input/output error

This is one of the 4TB disks, not the smaller one:

/dev/sdb1                                  3.6T  2.5T  1.1T  70% /var/lib/hadoop/data/b

I tried to umount the disk to run fsck, since PDList doesn't show any failure, and I realized that the kernel no longer seems to show the device. I tried to get the list of WWNs from megacli and compare them with /dev/disk/by-id (to find the only one missing and get its slot position from megacli), but the serials seem to differ.

The Dell ticket has been created: "You have successfully submitted request SR994463766". I did see the disk in megacli, so I am not sure the TSR report I sent them will include the disk. I did include what you pasted in this ticket showing sdb as failed. Hopefully that's enough to get a new disk shipped.

The disk is on its way.

I received the disk on-site, but I cannot tell which disk has failed: they all have green LEDs. @elukey, could you please let me know which disk slot it is, or let's coordinate to make the disk blink?
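
Once the enclosure and slot are known, MegaCli's `-PdLocate` option is one way to make a drive's LED blink. The helper below is a hypothetical dry-run sketch (the name `pd_locate_cmd` is not from this task): it only prints the command so it can be reviewed before anyone runs it, and the adapter number defaults to 0 as an assumption.

```shell
# Print (do not run) the megacli command that starts or stops the locate LED.
# Arguments: start|stop ENCLOSURE SLOT [ADAPTER]; ADAPTER defaults to 0 (assumption).
pd_locate_cmd() {
  action=$1
  enclosure=$2
  slot=$3
  adapter=${4:-0}
  case "$action" in
    start|stop) ;;
    *)
      echo "usage: pd_locate_cmd start|stop ENCLOSURE SLOT [ADAPTER]" >&2
      return 1
      ;;
  esac
  printf 'megacli -PdLocate -%s -physdrv[%s:%s] -a%s\n' \
    "$action" "$enclosure" "$slot" "$adapter"
}
```

The enclosure and slot values have to come from the `-PDList` output of the host in question; the printed command would then be run with sudo on that host, and again with `stop` after the swap.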

Mentioned in SAL (#wikimedia-operations) [2019-07-16T15:37:01Z] <elukey> reboot analytics1072 as attempt to force the raid controller to set a drive failed - T226467

Seems to have worked:

elukey@analytics1072:~$ sudo megacli -PDList -aALL | grep "Firmware state"
Firmware state: Unconfigured(good), Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up

Enclosure Device ID: 32
Slot Number: 0
Enclosure position: 1
Device Id: 0
WWN: 5000c500afcc79ce
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 3.638 TB [0x1d1c0beb0 Sectors]
Non Coerced Size: 3.637 TB [0x1d1b0beb0 Sectors]
Coerced Size: 3.637 TB [0x1d1b00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Unconfigured(good), Spun Up
Device Firmware Level: DB34
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x500056b39ea373c0
Connected Port Number: 0(path0)
Inquiry Data:             ZC15A7CSST4000NM0265-2DC107                         DB34

@Cmjohnson let me know if you need more info or not :)
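
For reference, the slot of any drive that is not Online can be pulled out of the full `-PDList` output instead of grepping for the firmware state alone. This is a minimal sketch, assuming the `Slot Number:` and `Firmware state:` line formats pasted above; the function name `non_online_slots` is hypothetical.

```shell
# Read `megacli -PDList -aALL` output on stdin and print "slot<TAB>state"
# for every physical drive whose firmware state does not start with "Online".
non_online_slots() {
  awk -F': *' '
    /^Slot Number:/ { slot = $2 }                      # remember the current drive
    /^Firmware state:/ && $2 !~ /^Online/ { print slot "\t" $2 }
  '
}

# Possible usage on the host:
#   sudo megacli -PDList -aALL | non_online_slots
```

The sketch relies on `Slot Number:` appearing before `Firmware state:` within each drive's block, as it does in the output above.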

@elukey the disk has been replaced. It is still in the Unconfigured(good) state and needs to be mapped back to Virtual Drive: 1 (Target Id: 1), Slot Number: 0.

Doing this with megacli is a little out of my wheelhouse. I may be able to do it in the RAID BIOS. I know there are several people who can probably help map the disk.

cmjohnson@analytics1072:~$ sudo megacli -LdPdInfo -aall | grep -e 'Virtual Drive' -e Slot
Virtual Drive: 0 (Target Id: 0)
Slot Number: 12
Slot Number: 13
Virtual Drive: 2 (Target Id: 2)
Slot Number: 1
Virtual Drive: 3 (Target Id: 3)
Slot Number: 2
Virtual Drive: 4 (Target Id: 4)
Slot Number: 3
Virtual Drive: 5 (Target Id: 5)
Slot Number: 4
Virtual Drive: 6 (Target Id: 6)
Slot Number: 5
Virtual Drive: 7 (Target Id: 7)
Slot Number: 6
Virtual Drive: 8 (Target Id: 8)
Slot Number: 7
Virtual Drive: 9 (Target Id: 9)
Slot Number: 8
Virtual Drive: 10 (Target Id: 10)
Slot Number: 9
Virtual Drive: 11 (Target Id: 11)
Slot Number: 10
Virtual Drive: 12 (Target Id: 12)
Slot Number: 11
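
The gap in the listing above (no Virtual Drive 1) can also be found mechanically. A minimal sketch, assuming virtual drive IDs are expected to be contiguous from 0 and that lines start with `Virtual Drive: N` as shown; the function name `missing_virtual_drives` is hypothetical.

```shell
# Read `megacli -LdPdInfo -aall` output on stdin and print every virtual drive
# ID between 0 and the highest ID seen that does not appear in the output.
missing_virtual_drives() {
  awk '
    BEGIN { max = -1 }
    /^Virtual Drive:/ {
      id = $3 + 0            # third field is the numeric drive ID
      seen[id] = 1
      if (id > max) max = id
    }
    END {
      for (i = 0; i <= max; i++)
        if (!(i in seen)) print i
    }
  '
}

# Possible usage on the host:
#   sudo megacli -LdPdInfo -aall | missing_virtual_drives
```

A missing drive with the highest ID would not be detected, since the sketch only looks for gaps below the largest ID it sees.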

Change 523851 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove host specific hiera settings for analytics1072

https://gerrit.wikimedia.org/r/523851

Change 523851 merged by Elukey:
[operations/puppet@production] Remove host specific hiera settings for analytics1072

https://gerrit.wikimedia.org/r/523851

elukey closed this task as Resolved. Jul 17 2019, 6:44 AM

@Cmjohnson thanks a lot! I had to reboot again to be able to configure the new PD, not really sure why (the megacli commands were failing before the reboot and succeeding afterwards). All good now!

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM