
Disk failure on an-worker1110
Closed, Resolved · Public

Assigned To
Authored By
BTullis
May 18 2023, 10:21 AM
Description

One of the hard drives in an-worker1110 has failed. It is currently /dev/sdf.

The dmesg logs are full of evidence, such as:

[Sat May 13 10:20:43 2023] megaraid_sas 0000:3b:00.0: 4723927 (737288548s/0x0002/FATAL) - Unrecoverable medium error during recovery on PD 04(e0x20/s4) at 19d9ffb14

[Sat May 13 10:20:52 2023] sd 0:2:5:0: [sdf] tag#57 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sat May 13 10:20:52 2023] sd 0:2:5:0: [sdf] tag#57 Sense Key : Medium Error [current] 

[Sat May 13 12:24:55 2023] megaraid_sas 0000:3b:00.0: 4725520 (737296028s/0x0021/FATAL) - Controller cache pinned for missing or offline VD 05/4
[Sat May 13 12:24:55 2023] megaraid_sas 0000:3b:00.0: 4725521 (737296028s/0x0001/FATAL) - VD 05/4 is now OFFLINE
[Sat May 13 12:26:13 2023] megaraid_sas 0000:3b:00.0: 4725550 (737296111s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 4

I will raise a hardware troubleshooting ticket to get the drive replaced.

Event Timeline

Icinga downtime and Alertmanager silence (ID=ecb8d071-fb01-4316-86ae-2c9d839d7dc0) set by btullis@cumin1001 for 4:00:00 on 1 host(s) and their services with reason: Troubleshooting failed disk

an-worker1110.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2023-05-18T10:31:53Z] <btullis> cold booting an-worker1110 to troubleshoot drive failure T336929

I'm going to power it down and cold boot it, just for good measure. megacli didn't even report the presence of a physical drive in slot 4, so I would like to boot it from cold to get it to report that the drive has failed.

btullis@cumin1001:~$ sudo ipmitool -I lanplus -H "an-worker1110.mgmt.eqiad.wmnet" -U root -E shell
Unable to read password from environment
Password: 
ipmitool> chassis power status
Chassis Power is off
ipmitool> chassis power on
Chassis Power Control: Up/On
ipmitool> sol activate
[SOL Session operational.  Use ~? for help]

Nicely scrambled error messages from the RAID controller on boot.
{F37009196,width=80%}

It detected a drive, but reported it as a foreign configuration. I'll give it another go to see if it can get the RAID 0 config back before calling in reinforcements.

No, even after re-importing the foreign configuration, the drive still fails to mount and blocks the boot.

image.png (878×773 px, 216 KB)

I will now raise this with DC-Ops.

I have temporarily commented out the /dev/sdf entry in /etc/fstab to allow a clean boot.
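For reference, a sketch of how that fstab entry can be commented out non-interactively. This is an assumption about the layout (the mount point sits in its own whitespace-delimited field); it operates on a working copy here rather than the real /etc/fstab.

```shell
# Sketch (assumption): comment out the fstab line for a given mount point.
# On the host this would target /etc/fstab (with a backup taken first).
FSTAB=fstab.copy
MOUNTPOINT=/var/lib/hadoop/data/f

# Create a working copy for illustration.
printf 'UUID=abc / ext4 defaults 0 1\nUUID=def %s ext4 defaults 0 2\n' "$MOUNTPOINT" > "$FSTAB"

# Prefix the matching line with '#', matching the mount point as a whole field.
# The '\|...|' form uses '|' as the regex delimiter so the path's slashes
# don't need escaping.
sed -i "\\|[[:space:]]${MOUNTPOINT}[[:space:]]|s|^|#|" "$FSTAB"
```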

I marked the drive as offline and configured the slot to blink, facilitating replacement.

sudo megacli -PDOffline -PhysDrv '[32:4]' -a0
sudo megacli -PDLocate -PhysDrv '[32:4]' -a0

Marked as waiting, until T336930: hw troubleshooting: disk replacement for an-worker1110.eqiad.wmnet is complete.

I tried adding the disk back in using the existing LD number.

btullis@an-worker1110:~$ sudo megacli -CfgLdAdd -r0 '[32:4]' -AfterLd2 -a0
                                     

Adapter 0: Configure Adapter Failed

FW error description: 
  The current operation is not allowed because the controller has data in cache for offline or missing virtual drives.  

Exit Code: 0x54

We can see that the cache for the missing virtual drive (Target ID 3) was preserved.

btullis@an-worker1110:~$ sudo megacli -GetPreservedCacheList -a0
                                     
Adapter #0

Virtual Drive(Target ID 03): Missing.

Exit Code: 0x00

Discarded this preserved cache.

btullis@an-worker1110:~$ sudo megacli -DiscardPreservedCache -L3 -a0
                                     
Adapter #0

Virtual Drive(Target ID 03): Preserved Cache Data Cleared.

Exit Code: 0x00

Tried again to add the disk.

btullis@an-worker1110:~$ sudo megacli -CfgLdAdd -r0 '[32:4]' -AfterLd2 -a0
                                     
Adapter 0: Created VD 3

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

Success!

The new device is detected by the operating system as /dev/sdf:

[Tue Jun 20 13:13:17 2023] sd 0:2:3:0: [sdf] 7812939776 512-byte logical blocks: (4.00 TB/3.64 TiB)
[Tue Jun 20 13:13:17 2023] sd 0:2:3:0: [sdf] Write Protect is off
[Tue Jun 20 13:13:17 2023] sd 0:2:3:0: [sdf] Mode Sense: 1f 00 00 08
[Tue Jun 20 13:13:17 2023] sd 0:2:3:0: Attached scsi generic sg3 type 0
[Tue Jun 20 13:13:17 2023] sd 0:2:3:0: [sdf] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Tue Jun 20 13:13:17 2023] sd 0:2:3:0: [sdf] Attached SCSI disk

I used the following commands to initialize /dev/sdf:

sudo parted /dev/sdf --script mklabel gpt
sudo parted /dev/sdf --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 -L hadoop-f /dev/sdf1
sudo tune2fs -m 0 /dev/sdf1

I obtained the UUID for the filesystem with:

sudo blkid /dev/sdf1

...then edited /etc/fstab to update the entry for /var/lib/hadoop/data/f and use this UUID value.
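The blkid-to-fstab step above can be sketched as a small helper. This is an illustration, not the exact edit made on the host: the function name is hypothetical, and it assumes the fstab line keys on `UUID=` with the mount point as the second whitespace-delimited field. On the real host the UUID would come from `sudo blkid -s UUID -o value /dev/sdf1`.

```shell
# Sketch (assumption): swap in a new filesystem UUID for a given mount point
# in an fstab-style file.
update_fstab_uuid() {
    local file=$1 mountpoint=$2 new_uuid=$3
    # On the line whose fields include the mount point, replace the
    # UUID=... token with the new value. '|' is used as the sed delimiter
    # so the path's slashes don't need escaping.
    sed -i "\\|[[:space:]]${mountpoint}[[:space:]]|s|UUID=[^[:space:]]*|UUID=${new_uuid}|" "$file"
}
```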

It mounts cleanly, but unfortunately it looks like another drive on the same host is also having problems.

btullis@an-worker1110:~$ sudo mount -a
mount: /var/lib/hadoop/data/d: can't find UUID=0addaf32-809a-4172-bc3c-e89933f358f3.

Checking dmesg -T confirms that this disk is also having a hard time. I can reboot the node and see if reinitializing it helps, but I suspect that this drive might also need replacing.

Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors with hard drive

I have fixed the errors with the other drive on this host. The host didn't boot, so I had to comment out the /var/lib/hadoop/data/d entry in /etc/fstab and reboot again.
I then checked the state of the physical disks with:

btullis@an-worker1110:~$ sudo megacli -PDList -aall|grep state
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Unconfigured(good), Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
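A quick way to spot the odd one out in that listing is to filter for any drive whose firmware state is not "Online, Spun Up". A sketch, with the filter function name being hypothetical and the sample input taken from the output above:

```shell
# Sketch: print any physical-drive firmware states that are not healthy.
# On the host this would be fed from: sudo megacli -PDList -aall | grep state
check_pd_states() {
    grep '^Firmware state:' | grep -v 'Online, Spun Up' || true
}

# Example with two of the lines captured above: only the
# Unconfigured(good) drive is printed.
printf 'Firmware state: Online, Spun Up\nFirmware state: Unconfigured(good), Spun Up\n' \
    | check_pd_states
```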

There was no preserved cache:

btullis@an-worker1110:~$ sudo megacli -GetPreservedCacheList -a0
                                     
Adapter 0: No Virtual Drive has Preserved Cache Data.

Exit Code: 0x00

Checking with sudo megacli -LDInfo -LAll -aAll, I could see that there was no Logical Drive 5, so I recreated this RAID 0 device with:

btullis@an-worker1110:~$ sudo megacli -CfgLdAdd -r0 '[32:2]' -AfterLD4 -a0
                                     
Adapter 0: Created VD 5

Adapter 0: Configured the Adapter!!

Exit Code: 0x00

This was then passed through to the OS as /dev/sdm.

btullis@an-worker1110:~$ sudo dmesg -T | tail
[Fri Jun 23 09:35:51 2023] Process accounting resumed
[Fri Jun 23 09:57:09 2023] perf: interrupt took too long (2616 > 2500), lowering kernel.perf_event_max_sample_rate to 76250
[Fri Jun 23 10:01:05 2023] perf: interrupt took too long (3387 > 3270), lowering kernel.perf_event_max_sample_rate to 59000
[Fri Jun 23 10:08:45 2023] scsi 0:2:5:0: Direct-Access     DELL     PERC H730P Mini  4.30 PQ: 0 ANSI: 5
[Fri Jun 23 10:08:45 2023] sd 0:2:5:0: Attached scsi generic sg12 type 0
[Fri Jun 23 10:08:45 2023] sd 0:2:5:0: [sdm] 7812939776 512-byte logical blocks: (4.00 TB/3.64 TiB)
[Fri Jun 23 10:08:45 2023] sd 0:2:5:0: [sdm] Write Protect is off
[Fri Jun 23 10:08:45 2023] sd 0:2:5:0: [sdm] Mode Sense: 1f 00 00 08
[Fri Jun 23 10:08:45 2023] sd 0:2:5:0: [sdm] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[Fri Jun 23 10:08:45 2023] sd 0:2:5:0: [sdm] Attached SCSI disk

I checked for a partition table, but there wasn't one, so I went ahead and made one:

btullis@an-worker1110:~$ sudo parted /dev/sdm
GNU Parted 3.2
Using /dev/sdm
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p                                                                
Error: /dev/sdm: unrecognised disk label
Model: DELL PERC H730P Mini (scsi)                                        
Disk /dev/sdm: 4000GB
Sector size (logical/physical): 512B/512B
Partition Table: unknown
Disk Flags: 
(parted) mklabel gpt
(parted) mkpart primary ext4 0% 100%                                      
(parted) q

I created a filesystem, set the reserved-blocks percentage to 0%, and obtained the UUID:

btullis@an-worker1110:~$ sudo mkfs.ext4 -L hadoop-d /dev/sdm1 
mke2fs 1.44.5 (15-Dec-2018)
Creating filesystem with 976616960 4k blocks and 244154368 inodes
Filesystem UUID: ed5a52ad-efd0-49f1-85d5-8202ceaf2b1a
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done       

btullis@an-worker1110:~$ sudo tune2fs -m 0 /dev/sdm1
tune2fs 1.44.5 (15-Dec-2018)
Setting reserved blocks percentage to 0% (0 blocks)
btullis@an-worker1110:~$ sudo blkid /dev/sdm1
/dev/sdm1: LABEL="hadoop-d" UUID="ed5a52ad-efd0-49f1-85d5-8202ceaf2b1a" TYPE="ext4" PARTLABEL="primary" PARTUUID="cb052806-24b5-4d64-85d0-86c0631880b2"

...then updated /etc/fstab with this UUID value and mounted the volume.

I'll reboot it once more to check that it boots properly; this should also initialize the new drive with HDFS.

Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors with hard drive

Mentioned in SAL (#wikimedia-analytics) [2023-06-23T10:20:42Z] <btullis> reboot an-worker1110 after initializing a second replacement drive for T336929