In https://phabricator.wikimedia.org/T280132, a defective disk was replaced, and now we can re-add the disk to the host on the software side.
Related task: T280132: Degraded RAID on an-worker1100
@elukey I was following the instructions at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk but I got a nonzero exit code.
Based on this comment:
# From the previous commands you should be able to fill in the variables
# with the values of the disk's properties indicated below:
# X => Enclosure Device ID
# Y => Slot Number
# Z => Controller (Adapter) number
megacli -PDMakeGood -PhysDrv[X:Y] -aZ
And the output:
Adapter #0
...
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Online, Spun Up
For step 6:
Add the single disk RAID0 array (use the details from the steps above):
sudo megacli -CfgLdAdd -r0 [32:0] -a0
I ran sudo megacli -CfgLdAdd -r0 [32:11] -a0 but got output:
Exit Code: 0x1a
And I see an odd number of disks, which makes me think the disk is still missing:
razzi@an-worker1100:~$ ls /dev/sd? | wc -w
23
The MegaCLI error list at https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages says:
0x1a Maximum LDs are already configured
Do you know what that means?
@razzi you have the wrong slot: it is slot 10 :)
Enclosure Device ID: 32
Slot Number: 10
Enclosure position: 1
Device Id: 10
WWN: 5000c500c9829a03
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 512
Firmware state: Unconfigured(good), Spun Up <==================
Check the output of sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state". "Online" is the regular state; the replaced disk is the one that shows "Unconfigured(good)".
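For anyone repeating this, picking out the disk that is not in the regular state can be scripted. The following is a minimal sketch, not part of the wikitech procedure: the function names `parse_pdlist` and `not_online` are hypothetical, the sample text is assembled from the values quoted in this thread, and it assumes that megacli prints one field per line.

```python
import re

def parse_pdlist(text):
    """Collect adapter, enclosure, slot and firmware state per physical drive.

    Assumes (not verified against every MegaCLI version) that each field
    sits on its own line, as in the output quoted in this thread.
    """
    drives = []
    adapter = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Adapter #(\d+)", line)
        if m:
            adapter = int(m.group(1))
            continue
        m = re.match(r"Enclosure Device ID:\s*(\S+)", line)
        if m:
            # A new "Enclosure Device ID" line starts a new drive record.
            current = {"adapter": adapter, "enclosure": m.group(1)}
            drives.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"Slot Number:\s*(\d+)", line)
        if m:
            current["slot"] = int(m.group(1))
            continue
        m = re.match(r"Firmware state:\s*(.+)", line)
        if m:
            current["state"] = m.group(1).strip()
    return drives

def not_online(drives):
    """Return the drives whose firmware state does not start with 'Online'."""
    return [d for d in drives if not d.get("state", "").startswith("Online")]

# Sample built from the two drives mentioned in this thread.
SAMPLE = """\
Adapter #0
Enclosure Device ID: 32
Slot Number: 10
Firmware state: Unconfigured(good), Spun Up
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Online, Spun Up
"""

for d in not_online(parse_pdlist(SAMPLE)):
    print(f"[{d['enclosure']}:{d['slot']}] on adapter {d['adapter']}: {d['state']}")
```

On a real host the text would come from the `sudo megacli -PDList -aAll` command above. The `[enclosure:slot]` pair it reports is the one to pass to `-CfgLdAdd`, here `[32:10]` rather than `[32:11]`.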
Ok, it looks like everything is working here, but disk usage is still at 0%:
NAME FSTYPE LABEL    UUID                                 FSAVAIL FSUSE% MOUNTPOINT
...
sdl1 ext4   hadoop-k cb58c727-dec9-4abf-8b21-3d70a6443b6d    1.8T     0% /var/lib/hadoop/data/k
Is that expected because it will take time to transfer data, or should the disk be filling up already?
Yep it takes a bit! If the datanode got the new config you'll see more data in the upcoming days :)
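To watch the new disk fill up over the following days, a small check like the one below could be run against the mount point. This is only a sketch: `used_percent` is a hypothetical helper, and the figure it reports comes from Python's `shutil.disk_usage`, so it may not match the FSUSE% column exactly.

```python
import shutil

def used_percent(path):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero; avoid dividing by it.
        return 0.0
    return 100.0 * usage.used / usage.total

# Demo on the current directory; on the worker the path of interest
# would be /var/lib/hadoop/data/k from the listing above.
print(f"{used_percent('.'):.1f}% used")
```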