
Re-add disk to an-worker1100
Closed, Resolved · Public

Description

In https://phabricator.wikimedia.org/T280132, a defective disk was replaced, and now we can re-add the disk to the host on the software side.

Event Timeline

@elukey I was following the instructions at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk but I got a nonzero exit code.

Based on this comment:

# From the previous commands you should be able to fill in the variables 
# with the values of the disk's properties indicated below:
# X => Enclosure Device ID
# Y => Slot Number
# Z => Controller (Adapter) number
megacli -PDMakeGood -PhysDrv[X:Y] -aZ

And the output:

Adapter #0
...
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Online, Spun Up

For step 6:

Add the single disk RAID0 array (use the details from the steps above):
sudo megacli -CfgLdAdd -r0 [32:0] -a0

I ran sudo megacli -CfgLdAdd -r0 [32:11] -a0 but got this output:

Exit Code: 0x1a

And I see an odd number of disks, which makes me think the disk is still missing:

razzi@an-worker1100:~$ ls /dev/sd? | wc -w
23

The error table at https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages lists:

0x1a Maximum LDs are already configured

Do you know what that means?

@razzi you have the wrong slot: it is slot 10, not 11 :)

Enclosure Device ID: 32
Slot Number: 10
Enclosure position: 1
Device Id: 10
WWN: 5000c500c9829a03
Sequence Number: 7
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA

Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Sector Size:  512
Logical Sector Size:  512
Physical Sector Size:  512
Firmware state: Unconfigured(good), Spun Up   <==================

Check the output of sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state". "Online, Spun Up" is the regular state; the replaced disk still shows "Unconfigured(good)".
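
To avoid picking the slot by eye, the PDList output can be filtered down to the disks that are not Online. This is only a sketch: pd_not_online is a hypothetical helper name (not part of megacli), and it assumes the output contains the "Adapter #", "Enclosure Device ID:", "Slot Number:" and "Firmware state:" lines in the form quoted in this task.

```shell
# Hypothetical helper, not part of megacli.
# Reads `megacli -PDList -aAll` output on stdin and prints one line per
# physical disk whose firmware state does not start with "Online",
# in the form "[enclosure:slot] -aADAPTER: state".
pd_not_online() {
  awk -F': *' '
    # Remember the adapter number from lines such as "Adapter #0".
    /^Adapter #/            { sub(/^Adapter #/, ""); adapter = $0 }
    # Remember the most recent enclosure and slot seen.
    /^Enclosure Device ID:/ { enc = $2 }
    /^Slot Number:/         { slot = $2 }
    # Report the disk if its state is anything other than Online.
    /^Firmware state:/ {
      if ($2 !~ /^Online/) print "[" enc ":" slot "] -a" adapter ": " $2
    }
  '
}
```

Usage would be sudo megacli -PDList -aAll | pd_not_online. With the states quoted in this task, it should print only the slot 10 disk, which suggests [32:10] and -a0 as the values for the CfgLdAdd step.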

Ok, it looks like everything is working here, but disk usage is still at 0%:

NAME            FSTYPE            LABEL           UUID                                   FSAVAIL FSUSE% MOUNTPOINT
...
sdl1            ext4              hadoop-k        cb58c727-dec9-4abf-8b21-3d70a6443b6d      1.8T     0% /var/lib/hadoop/data/k

Is that expected because the data takes time to transfer, or should the disk be filling up already?

Yep it takes a bit! If the datanode got the new config you'll see more data in the upcoming days :)
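
One way to watch for that is to read the usage percentage of the mount point from time to time. A minimal sketch, assuming a POSIX df; mount_use_pct is a hypothetical helper name, and the parsing assumes the filesystem name contains no spaces.

```shell
# Hypothetical helper: print the usage percentage (digits only) of the
# filesystem that holds the given path, using the POSIX `df -P` layout
# (Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on).
mount_use_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For the disk in this task the call would be mount_use_pct /var/lib/hadoop/data/k; a value that rises over the following days would indicate that the datanode is writing to the new disk.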

I checked and the disk is filling up; this can be closed.