
Degraded RAID on an-worker1132
Closed, Resolved (Public)

Description

TASK AUTO-GENERATED by Nagios/Icinga RAID event handler

A degraded RAID (megacli) was detected on host an-worker1132. An automatic snapshot of the current RAID status is attached below.

Please sync with the service owner to find the appropriate time window before actually replacing any failed hardware.

CRITICAL: 6 failed LD(s) (Offline, Offline, Offline, Offline, Offline, Offline)

$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
Failed to execute '['/usr/lib/nagios/plugins/check_nrpe', '-4', '-H', 'an-worker1132', '-c', 'get_raid_status_megacli']': RETCODE: 3
STDOUT:
NRPE: Unable to read output

STDERR:
None

Event Timeline

This is showing 6 failed disks. Is it possible that a different problem is causing the disks to fail? I do not see any errors for the RAID controller.

@Cmjohnson I can't think of any reason why six disks should have failed. I think they're all single-disk RAID 0 logical volumes, aren't they?
We've power cycled it a few times without much success, so I think that there is definitely something up, but the data currently on the drives can be recreated from other copies.

Please feel free to do whatever you think best, in terms of replacing the controller, replacing the disks, upgrading firmware, using harsh language with it :-)
This server is effectively out of the cluster at the moment, so it can be booted and shut down at will.
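
For reference, a quick way to double-check the per-disk RAID 0 layout and what the controller sees (just a sketch, assuming MegaCLI and the usual wrapper script are available on the host):

# Summarised RAID status, same output the Icinga check collects
sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
# Full virtual drive list: should show one RAID 0 VD per data disk
sudo megacli -LDInfo -Lall -aAll
# Physical drive states and media error counters
sudo megacli -PDList -aAll | grep -E 'Slot Number|Firmware state|Media Error'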

@Cmjohnson / @Jclark-ctr - maybe we can try upgrading the firmware first if it's outdated? Thanks, Willy

Opened a Dell ticket and sent a SupportAssist report. Confirmed: Service Request 165406278 was successfully submitted.

Submitted a 2nd ticket with Dell. Confirmed: Service Request 165628610 was successfully submitted.

They have not responded to the 1st ticket, except to ask for the address and contact information that was already included when the original ticket was filed with Dell.

Change 906017 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/puppet@production] Decommission an-worker1132 from the Hadoop cluster

https://gerrit.wikimedia.org/r/906017

Updated the backplane firmware; it looks like the errors have been resolved.

elukey subscribed.

@Jclark-ctr hi! I tried to reboot the node and it gets blocked while checking the hard drives, telling me about possible preserved cache etc., but I can't find the usual menu to clear the cache and proceed. Do you mind checking to see what's happening? You can reboot the node anytime. Thanks!!

@elukey The foreign drives have affected both OS drives, so it will need to be reimaged, and it is not letting me clear the configuration. I opened the box and found a loose connection on the backplane. I am not getting anywhere with the PERC H730 configuration utility.

@elukey I was able to clear the foreign status, but it will still need to be reimaged.

@Jclark-ctr thanks! I tried to check the serial console, but I still see the error message about the preserved cache, and I can't really do much in the menu. The main problem with doing the reimage is that it probably cannot boot into the Debian installer :(

@elukey I was able to clear the configurations.
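
For reference (standard MegaCLI syntax; not necessarily the exact steps used here), the foreign configuration and preserved cache are usually inspected and cleared along these lines:

# Check for foreign configurations left over from the failed/reseated drives
sudo megacli -CfgForeign -Scan -a0
# Clear them; safe here since the data can be rebuilt from other HDFS replicas
sudo megacli -CfgForeign -Clear -a0
# List any preserved (pinned) cache from the dead virtual drives
sudo megacli -GetPreservedCacheList -a0
# Discard the preserved cache for a given virtual drive, e.g. L0
sudo megacli -DiscardPreservedCache -L0 -a0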

Cookbook cookbooks.sre.hosts.reimage was started by elukey@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by elukey@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster completed:

  • an-worker1132 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202304111448_elukey_12394_an-worker1132.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Reimaged the node, but I still see 11 4 TB disks instead of 12. megacli shows 12 physical disks but only 11 VDs, so we'll probably need to fix that.

I downtimed the node and stopped hdfs/yarn for the time being.

@Jclark-ctr progress! I was able to reimage, but the two disks in the flex bay appear to be in firmware state Unconfigured(good), Spun Up, so the OS got installed on one of the 4 TB disks. IIRC the two SSDs in the flex bay should be configured as a single RAID 1 device, but I have never done it before (and I guess it needs to be done at BIOS time). Do you have time to configure them? The host is downtimed and not serving traffic, so feel free to work on it anytime. Thanks in advance!
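
For reference, a rough sketch of how the flex bay pair could be checked and mirrored with MegaCLI (the enclosure:slot values below are placeholders and need to be read from -PDList first):

# Confirm the firmware state of the two flex bay SSDs
sudo megacli -PDList -aAll | grep -E 'Enclosure Device ID|Slot Number|Firmware state'
# Create a single RAID 1 virtual drive from the two SSDs
# (replace 32:12 and 32:13 with the real enclosure:slot pairs reported above)
sudo megacli -CfgLdAdd -r1 [32:12,32:13] -a0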

Change 906017 merged by Stevemunene:

[operations/puppet@production] Decommission an-worker1132 from the Hadoop cluster

https://gerrit.wikimedia.org/r/906017

Mentioned in SAL (#wikimedia-analytics) [2023-04-13T08:19:02Z] <steve_munene> Decommission an-worker1132 from the Hadoop cluster for T333091 reimage

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors:

  • an-worker1132 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors:

  • an-worker1132 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors:

  • an-worker1132 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster

I'm looking into this now. Having rebooted the host and gone into the RAID controller setup, I can confirm that we see all 12 data disks and both O/S physical disks present.

image.png (550×788 px, 137 KB)

One virtual disk is present in the config, which is for the O/S.
image.png (540×775 px, 31 KB)

I think what I'll do is put an-worker1132 back into the data_engineering::insetup role while it is being reimaged, then rerun the sre.hadoop.init-hadoop-workers cookbook on it before putting it back into service.

Change 909202 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Place an-worker1132 back into the insetup role

https://gerrit.wikimedia.org/r/909202

Change 909202 merged by Btullis:

[operations/puppet@production] Place an-worker1132 back into the insetup role

https://gerrit.wikimedia.org/r/909202

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster completed:

  • an-worker1132 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202304171032_btullis_1649002_an-worker1132.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

I recreated the logical drives with: for i in $(seq 0 11); do sudo megacli -CfgLdAdd -r0 [32:$i] -a0; done
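
For the record, a quick way to verify the result (a sketch using standard MegaCLI and util-linux commands):

# List the virtual drives: there should be one RAID 0 VD per 4 TB data disk,
# plus the RAID 1 OS volume
sudo megacli -LDInfo -Lall -aAll | grep -E 'Virtual Drive|RAID Level|^Size'
# The kernel should see a matching set of block devices
lsblk -d -o NAME,SIZE,TYPE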

The block devices are now present, but it seems that a partition has been incorrectly created on the /dev/sdb device.

btullis@an-worker1132:~$ lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 446.6G  0 disk 
├─sda1                          8:1    0   953M  0 part /boot
├─sda2                          8:2    0     1K  0 part 
└─sda5                          8:5    0 445.7G  0 part 
  ├─an--worker1132--vg-root   254:0    0  55.9G  0 lvm  /
  ├─an--worker1132--vg-swap   254:1    0   9.3G  0 lvm  [SWAP]
  └─an--worker1132--vg-unused 254:2    0 291.4G  0 lvm  
sdb                             8:16   0   3.7T  0 disk 
├─sdb1                          8:17   0   953M  0 part 
└─sdb2                          8:18   0   3.7T  0 part 
sdc                             8:32   0   3.7T  0 disk 
└─sdc1                          8:33   0   3.7T  0 part 
sdd                             8:48   0   3.7T  0 disk 
└─sdd1                          8:49   0   3.7T  0 part 
sde                             8:64   0   3.7T  0 disk 
└─sde1                          8:65   0   3.7T  0 part 
sdf                             8:80   0   3.7T  0 disk 
└─sdf1                          8:81   0   3.7T  0 part 
sdg                             8:96   0   3.7T  0 disk 
└─sdg1                          8:97   0   3.7T  0 part 
sdh                             8:112  0   3.7T  0 disk 
└─sdh1                          8:113  0   3.7T  0 part 
sdi                             8:128  0   3.7T  0 disk 
└─sdi1                          8:129  0   3.7T  0 part 
sdj                             8:144  0   3.7T  0 disk 
└─sdj1                          8:145  0   3.7T  0 part 
sdk                             8:160  0   3.7T  0 disk 
└─sdk1                          8:161  0   3.7T  0 part 
sdl                             8:176  0   3.7T  0 disk 
└─sdl1                          8:177  0   3.7T  0 part 
sdm                             8:192  0   3.7T  0 disk 
└─sdm1                          8:193  0   3.7T  0 part

I will wipe these partitions with the cookbook.
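
(For reference, the manual equivalent of that wipe is roughly the following, per data disk; the cookbook takes care of it, so this is only a sketch.)

# Remove stale filesystem/LVM signatures from the partitions, then the partition table itself
# (example device only; the same applies to each of /dev/sd[b-m])
sudo wipefs --all /dev/sdb1 /dev/sdb2
sudo wipefs --all /dev/sdb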

This is looking much better now. I ran the cookbook and wiped the existing data, then put the host back into the Hadoop cluster. The only issue now is a failing lvm2-pvscan systemd unit, caused by a leftover duplicate volume group.

btullis@an-worker1132:~$ systemctl status lvm2
lvm2-activation.service   lvm2-lvmpolld.service     lvm2-lvmpolld.socket      lvm2-monitor.service      lvm2-pvscan@8:18.service  lvm2-pvscan@8:5.service   lvm2.service              
btullis@an-worker1132:~$ systemctl status lvm2-pvscan@8\:18.service
● lvm2-pvscan@8:18.service - LVM event activation on device 8:18
   Loaded: loaded (/lib/systemd/system/lvm2-pvscan@.service; static; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2023-04-17 11:26:56 UTC; 1h 49min ago
     Docs: man:pvscan(8)
 Main PID: 5879 (code=exited, status=5)

Apr 17 11:26:56 an-worker1132 systemd[1]: Starting LVM event activation on device 8:18...
Apr 17 11:26:56 an-worker1132 lvm[5879]:   Multiple VGs found with the same name: skipping an-worker1132-vg
Apr 17 11:26:56 an-worker1132 lvm[5879]:   Use --select vg_uuid=<uuid> in place of the VG name.
Apr 17 11:26:56 an-worker1132 systemd[1]: lvm2-pvscan@8:18.service: Main process exited, code=exited, status=5/NOTINSTALLED
Apr 17 11:26:56 an-worker1132 systemd[1]: lvm2-pvscan@8:18.service: Failed with result 'exit-code'.
Apr 17 11:26:56 an-worker1132 systemd[1]: Failed to start LVM event activation on device 8:18.

This is causing a systemd alert in Icinga, but I hope that it will go away after a reboot.
We can see that there is only one PV detected now, which is good.

btullis@an-worker1132:~$ sudo pvscan
  PV /dev/sda5   VG an-worker1132-vg   lvm2 [<445.69 GiB / 89.14 GiB free]
  Total: 1 [<445.69 GiB] / in use: 1 [<445.69 GiB] / in no VG: 0 [0   ]
btullis@an-worker1132:~$
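
Had the duplicate VG stuck around, it could have been pinpointed by UUID as the error message suggests (standard LVM tooling, just for reference):

# A duplicate an-worker1132-vg would show up twice here, each with its own UUID
sudo vgs -o vg_name,vg_uuid
# And this shows which physical volume each copy lives on
sudo pvs -o pv_name,vg_name,vg_uuid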

Host rebooted by btullis@cumin1001 with reason: Rebooting after re-adding to Hadoop

Icinga is green after rebooting, so that's good. Resolving this ticket, but we'll continue to monitor this host for any further hardware issues.

Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host an-worker1132.eqiad.wmnet with OS buster executed with errors:

  • an-worker1132 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • First Puppet run failed, asking the operator what to do
    • The reimage failed, see the cookbook logs for the details