
New RAID alerts (e.g. WARNING: unexpectedly checked no devices)
Closed, Resolved, Public

Description

16 different cloudvirts (e.g. cloudvirt1030.eqiad.wmnet) are currently alerting with MD RAID: WARNING: unexpectedly checked no devices.

A couple of other hosts (cloudcontrol100[67]) are alerting with 'Dell PowerEdge RAID Controller: Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'VD LIST''

I suspect this is related to some recent cleanup patches:

commit 59e9828f4e75102cf6358ae303887c22390190e9
Author: Moritz Mühlenhoff <mmuhlenhoff@wikimedia.org>
Date:   Thu Sep 8 14:15:23 2022 +0200

    smart: Also use new raid_mgmt_tools fact
    
    Bug: T313312
    Change-Id: I616808726f3096afda587dc17695470ae3dbd580


commit 7934e81824edf74efc50ee802fbd751b5b0eceb3
Author: Moritz Mühlenhoff <mmuhlenhoff@wikimedia.org>
Date:   Thu Sep 8 14:44:34 2022 +0200

    raid::perccli: Run the correct monitoring tool
    
    check-raid.py is a legacy tool which for more recent controllers has been
    superseded by specific Icinga monitoring plugins.
    
    Bug: T315608
    Change-Id: I457738535b44342750822bfb371570097f8bc64c

Event Timeline

The error on the cloudcontrol hosts is caused by the new perccli script; Simon is looking into a patch.

The warning on the cloudvirts is actually expected (the previous check was too lax and didn't spot this): for some reason they have mdadm installed, but there are no /dev/mdX devices configured. I suggest you remove mdadm; after the reboots for https://phabricator.wikimedia.org/T317391 the /proc/mdstat file will vanish and this will recover.
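For illustration, here is a minimal sketch of the kind of logic that produces this warning, assuming the check reads /proc/mdstat the way common md plugins do. This is not the actual WMF plugin; everything except the warning text itself is an assumption.

#!/usr/bin/env python3
# Minimal sketch, not the actual plugin: warn when the md subsystem is
# present (mdadm installed, /proc/mdstat exists) but no arrays are
# configured, and escalate when an array is degraded.
import os
import re
import sys

MDSTAT = '/proc/mdstat'

def main():
    if not os.path.exists(MDSTAT):
        # No md driver loaded at all: nothing to monitor.
        print('OK: no software RAID configured')
        return 0
    with open(MDSTAT) as f:
        content = f.read()
    # Lines like "md0 : active raid1 sda1[0] sdb1[1]" describe arrays.
    arrays = re.findall(r'^(md\d+)\s*:', content, re.MULTILINE)
    if not arrays:
        # mdadm is installed but no /dev/mdX arrays exist: the check
        # ran, yet found nothing to check.
        print('WARNING: unexpectedly checked no devices')
        return 1
    # "[UU]" means all members up; an underscore like "[U_]" means a
    # missing or failed member.
    if re.search(r'\[U*_[U_]*\]', content):
        print('CRITICAL: degraded md array(s) found')
        return 2
    print('OK: %d md array(s) healthy' % len(arrays))
    return 0

if __name__ == '__main__':
    sys.exit(main())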

Change 831057 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] C:raid::perccli handle case with no virtual devices.

https://gerrit.wikimedia.org/r/831057

Change 831057 merged by Slyngshede:

[operations/puppet@production] C:raid::perccli handle case with no virtual devices.

https://gerrit.wikimedia.org/r/831057

The servers with the PERC controllers are now happy and report no errors. The controllers are currently configured for JBOD, meaning there are no virtual disks/RAID arrays, and the perccli tool then simply omits the VD LIST from its output.

Physical disks and battery status are still actively monitored. We only removed the virtual disk check in cases where no RAID arrays are configured.
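As a sketch of the kind of guard that change adds (this is not the actual get-raid-status-perccli script; the JSON layout and command line are assumptions based on the KeyError in the alert and typical "perccli ... show all J" output):

#!/usr/bin/env python3
# Sketch only, not the real get-raid-status-perccli: treat a missing
# 'VD LIST' key as "no virtual disks configured" (JBOD) instead of
# crashing with a KeyError.
import json
import subprocess
import sys

# Hypothetical invocation; the real plugin may call perccli differently.
PERCCLI_CMD = ['perccli64', '/call', 'show', 'all', 'J']

def virtual_disks(raw_json):
    """Return all virtual disks, or [] when the controllers are in
    JBOD mode and expose no VD LIST at all."""
    data = json.loads(raw_json)
    vds = []
    for controller in data.get('Controllers', []):
        response = controller.get('Response Data', {})
        # Previously: response['VD LIST'] -> KeyError 'VD LIST' on JBOD.
        vds.extend(response.get('VD LIST', []))
    return vds

def main():
    raw = subprocess.check_output(PERCCLI_CMD, text=True)
    vds = virtual_disks(raw)
    if not vds:
        # JBOD: no RAID arrays to check; physical disks and battery
        # are still covered by their own checks.
        print('OK: no virtual disks configured (JBOD)')
        return 0
    bad = [vd.get('DG/VD') for vd in vds if vd.get('State') != 'Optl']
    if bad:
        print('CRITICAL: non-optimal virtual disk(s): %s' % ', '.join(map(str, bad)))
        return 2
    print('OK: %d virtual disk(s) optimal' % len(vds))
    return 0

if __name__ == '__main__':
    sys.exit(main())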

MoritzMuehlenhoff added a subscriber: SLyngshede-WMF.

Reassigning to Andrew to fix up the cloudvirt configs

mdadm is included as a dependency of libguestfs-tools, which is installed because of

https://phabricator.wikimedia.org/T215423

Since those were installed for troubleshooting, we can surely live without them, but this seems like a weird fix.

Change 832977 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloudvirts: remove libguestfs-tools dependency

https://gerrit.wikimedia.org/r/832977

Change 832977 merged by Andrew Bogott:

[operations/puppet@production] Cloudvirts: remove libguestfs-tools dependency

https://gerrit.wikimedia.org/r/832977

The warning on the cloudvirts is actually expected (the previous check was too lax and didn't spot this): for some reason they have mdadm installed, but there are no /dev/mdX devices configured. I suggest you remove mdadm; after the reboots for https://phabricator.wikimedia.org/T317391 the /proc/mdstat file will vanish and this will recover.

I have now done the above and rebooted cloudvirt1026 but the same warning is still present.

root@cloudvirt1026:~# ls /proc/mdstat 
ls: cannot access '/proc/mdstat': No such file or directory

I have now done the above and rebooted cloudvirt1026 but the same warning is still present.

Untrue! Manually rerunning the test didn't clear things but I see now a day later that the alert has cleared.

Here's what happens when I remove mdadm (and the packages that depend on it) from newer Dell servers (tested on cloudvirt1053 and 1052):

# dpkg --purge mdadm libguestfs0:amd64 libguestfs-perl libguestfs-tools
(Reading database ... 89049 files and directories currently installed.)
Removing libguestfs-tools (1:1.44.0-2) ...
Purging configuration files for libguestfs-tools (1:1.44.0-2) ...
Removing libguestfs-perl (1:1.44.0-2) ...
Removing libguestfs0:amd64 (1:1.44.0-2) ...
Removing mdadm (4.1-11) ...
update-initramfs: deferring update (trigger activated)
Purging configuration files for mdadm (4.1-11) ...
Processing triggers for man-db (2.9.4-2) ...
Processing triggers for libc-bin (2.31-13+deb11u3) ...
Processing triggers for initramfs-tools (0.140) ...
update-initramfs: Generating /boot/initrd.img-5.10.0-17-amd64
W: Possible missing firmware /lib/firmware/tigon/tg3_tso5.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3_tso.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3.bin for module tg3

After that the host will never boot again.

ALERT!  /dev/mapper/vg0-root does not exist.  Dropping to a shell!

Reimaging the host seems to get it running and also not alerting. I would not like to reimage every single cloudvirt though :(

There are two overlapping problems here: the cloudvirt* hosts use various different partman setups. The older ones (the former labvirt*) use HW RAID, so the new RAID detection correctly spotted that mdadm was incorrectly installed there. But all systems >= 1031 use software RAID (partman/raid1-2dev.cfg), so obviously mdadm must not be removed on those; without mdadm in the initramfs the md array underneath the root volume group cannot be assembled, which explains why the host above dropped to a shell complaining about /dev/mapper/vg0-root.
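An illustrative guard for that distinction (not part of the actual change): only treat mdadm as removable when /proc/mdstat lists no active arrays.

#!/usr/bin/env python3
# Illustrative only: refuse to purge mdadm when active md arrays exist,
# because an initramfs rebuilt without mdadm cannot assemble the array
# the root volume group lives on.
import re
import sys

def active_md_arrays(path='/proc/mdstat'):
    try:
        with open(path) as f:
            content = f.read()
    except FileNotFoundError:
        return []
    # Lines like "md0 : active raid1 sda2[0] sdb2[1]" indicate arrays in use.
    return re.findall(r'^(md\d+)\s*:\s*active', content, re.MULTILINE)

if __name__ == '__main__':
    arrays = active_md_arrays()
    if arrays:
        print('Active software RAID (%s): keep mdadm installed' % ', '.join(arrays))
        sys.exit(1)
    print('No active md arrays: mdadm is safe to remove')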

Oooh that makes sense. OK, I'll try adjusting puppet accordingly and see what I get :) Thanks!

All resolved except cloudvirt1019 and 1020, which can't be rebooted (but will be decom'd sometime soon).