
New RAID alerts (e.g. WARNING: unexpectedly checked no devices)
Closed, Resolved, Public

Description

16 different cloudvirts (e.g. cloudvirt1030.eqiad.wmnet) are currently alerting with MD RAID: WARNING: unexpectedly checked no devices.

A couple of other hosts (cloudcontrol100[67]) are alerting with 'Dell PowerEdge RAID Controller: Failed to execute ['/usr/local/lib/nagios/plugins/get-raid-status-perccli']: KeyError 'VD LIST''

I suspect this is related to some recent cleanup patches:

commit 59e9828f4e75102cf6358ae303887c22390190e9
Author: Moritz Mühlenhoff <mmuhlenhoff@wikimedia.org>
Date:   Thu Sep 8 14:15:23 2022 +0200

    smart: Also use new raid_mgmt_tools fact
    
    Bug: T313312
    Change-Id: I616808726f3096afda587dc17695470ae3dbd580


commit 7934e81824edf74efc50ee802fbd751b5b0eceb3
Author: Moritz Mühlenhoff <mmuhlenhoff@wikimedia.org>
Date:   Thu Sep 8 14:44:34 2022 +0200

    raid::perccli: Run the correct monitoring tool
    
    check-raid.py is a legacy tool which for more recent controllers has been
    superseded by specific Icinga monitoring plugins.
    
    Bug: T315608
    Change-Id: I457738535b44342750822bfb371570097f8bc64c

Event Timeline

The error on the cloudcontrol hosts is caused by the new perccli script; Simon is looking into a patch.

The warning on the cloudvirts is actually expected (the previous check was too lax and didn't spot this): for some reason they have mdadm installed, but there are no /dev/mdX devices configured. I suggest you remove mdadm; after the reboots for https://phabricator.wikimedia.org/T317391 the /proc/mdstat file will vanish and this will recover.
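For illustration, here is a minimal sketch of the kind of logic that produces this warning, assuming the check reads /proc/mdstat the way common md plugins do. This is not the actual WMF plugin; everything except the warning text itself is an assumption.

#!/usr/bin/env python3
# Minimal sketch, not the actual plugin: warn when the md subsystem is
# present (mdadm installed, /proc/mdstat exists) but no arrays are
# configured, and escalate when an array is degraded.
import os
import re
import sys

MDSTAT = '/proc/mdstat'

def main():
    if not os.path.exists(MDSTAT):
        # No md driver loaded at all: nothing to monitor.
        print('OK: no software RAID configured')
        return 0
    with open(MDSTAT) as f:
        content = f.read()
    # Lines like "md0 : active raid1 sda1[0] sdb1[1]" describe arrays.
    arrays = re.findall(r'^(md\d+)\s*:', content, re.MULTILINE)
    if not arrays:
        # mdadm is installed but no /dev/mdX arrays exist: the check
        # ran, yet found nothing to check.
        print('WARNING: unexpectedly checked no devices')
        return 1
    # "[UU]" means all members up; an underscore like "[U_]" means a
    # missing or failed member.
    if re.search(r'\[U*_[U_]*\]', content):
        print('CRITICAL: degraded md array(s) found')
        return 2
    print('OK: %d md array(s) healthy' % len(arrays))
    return 0

if __name__ == '__main__':
    sys.exit(main())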

Change 831057 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] C:raid::perccli handle case with no virtual devices.

https://gerrit.wikimedia.org/r/831057

Change 831057 merged by Slyngshede:

[operations/puppet@production] C:raid::perccli handle case with no virtual devices.

https://gerrit.wikimedia.org/r/831057

The servers with the PERC controllers are now happy and report no errors. The controllers are currently configured for JBOD, meaning there are no virtual disks/RAID arrays, and the perccli tool then simply omits the VD LIST from its output.

Physical disks and battery status are still actively monitored. We only removed the virtual disk check in cases where no RAID arrays are configured.
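As a sketch of the kind of guard that change adds (this is not the actual get-raid-status-perccli script; the JSON layout and command line are assumptions based on the KeyError in the alert and typical "perccli ... show all J" output):

#!/usr/bin/env python3
# Sketch only, not the real get-raid-status-perccli: treat a missing
# 'VD LIST' key as "no virtual disks configured" (JBOD) instead of
# crashing with a KeyError.
import json
import subprocess
import sys

# Hypothetical invocation; the real plugin may call perccli differently.
PERCCLI_CMD = ['perccli64', '/call', 'show', 'all', 'J']

def virtual_disks(raw_json):
    """Return all virtual disks, or [] when the controllers are in
    JBOD mode and expose no VD LIST at all."""
    data = json.loads(raw_json)
    vds = []
    for controller in data.get('Controllers', []):
        response = controller.get('Response Data', {})
        # Previously: response['VD LIST'] -> KeyError 'VD LIST' on JBOD.
        vds.extend(response.get('VD LIST', []))
    return vds

def main():
    raw = subprocess.check_output(PERCCLI_CMD, text=True)
    vds = virtual_disks(raw)
    if not vds:
        # JBOD: no RAID arrays to check; physical disks and battery
        # are still covered by their own checks.
        print('OK: no virtual disks configured (JBOD)')
        return 0
    bad = [vd.get('DG/VD') for vd in vds if vd.get('State') != 'Optl']
    if bad:
        print('CRITICAL: non-optimal virtual disk(s): %s' % ', '.join(map(str, bad)))
        return 2
    print('OK: %d virtual disk(s) optimal' % len(vds))
    return 0

if __name__ == '__main__':
    sys.exit(main())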

MoritzMuehlenhoff added a subscriber: SLyngshede-WMF.

Reassigning to Andrew to fix up the cloudvirt configs

mdadm is included as a dependency of libguestfs-tools, which is installed because of

https://phabricator.wikimedia.org/T215423

Since those were installed for troubleshooting, we can surely live without them, but this seems like a weird fix.

Change 832977 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] Cloudvirts: remove libguestfs-tools dependency

https://gerrit.wikimedia.org/r/832977

Change 832977 merged by Andrew Bogott:

[operations/puppet@production] Cloudvirts: remove libguestfs-tools dependency

https://gerrit.wikimedia.org/r/832977

The warning on the cloudvirts is actually expected (the previous check was too lax and didn't spot this): for some reason they have mdadm installed, but there are no /dev/mdX devices configured. I suggest you remove mdadm; after the reboots for https://phabricator.wikimedia.org/T317391 the /proc/mdstat file will vanish and this will recover.

I have now done the above and rebooted cloudvirt1026 but the same warning is still present.

root@cloudvirt1026:~# ls /proc/mdstat 
ls: cannot access '/proc/mdstat': No such file or directory

I have now done the above and rebooted cloudvirt1026 but the same warning is still present.

Untrue! Manually rerunning the test didn't clear things but I see now a day later that the alert has cleared.

Here's what happens when I remove mdadm (and the packages that depend on it) from newer Dell servers (tested on cloudvirt1053 and 1052):

# dpkg --purge mdadm libguestfs0:amd64 libguestfs-perl libguestfs-tools
(Reading database ... 89049 files and directories currently installed.)
Removing libguestfs-tools (1:1.44.0-2) ...
Purging configuration files for libguestfs-tools (1:1.44.0-2) ...
Removing libguestfs-perl (1:1.44.0-2) ...
Removing libguestfs0:amd64 (1:1.44.0-2) ...
Removing mdadm (4.1-11) ...
update-initramfs: deferring update (trigger activated)
Purging configuration files for mdadm (4.1-11) ...
Processing triggers for man-db (2.9.4-2) ...
Processing triggers for libc-bin (2.31-13+deb11u3) ...
Processing triggers for initramfs-tools (0.140) ...
update-initramfs: Generating /boot/initrd.img-5.10.0-17-amd64
W: Possible missing firmware /lib/firmware/tigon/tg3_tso5.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3_tso.bin for module tg3
W: Possible missing firmware /lib/firmware/tigon/tg3.bin for module tg3

After that the host will never boot again.

ALERT!  /dev/mapper/vg0-root does not exist.  Dropping to a shell!

Reimaging the host seems to get it running and also not alerting. I would not like to reimage every single cloudvirt though :(

There are two overlapping problems here: the cloudvirt* hosts use various different partman setups. The older ones (the former labvirt*) use HW RAID, so the new RAID detection correctly spotted that mdadm was incorrectly installed there. But all systems >= 1031 use software RAID (partman/raid1-2dev.cfg), so obviously mdadm must not be removed on those; without mdadm in the initramfs the md array underneath the root volume group cannot be assembled, which explains why the host above dropped to a shell complaining about /dev/mapper/vg0-root.
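An illustrative guard for that distinction (not part of the actual change): only treat mdadm as removable when /proc/mdstat lists no active arrays.

#!/usr/bin/env python3
# Illustrative only: refuse to purge mdadm when active md arrays exist,
# because an initramfs rebuilt without mdadm cannot assemble the array
# the root volume group lives on.
import re
import sys

def active_md_arrays(path='/proc/mdstat'):
    try:
        with open(path) as f:
            content = f.read()
    except FileNotFoundError:
        return []
    # Lines like "md0 : active raid1 sda2[0] sdb2[1]" indicate arrays in use.
    return re.findall(r'^(md\d+)\s*:\s*active', content, re.MULTILINE)

if __name__ == '__main__':
    arrays = active_md_arrays()
    if arrays:
        print('Active software RAID (%s): keep mdadm installed' % ', '.join(arrays))
        sys.exit(1)
    print('No active md arrays: mdadm is safe to remove')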

Oooh that makes sense. OK, I'll try adjusting puppet accordingly and see what I get :) Thanks!

All resolved except cloudvirt1019 and 1020, which can't be rebooted (but will be decom'd sometime soon).