icinga raid monitoring inoperable for H750 controllers
Closed, Resolved | Public

Description

MegaCLI monitoring doesn't work for the new generation of Dell PERC H750 controllers.

Support for the binaries is being rolled out in a private repo via the parent of this task.

This task should track the fix for icinga monitoring. An example of the bad monitoring check can be viewed for dumpsdata1007, which has virtual disks but megacli reports none: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=dumpsdata1007

Event Timeline

RobH added a project: observability.
RobH added a subscriber: ArielGlenn.

T308027 tracks the private repo deployment, but I didn't see anything to track the fix for icinga monitoring for the new PERC H750 controllers.

I've added observability, but I'm not sure that is correct.

RobH added a subscriber: MoritzMuehlenhoff.

It's not broken, it's just not yet implemented :-) https://gerrit.wikimedia.org/r/c/operations/puppet/+/812250 is the main patch, but it first needs a revised raid fact which is handled at https://phabricator.wikimedia.org/T313312

RobH renamed this task from icinga raid montioring broken for H750 controllers to icinga raid montioring inoperable for H750 controllers. Aug 22 2022, 3:22 PM

Thanks for the update! This was raised as a concern when I handed off dumpsdata1007 for use in service, noting that it didn't yet have accurate RAID monitoring.

I feel a bit queasy about having a server in production without the ability to monitor the raid; what do folks think about this?

jbond triaged this task as High priority. Sep 6 2022, 2:42 PM
jbond subscribed.

Change 830578 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] raid: fix raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/830578

Volans renamed this task from icinga raid montioring inoperable for H750 controllers to icinga raid monitoring inoperable for H750 controllers. Sep 7 2022, 10:30 AM

Change 830578 merged by Jbond:

[operations/puppet@production] raid: fix raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/830578

Change 825369 had a related patch set uploaded (by Jbond; author: Muehlenhoff):

[operations/puppet@production] Initially adapt perccli to use the new raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/825369

Change 830645 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Fix entries in raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/830645

Change 830645 merged by Muehlenhoff:

[operations/puppet@production] Fix entries in raid_mgmt_tools fact

https://gerrit.wikimedia.org/r/830645

Change 825369 merged by Muehlenhoff:

[operations/puppet@production] Switch to the new raid_mgmt_tools fact to enable RAID tools

https://gerrit.wikimedia.org/r/825369

Change 830860 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] raid::perccli: Run the correct monitoring tool

https://gerrit.wikimedia.org/r/830860

Change 830860 merged by Muehlenhoff:

[operations/puppet@production] raid::perccli: Run the correct monitoring tool

https://gerrit.wikimedia.org/r/830860

The servers with PERC H750 controllers are now correctly detected by Puppet, and the respective new monitoring script is run against them. The check is called "Dell PowerEdge RAID Controller" and can be seen, for example, at stat1009.

One thing still pending is the adaptation of the RAID handler, so that Phabricator tasks are opened automatically if a disk/controller fails.

fgiunchedi subscribed.

Adding back infra foundations and SRE here, though leaving out o11y since I don't think there's anything actionable for us at this time.