We currently have a "RAID check" script, check-raid.py, which has a number of conditionals that pick a different code path depending on the RAID detected. While it works, and is reasonably easy to hack on, it's fairly limited in many ways. For instance, it assumes that a server can only have one type of RAID, so it's impossible to monitor both e.g. MegaCli and Linux md (software RAID), which is the config the Swift backend boxes have.
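Instead of picking a single code path, a multi-RAID-aware check could detect every subsystem present and run one check per subsystem. A minimal sketch of the detection step (the probe locations here are illustrative assumptions, not what check-raid.py currently does):

```python
import os
import shutil

def detect_raid_types():
    """Return a list of RAID subsystems present on this host.

    Sketch only: a production check would also probe PCI IDs or loaded
    kernel modules; here we just look for /proc/mdstat and a MegaCli binary.
    """
    found = []
    # Linux software RAID exposes its state in /proc/mdstat.
    if os.path.exists("/proc/mdstat"):
        found.append("md")
    # LSI hardware RAID is managed through the MegaCli utility.
    if shutil.which("MegaCli") or shutil.which("megacli"):
        found.append("megacli")
    return found
```

A Swift backend box with both md and a MegaRAID controller would then get both checks instead of whichever one wins the current if/elif chain.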
Additionally, we're missing all kinds of MegaCli checks, like battery status errors, missing logical drives, predictive errors, settings whose configured value differs from the runtime value (e.g. the usual "configured WriteBack but active is WriteThrough"), or even weird statuses such as battery train schedules. We should do some research into what people do with MegaCli (there's a ton of config options) and incorporate this into our check(s). Care should be taken to do as much work as possible in as few invocations as possible, as RAID controllers are not always happy with being flooded with management commands.
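One way to keep the invocation count down: a single `MegaCli -AdpAllInfo -aALL` run already reports degraded/offline virtual drive counts and critical/failed disk counts, so several checks can share one command's output. A sketch of parsing it (the sample is abridged and the exact field names should be verified against our controllers' output):

```python
import re

# Abridged, hypothetical excerpt of `MegaCli -AdpAllInfo -aALL` output.
SAMPLE_ADPINFO = """\
Virtual Drives    : 2
  Degraded        : 1
  Offline         : 0
Physical Devices  : 6
  Disks           : 4
Critical Disks    : 0
Failed Disks      : 1
"""

def parse_adp_all_info(text):
    """Extract the counters we care about from one AdpAllInfo invocation."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*(Degraded|Offline|Critical Disks|Failed Disks)\s*:\s*(\d+)", line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

def adp_problems(counters):
    """Return human-readable problems; an empty list means all is well."""
    return ["%s: %d" % (k, v) for k, v in counters.items() if v > 0]
```

Battery state would need a separate `-AdpBbuCmd` invocation, but the per-drive health checks above can all piggyback on the one command.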
Moreover, there is another class of checks that have to do with configuration errors, rather than hardware errors; for example, last week Sean mentioned an issue with one of the database boxes that had Adaptive ReadAhead configured, which apparently is a very bad setting for InnoDB.
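The configuration-error class could be covered by comparing the "Default Cache Policy" and "Current Cache Policy" lines of `MegaCli -LDInfo -LAll -aAll`, and by flagging settings we never want (Adaptive ReadAhead shows up there as `ReadAdaptive`, if I recall the output format correctly). A sketch, with an invented sample:

```python
# Hypothetical excerpt of `MegaCli -LDInfo -LAll -aAll` output.
SAMPLE_LDINFO = """\
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""

def check_cache_policy(ldinfo_text, forbid=("ReadAdaptive",)):
    """Flag configured-vs-active mismatches and unwanted active settings."""
    policies = {}
    for line in ldinfo_text.splitlines():
        for key in ("Default Cache Policy", "Current Cache Policy"):
            if line.startswith(key + ":"):
                policies[key] = [p.strip() for p in line.split(":", 1)[1].split(",")]
    problems = []
    if policies.get("Default Cache Policy") != policies.get("Current Cache Policy"):
        problems.append("configured cache policy differs from active policy")
    for setting in forbid:
        if setting in policies.get("Current Cache Policy", []):
            problems.append("unwanted setting active: " + setting)
    return problems
```

The `forbid` list would be per-role: the InnoDB boxes would forbid `ReadAdaptive`, other roles might not care.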
On the Linux md front, it'd be nice to kill those DegradedArray cronspam emails that are being sent by mdadm, and rely exclusively on a well-written Icinga check instead.
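An Icinga check for this only needs to parse /proc/mdstat: each array prints a member-status string like `[UU]`, and any `_` in it marks a missing or failed device, which is exactly what the DegradedArray mails warn about. A sketch with a made-up two-array sample:

```python
import re

# Hypothetical /proc/mdstat with one healthy and one degraded array.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid10]
md0 : active raid1 sda1[0] sdb1[1]
      976630336 blocks super 1.2 [2/2] [UU]
md1 : active raid10 sdc1[0] sdd1[1](F) sde1[2] sdf1[3]
      1953260544 blocks super 1.2 512K chunks 2 near-copies [4/3] [U_UU]
"""

def degraded_md_arrays(mdstat_text):
    """Return the names of md arrays with missing or failed members."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded
```

With an Icinga check like this in place, the cronspam could then be silenced, e.g. by pointing mdadm's MAILADDR (in /etc/mdadm/mdadm.conf) somewhere harmless.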
Finally, for bonus points, we should really add some SMART checks with smartctl, which work for regular disks and SSDs, as well as for the underlying disks in disk arrays (e.g. with -d megaraid,0).
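The simplest SMART check would just look at the overall-health line of `smartctl -H /dev/sdX` (or `smartctl -H -d megaraid,N /dev/sdX` for the N-th disk behind a MegaRAID controller). A sketch of evaluating that output:

```python
def smart_health_ok(smartctl_output):
    """True if the `smartctl -H` overall-health line reports PASSED."""
    for line in smartctl_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rstrip().endswith("PASSED")
    # No health line at all is also a failure worth alerting on.
    return False

SAMPLE_SMART = "SMART overall-health self-assessment test result: PASSED\n"
```

A fuller check would also watch individual attributes (reallocated sectors, pending sectors), but the health summary alone is already better than nothing.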
Description
Details
- Reference: rt7780
Status | Assigned | Task
---|---|---
Open | None | T294906 Puppet Improvements 2021/2022
Open | jbond | T265138 Work required to prepare for puppet 6
Open | None | T273673 replace all puppet crons with systemd timers
Open | None | T132324 Tracking and Reducing cron-spam to root@
Resolved | jcrespo | T84178 investigate RAID BBU auto-learn on db hosts
Resolved | faidon | T84050 Refactor RAID checks (check-raid)
Open | None | T83476 Icinga RAID check: monitor rebuild status
Resolved | faidon | T97998 Add RAID monitoring for HP servers
Resolved | herron | T141252 icinga hp raid check timeout on busy ms-be and db machines
Resolved | herron | T172921 Nrpe command_timeout and "Service Check Timed Out" errors
Event Timeline
On Tue, Jul 01 17:26:07 2014, faidon filed this task with the description above.
Not trying to do all of this, but in the meantime here's a changeset to monitor all the RAID types on a box, which seems like a pretty basic need right now:
https://gerrit.wikimedia.org/r/#/c/145018/
It seems to me this should be set to High priority, as it prevented an alarm from triggering when a disk in the primary array of virt1002 failed.
I'm not sure how helpful I'd be with the Megacli side of things, but I'll try to work in a patch to improve the md monitoring at least.
@Jgreen mentioned that he has a fix for this that's already deployed in frack. Let's wait for him :)
I asked Jeff the same question without remembering this ticket; anyway, for reference I'm attaching it.
Here's the whole frack module, which also installs the appropriate driver based on which RAID controller(s) are found by a Facter lib.
Change 290986 had a related patch set uploaded (by Faidon Liambotis):
Create raid module to hold RAID monitoring checks
Change 290988 had a related patch set uploaded (by Faidon Liambotis):
raid: add a new "raid" fact
Change 290999 had a related patch set uploaded (by Faidon Liambotis):
raid: vary package installation on the RAID installed
Change 291013 had a related patch set uploaded (by Faidon Liambotis):
raid: setup multiple checks, one per each RAID found
Change 290986 merged by Faidon Liambotis:
Create raid module to hold RAID monitoring checks
Change 290999 merged by Faidon Liambotis:
raid: vary package installation on the RAID installed
Change 291013 merged by Faidon Liambotis:
raid: setup multiple checks, one per each RAID found