Refactor RAID checks (check-raid)
Closed, Resolved, Public

Description

We currently have a "RAID check" script, check-raid.py, which has a number of conditionals that pick a different code path depending on the RAID detected. While it works and is reasonably easy to hack on, it's fairly limited in many ways. For instance, it assumes that a server can only have one type of RAID, so it's impossible to monitor both e.g. MegaCli and Linux md (software RAID), which is the config the Swift backend boxes have.
Additionally, we're missing all kinds of MegaCli checks, like battery status errors, missing logical drives, predictive errors, configured settings that differ from runtime settings (e.g. the usual "configured WriteBack but active is WriteThrough") or even weird statuses such as battery train schedules. We should do some research into what people do with MegaCli (there's a ton of config options) and incorporate this into our check(s). Care should be taken to do as much work as possible in as few invocations as possible, as RAID controllers are not always happy with being flooded with management commands.
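As an illustration of the "configured vs. runtime" case, here is a minimal sketch that compares the default and current cache policy of each logical drive. It only parses text that has already been captured from a single MegaCli invocation, which keeps the number of controller commands down. The function name and the sample text are assumptions made up for this example (the sample is modelled on typical `-LDInfo` output, not taken from a real box), and this is not what check-raid.py does.

```python
# Hypothetical sample, modelled on typical "MegaCli -LDInfo -LAll -aAll" output.
SAMPLE_LDINFO = """\
Virtual Drive: 0 (Target Id: 0)
State               : Optimal
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Virtual Drive: 1 (Target Id: 1)
State               : Optimal
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""


def find_cache_policy_mismatches(ldinfo_text):
    """Return (drive, configured, active) for drives whose policies differ."""
    mismatches = []
    drive = None
    default = None
    for line in ldinfo_text.splitlines():
        # Each line is "key : value"; split on the first colon only.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Virtual Drive":
            drive = value.split()[0]  # "0 (Target Id: 0)" -> "0"
            default = None
        elif key == "Default Cache Policy":
            default = value
        elif key == "Current Cache Policy" and default is not None:
            if value != default:
                mismatches.append((drive, default, value))
    return mismatches


if __name__ == "__main__":
    for drive, configured, active in find_cache_policy_mismatches(SAMPLE_LDINFO):
        print("VD %s: configured %r but active %r" % (drive, configured, active))
```

The same captured output could feed other checks (drive state, for instance) without going back to the controller.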
Moreover, there is another class of checks that have to do with configuration errors, rather than hardware errors; for example, last week Sean mentioned an issue with one of the database boxes that had Adaptive ReadAhead configured, which apparently is a very bad setting for InnoDB.
On the Linux md front, it'd be nice to kill those DegradedArray cronspam emails that are being sent by mdadm, and rely exclusively on a well-written Icinga check instead.
Finally, for bonus points, we should really add some SMART checks with smartctl, which works for regular disks, SSDs, and the underlying disks in disk arrays (e.g. with -d megaraid,0).
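A rough sketch of what the smartctl side could look like, assuming we only care about the overall health verdict from `smartctl -H`. The helper names are made up for illustration. The two result strings matched below are the ones smartctl normally prints for ATA and SCSI devices respectively, but real output should be checked on the hardware in question before relying on this.

```python
def smartctl_health_command(device, megaraid_id=None):
    """Build the argv for a SMART health query, optionally behind a MegaRAID."""
    cmd = ["smartctl", "-H"]
    if megaraid_id is not None:
        # Address a physical disk behind the controller, e.g. "-d megaraid,0".
        cmd += ["-d", "megaraid,%d" % megaraid_id]
    cmd.append(device)
    return cmd


def health_passed(output):
    """Return True/False for a health verdict, or None if none was found."""
    for line in output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.rsplit(":", 1)[1].strip() == "PASSED"
        if line.startswith("SMART Health Status"):
            return line.split(":", 1)[1].strip() == "OK"
    return None


if __name__ == "__main__":
    print(" ".join(smartctl_health_command("/dev/sda", megaraid_id=0)))
```

Returning None when no verdict line is present lets the caller report UNKNOWN instead of silently treating the disk as healthy.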

Details

Reference
rt7780

Event Timeline

rtimport raised the priority of this task from to Normal. Dec 18 2014, 1:56 AM
rtimport added a project: ops-core.
rtimport set Reference to rt7780.
faidon created this task. Jul 1 2014, 5:26 PM

Not trying to do all this, but in the meantime here's a changeset to monitor all the RAID types on a box, which seems like a pretty basic need right now:
https://gerrit.wikimedia.org/r/#/c/145018/
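To make the idea concrete, here is a minimal sketch of running one checker per detected RAID type and folding the results into a single Nagios/Icinga status. It is not the code in the changeset above: the function name, the checker registry and the sample messages are all hypothetical. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual plugin convention.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Ranking used to pick the "worst" result; UNKNOWN ranks below WARNING here,
# which is a design choice for this sketch, not a rule.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def run_checks(detected, checkers):
    """Run the checker for every detected RAID type and merge the results.

    detected: list of RAID type names found on the host (hypothetical names).
    checkers: dict mapping a type name to a callable returning (status, message).
    """
    results = []
    for raid_type in detected:
        checker = checkers.get(raid_type)
        if checker is None:
            results.append((UNKNOWN, "%s: no checker available" % raid_type))
            continue
        status, message = checker()
        results.append((status, "%s: %s" % (raid_type, message)))
    if not results:
        return OK, "no RAID detected"
    worst = max((status for status, _ in results), key=SEVERITY.get)
    return worst, "; ".join(message for _, message in results)


if __name__ == "__main__":
    # Stub checkers standing in for real md and MegaCli checks.
    checkers = {
        "md": lambda: (OK, "2 arrays optimal"),
        "megacli": lambda: (CRITICAL, "1 failed LD"),
    }
    status, message = run_checks(["megacli", "md"], checkers)
    print(status, message)
```

A box with both a hardware controller and md arrays then reports on both, rather than on whichever type happened to be detected first.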

Dependency by ticket #7916 added by springle

coren added a subscriber: coren. Feb 17 2015, 4:38 PM
coren added a comment. Feb 17 2015, 4:41 PM

It seems to me this should be set to High priority, as it prevented an alarm from triggering when a disk in the primary array of virt1002 failed.

I'm not sure how helpful I'd be with the MegaCli side of things, but I'll try to work on a patch to improve the md monitoring at least.
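For the md side, a minimal sketch of spotting degraded arrays by parsing /proc/mdstat could look like the following. The function name and sample text are assumptions for illustration (the sample mimics the usual mdstat layout, it is not from virt1002), and a real check would also want to handle resyncs and arrays that have no member status line.

```python
import re

# Hypothetical sample in the usual /proc/mdstat layout.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      1953382400 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\S+) : ")
# Matches e.g. "[2/1] [U_]": expected members, active members, per-member map.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of arrays with fewer active members than expected."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        array = ARRAY_RE.match(line)
        if array:
            current = array.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected or "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

An Icinga check built on this would go CRITICAL when the list is non-empty, which covers the case the mdadm DegradedArray emails are currently reporting.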

faidon added a subscriber: Jgreen. Feb 17 2015, 5:20 PM

@Jgreen mentioned that he has a fix for this that's already deployed in frack. Let's wait for him :)

faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon set Security to None.

I asked Jeff the same question without remembering this ticket; anyway, for reference, I'm attaching it.

Restricted Application added a subscriber: Matanya. Jul 21 2015, 3:58 PM

Here's the whole frack module, which also installs the appropriate driver based on which RAID controller(s) are found by a Facter lib.

Change 290986 had a related patch set uploaded (by Faidon Liambotis):
Create raid module to hold RAID monitoring checks

https://gerrit.wikimedia.org/r/290986

Change 290988 had a related patch set uploaded (by Faidon Liambotis):
raid: add a new "raid" fact

https://gerrit.wikimedia.org/r/290988

Change 290999 had a related patch set uploaded (by Faidon Liambotis):
raid: vary package installation on the RAID installed

https://gerrit.wikimedia.org/r/290999

Change 291013 had a related patch set uploaded (by Faidon Liambotis):
raid: setup multiple checks, one per each RAID found

https://gerrit.wikimedia.org/r/291013

Change 290986 merged by Faidon Liambotis:
Create raid module to hold RAID monitoring checks

https://gerrit.wikimedia.org/r/290986

Change 290988 merged by Faidon Liambotis:
raid: add a new "raid" fact

https://gerrit.wikimedia.org/r/290988

Change 290999 merged by Faidon Liambotis:
raid: vary package installation on the RAID installed

https://gerrit.wikimedia.org/r/290999

Change 291013 merged by Faidon Liambotis:
raid: setup multiple checks, one per each RAID found

https://gerrit.wikimedia.org/r/291013

faidon closed this task as Resolved. May 30 2016, 10:00 PM
faidon claimed this task.

This is now all done :)