
check_hpssacli should report on battery failures and cache disabled
Closed, Resolved, Public

Description

As discovered in T163777, check_hpssacli failed to report a "no battery" situation and (consequently?) a permanently disabled cache.
It looks like the battery status isn't reported at all by show status (though it is by show detail) when the battery isn't there:

root@ms-be1022:~# hpssacli controller slot=3 show status

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK

root@ms-be1021:~# hpssacli controller slot=3 show status

Smart Array P840 in Slot 3
   Controller Status: OK
   Cache Status: Permanently Disabled

Looking at the code, check_hpssacli skips checking the cache status because it checks each logical drive's acceleration method instead. I think that in this case something was wrong with the battery, and therefore with the cache, so it should have been caught.
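
For illustration, a minimal sketch of how the two outputs above could be told apart. This is not the code from check_hpssacli or from the patches below. The function names, the severity chosen for each condition, and the assumption that every controller is expected to report a battery/capacitor line are all hypothetical.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def parse_status(output):
    """Collect the '<name> Status: <value>' lines of 'show status' output."""
    status = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().endswith("Status"):
            status[key.strip()] = value.strip()
    return status


def check_controller(output):
    """Return (exit_code, messages) for one controller (hypothetical helper)."""
    status = parse_status(output)
    code = OK
    messages = []

    # A missing or non-OK controller status is treated as critical (assumption).
    controller = status.get("Controller Status")
    if controller != "OK":
        code = max(code, CRITICAL)
        messages.append("controller status: %s" % controller)

    # A cache that is reported but not OK, e.g. "Permanently Disabled".
    cache = status.get("Cache Status")
    if cache is not None and cache != "OK":
        code = max(code, WARNING)
        messages.append("cache status: %s" % cache)

    # The battery line is absent altogether when there is no battery, so a
    # missing key has to be flagged explicitly (severity is an assumption).
    battery = status.get("Battery/Capacitor Status")
    if battery is None:
        code = max(code, WARNING)
        messages.append("battery/capacitor status not reported")
    elif battery != "OK":
        code = max(code, WARNING)
        messages.append("battery/capacitor status: %s" % battery)

    return code, messages
```

Fed the ms-be1022 output, this sketch reports nothing; fed the ms-be1021 output, it flags both the disabled cache and the missing battery line. The point is that the absence of the battery line is itself the signal, so a check that only inspects the lines that are present cannot see it.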

Event Timeline

Change 354079 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] raid/hpssacli: WARN on permanently disabled cache

https://gerrit.wikimedia.org/r/354079

Change 354080 had a related patch set uploaded (by Faidon Liambotis; owner: Faidon Liambotis):
[operations/puppet@production] raid/hpssacli: check for cable errors/no batteries

https://gerrit.wikimedia.org/r/354080

@faidon let me know if you want the Icinga RAID handler to also open tasks for warnings; these include the above and the predictive drive failures for HP controllers.

It probably shouldn't; these issues are rare enough and complex enough that it's probably better if we handle them manually for now, I think.

Change 354079 merged by Faidon Liambotis:
[operations/puppet@production] raid/hpssacli: WARN on permanently disabled cache

https://gerrit.wikimedia.org/r/354079

Change 354080 merged by Faidon Liambotis:
[operations/puppet@production] raid/hpssacli: check for cable errors/no batteries

https://gerrit.wikimedia.org/r/354080

Change 356070 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] raid/hpssacli: allow NRPE to execute all commands

https://gerrit.wikimedia.org/r/356070

Change 356070 merged by Faidon Liambotis:
[operations/puppet@production] raid/hpssacli: allow NRPE to execute all commands

https://gerrit.wikimedia.org/r/356070

Mentioned in SAL (#wikimedia-operations) [2017-05-29T17:29:47Z] <volans> disabled puppet on tegmen and disabled raid_handler temporarily T163998

Mentioned in SAL (#wikimedia-operations) [2017-05-29T17:40:12Z] <volans> re-enabled puppet on tegmen and re-enabled raid_handler T163998

faidon claimed this task.

This has been fixed for a while.