
Address recurrent service check time out for "HP RAID" on swift backend hosts
Open, Normal, Public

Description

The "HP RAID" check (check_hpssacli) is known to timeout on busy ms-be hosts. The timeout has been recurring now for a while on icinga and doesn't provide a whole lot of value in this state.

Potential solutions in no particular order:

  1. Increase the NRPE server timeout
  2. Increase check retries
  3. Move to an asynchronous check model, where generating results and checking for alert states are decoupled

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 29 2018, 1:12 PM
fgiunchedi renamed this task from Address recurrent service check time out for "HP RAID" to Address recurrent service check time out for "HP RAID" on swift backend hosts. · Nov 29 2018, 1:12 PM
fgiunchedi added projects: monitoring, Operations.

Note we've been here before in T172921: Nrpe command_timeout and "Service Check Timed Out" errors. Sadly, the command check timeout can only be changed globally on the Icinga side, not per-service.

jijiki triaged this task as Normal priority. · Dec 4 2018, 10:24 PM
Volans added a subscriber: Volans. · Dec 12 2018, 5:54 PM

There are a few options that occur to me right away:

  • A cron job generates Prometheus metrics that are exposed via the node exporter textfile collector
  • A script that runs on cron and caches the output of hpssacli, plus updating the NRPE check to use the cached output (with an additional staleness check); see the sketch after this list
  • Passive Icinga checks
  • hpraid_exporter: https://github.com/chromium58/hpraid_exporter
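
To make the cached-output option concrete, here is a minimal sketch of the NRPE-side half, assuming a companion cron job refreshes the cache files periodically; all paths, filenames and the staleness threshold are assumptions, not an existing implementation:

```
#!/usr/bin/env python3
"""Sketch of the NRPE-side half of the cached-output approach.
A companion cron job would run the real check with a generous timeout, e.g.:
  check_hpssacli > /var/cache/hpraid/check.out; echo $? > /var/cache/hpraid/check.rc
All paths, filenames and the staleness threshold here are assumptions."""

import os
import sys
import time

CACHE_OUT = "/var/cache/hpraid/check.out"   # cached plugin output
CACHE_RC = "/var/cache/hpraid/check.rc"     # cached plugin exit code
MAX_AGE = 2 * 3600                          # treat results older than 2h as stale


def main():
    try:
        age = time.time() - os.path.getmtime(CACHE_OUT)
        if age > MAX_AGE:
            print("UNKNOWN: cached HP RAID result is %d seconds old (stale)" % age)
            return 3
        with open(CACHE_OUT) as f:
            output = f.read().strip()
        with open(CACHE_RC) as f:
            rc = int(f.read().strip())
    except (OSError, ValueError) as exc:
        print("UNKNOWN: cannot read cached HP RAID result: %s" % exc)
        return 3
    # Relay the cached plugin output and exit code to NRPE/Icinga unchanged.
    print(output)
    return rc


if __name__ == "__main__":
    sys.exit(main())
```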

I would suggest picking one of these solutions:

  • A script that runs on cron and caches the output of hpssacli, plus updating the NRPE check to use the cached output (with an additional staleness check, as sketched above)
  • Passive Icinga checks

Regarding the last one, it would basically mean converting the check to a passive one driven by a crontab with a large-enough timeout. It might be tricky on the Puppet side, and of course we should do this only for the ms-be hosts.
I know there are concerns over passive checks, but what we're trying to do with the other approaches is basically re-implementing a passive check via NRPE.
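
For reference, a passive result ultimately reaches Icinga as a PROCESS_SERVICE_CHECK_RESULT external command, and check_freshness/freshness_threshold on the service definition can catch the case where results stop arriving. Below is a minimal sketch, assuming the result is written directly to the command pipe on the Icinga server (in practice an ms-be host would ship it over some transport such as NSCA first); the command file path and the host/service names are assumptions:

```
#!/usr/bin/env python3
"""Sketch of submitting a passive result to Icinga as a
PROCESS_SERVICE_CHECK_RESULT external command. Assumes direct access to the
command pipe on the Icinga server; the path and names are assumptions."""

import time

COMMAND_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed command pipe location


def submit_passive_result(host, service, return_code, plugin_output):
    """Queue one passive service check result for Icinga to process."""
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, return_code, plugin_output)
    with open(COMMAND_FILE, "w") as cmd:
        cmd.write(line)


if __name__ == "__main__":
    # Example: report the WARNING state shown below for a hypothetical host.
    submit_passive_result(
        "ms-be1001", "HP RAID", 1,
        "WARNING: Slot 0: Predictive Failure: 1I:1:3 - ...")
```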

This is the kind of information this check provides:

WARNING: Slot 0: Predictive Failure: 1I:1:3 - OK: 1I:1:1, 1I:1:2, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Cache: Permanently Disabled - Battery/Capacitor: Failed (Replace Batteries)

I honestly don't see how we could fit this kind of rich information into Prometheus metrics without creating one metric per disk (with mapping values 0=ok, 1=predictive, 2=failed, for example) plus additional metrics for the rest of the checks.
Going with a single metric that reflects the Nagios exit code of the script seems a step backwards, not forwards, to me.
As for the over-time tracking that would come implicitly with Prometheus, I'm not sure it's very useful: if a disk is broken we open a task to replace it, so it's already trackable. For the rest, it's more important to know that something flapped (Icinga+IRC logs have that info) than to have real time-series data.
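
For illustration, a per-disk export along those lines could look roughly like the sketch below, feeding the node_exporter textfile collector with one gauge per physical drive using the 0/1/2 mapping; the metric name, the textfile directory and the hpssacli output parsing are all assumptions:

```
#!/usr/bin/env python3
"""Sketch of a cron job exporting one gauge per physical drive for the
node_exporter textfile collector, using the 0=ok / 1=predictive failure /
2=failed mapping described above. Metric name, textfile directory and
hpssacli output parsing are assumptions."""

import os
import re
import subprocess
import tempfile

TEXTFILE_DIR = "/var/lib/prometheus/node.d"   # assumed textfile collector dir
PROM_FILE = os.path.join(TEXTFILE_DIR, "hpraid.prom")
STATUS_MAP = {"OK": 0, "Predictive Failure": 1, "Failed": 2}


def main():
    out = subprocess.run(
        ["hpssacli", "controller", "all", "show", "config"],
        capture_output=True, text=True, timeout=300, check=True).stdout

    lines = [
        "# HELP hpraid_physicaldrive_status 0=ok 1=predictive failure 2=failed",
        "# TYPE hpraid_physicaldrive_status gauge",
    ]
    # Assumed line format: "physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 600 GB, OK)"
    for match in re.finditer(r"physicaldrive (\S+) \((.*)\)", out):
        drive, details = match.groups()
        # Unknown states are treated as failed (2) rather than silently ok.
        status = next((v for k, v in STATUS_MAP.items() if k in details), 2)
        lines.append('hpraid_physicaldrive_status{drive="%s"} %d' % (drive, status))

    # Write atomically so node_exporter never scrapes a half-written file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(lines) + "\n")
    os.rename(tmp, PROM_FILE)


if __name__ == "__main__":
    main()
```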

My 2 cents :)

Agreed, in this case it might be worth moving to passive Icinga checks or caching the hpssacli output, mostly by exclusion of the other options: the Prometheus approach would work, but check_hpssacli has a lot of logic, so we'd need to export only overall results (e.g. health ok / not-ok) and consult logs on failure. Ditto for hpraid_exporter: AFAICS it doesn't have extra checks/logic, and we'd need to implement/port that from check_hpssacli.

ema added a subscriber: ema. · Dec 17 2018, 2:11 PM
fgiunchedi moved this task from Backlog to Up next on the User-fgiunchedi board. · Jan 2 2019, 11:54 AM