
HP RAID (Service Check Timed Out) on swift hosts
Closed, Duplicate (Public)

Description

ms-be1030 - HP RAID - UNKNOWN - (Service Check Timed Out)

The issue seems to be that the HP RAID check takes longer to run than the Icinga timeout allows (about 69 seconds of wall-clock time in the run below):

ayounsi@ms-be1030:~$ time /usr/local/lib/nagios/plugins/check_hpssacli
OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK

real	1m9.053s
user	0m35.384s
sys	0m4.060s

Event Timeline

nrpe::monitor_service has a "timeout" parameter for this.

example:

modules/role/manifests/mail/mx.pp

nrpe::monitor_service { 'check_exim_queue':
..
    nrpe_command   => '/usr/local/lib/nagios/plugins/check_exim_queue -w 1000 -c 3000',
..
    timeout        => 20,

Since this check has been going UNKNOWN sporadically, I think adding an additional minute is a good idea.

Change 370505 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/puppet@production] Bumping HP RAID Icinga check timeout from 60 to 90s

https://gerrit.wikimedia.org/r/370505

Change 370505 merged by Ayounsi:
[operations/puppet@production] Bumping HP RAID Icinga check timeout from 60 to 90s

https://gerrit.wikimedia.org/r/370505
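
For illustration only, here is a rough sketch of how a longer timeout might be expressed for this check, modeled on the check_exim_queue example above. The resource title is hypothetical, and this is not the content of the merged change. Only the plugin path, the "timeout" parameter name, and the 90-second value come from this task.

```puppet
# Hypothetical sketch: the resource title is an assumption, and this is
# not the actual merged patch.
nrpe::monitor_service { 'check_hpssacli':
    # Plugin path as run by hand in the task description.
    nrpe_command => '/usr/local/lib/nagios/plugins/check_hpssacli',
    # Timeout in seconds. The manual run above took about 69s, so 90s
    # leaves some headroom over the previous 60s.
    timeout      => 90,
    # Any other parameters the define requires are omitted here.
}
```

With roughly 69 seconds of observed runtime, a 90-second limit leaves about 20 seconds of margin, which may not be enough if the check slows down further under load.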

This is apparently an issue again. See screenshot below from today:

RAID-check-timeouts-ms-be.png (299×1 px, 84 KB)

Dzahn edited projects, added SRE; removed Patch-For-Review.

HP RAID checks are timing out on all eqiad swift hosts.

Dzahn renamed this task from HP RAID (Service Check Timed Out) to HP RAID (Service Check Timed Out) on swift hosts. Mar 15 2019, 10:01 AM
Dzahn added a project: SRE-swift-storage.

Given that this is quite old, I'm closing it as a duplicate of T210723, which has a more recent discussion of possible solutions. (CC @colewhite)