
Nrpe command_timeout and "Service Check Timed Out" errors
Closed, Resolved (Public)

Description

We have some check_nrpe-based service checks defined (check_raid_hpssacli, for instance) with long 90s timeouts. However, it looks like the nrpe service is running with a default command_timeout value of 60s, possibly originating from the deb package:

# /etc/nagios/nrpe.cfg
command_timeout=60
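
To confirm which value the daemon is actually running with on a monitored host, the configuration (and any local overrides under /etc/nagios/) can be inspected directly; the command below is illustrative rather than taken from the original report:

ms-be1030:~# grep -R command_timeout /etc/nagios/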

This is causing some RAID checks to time out despite being called with a sufficient -t value: the -t flag only controls how long the check_nrpe client waits for a response, while command_timeout limits how long the NRPE daemon lets the command run, so the lower of the two effectively wins. For example, here is check_raid_hpssacli being called from the icinga server with a timeout of 90s:

einsteinium:~# /usr/lib/nagios/plugins/check_nrpe -H ms-be1030.eqiad.wmnet -c check_raid_hpssacli -t 90
NRPE: Command timed out after 60 seconds

This check is taking ~65s to complete locally:

ms-be1030:~# time /usr/local/lib/nagios/plugins/check_hpssacli
OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK

real	1m5.345s
user	0m36.320s
sys	0m3.796s

Event Timeline

Temporarily setting command_timeout=90 in /etc/nagios/nrpe_local.cfg on the monitored system fixes it, so I will submit a patch for this.
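
For reference, the temporary override is just the single setting below in the local config file (which the Debian package's nrpe.cfg normally pulls in via an include directive), followed by a restart of the NRPE daemon so it re-reads its configuration. The service name here assumes the Debian default, nagios-nrpe-server:

# /etc/nagios/nrpe_local.cfg
command_timeout=90

ms-be1030:~# systemctl restart nagios-nrpe-server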

einsteinium:~# /usr/lib/nagios/plugins/check_nrpe -H ms-be1030.eqiad.wmnet -c check_raid_hpssacli -t 90
OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK

Thanks @herron! Indeed, the check is slow when the RAID controller is busy and the machines have lots of traffic.

Change 370858 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] Add 90s command_timeout override to nrpe_local.cfg

https://gerrit.wikimedia.org/r/370858

Change 370858 merged by Herron:
[operations/puppet@production] Add 90s command_timeout override to nrpe_local.cfg

https://gerrit.wikimedia.org/r/370858

This looks good so far. The 4 ms-be10NN HP RAID checks that were in the "Service Check Timed Out" state before deploying are now showing a healthy status. Going to mark this as resolved.