
Nrpe command_timeout and "Service Check Timed Out" errors
Closed, Resolved (Public)

Description

We have some check_nrpe-based service checks defined (check_raid_hpssacli, for instance) with long 90s timeouts. However, it looks like the nrpe service is running with a default command_timeout value of 60s, possibly originating from the deb package:

# /etc/nagios/nrpe.cfg
command_timeout=60
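
To confirm which value the daemon is actually running with on a monitored host, the configuration (and any local overrides under /etc/nagios/) can be inspected directly; the command below is illustrative rather than taken from the original report:

ms-be1030:~# grep -R command_timeout /etc/nagios/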

This is causing some RAID checks to time out despite being called with a sufficient -t value: the -t flag only controls how long the check_nrpe client waits for a response, while command_timeout limits how long the NRPE daemon lets the command run, so the lower of the two effectively wins. For example, here is check_raid_hpssacli being called from the icinga server with a timeout of 90s:

einsteinium:~# /usr/lib/nagios/plugins/check_nrpe -H ms-be1030.eqiad.wmnet -c check_raid_hpssacli -t 90
NRPE: Command timed out after 60 seconds

This check is taking ~65s to complete locally:

ms-be1030:~# time /usr/local/lib/nagios/plugins/check_hpssacli
OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK

real	1m5.345s
user	0m36.320s
sys	0m3.796s

Event Timeline

Temporarily setting command_timeout=90 in /etc/nagios/nrpe_local.cfg on the monitored system fixes it, so I will submit a patch for this.
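
For reference, the temporary override is just the single setting below in the local config file (which the Debian package's nrpe.cfg normally pulls in via an include directive), followed by a restart of the NRPE daemon so it re-reads its configuration. The service name here assumes the Debian default, nagios-nrpe-server:

# /etc/nagios/nrpe_local.cfg
command_timeout=90

ms-be1030:~# systemctl restart nagios-nrpe-server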

einsteinium:~# /usr/lib/nagios/plugins/check_nrpe -H ms-be1030.eqiad.wmnet -c check_raid_hpssacli -t 90
OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK

Thanks @herron! Indeed, the check is slow when the RAID controller is busy and the machines have lots of traffic.

Change 370858 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] Add 90s command_timeout override to nrpe_local.cfg

https://gerrit.wikimedia.org/r/370858

Change 370858 merged by Herron:
[operations/puppet@production] Add 90s command_timeout override to nrpe_local.cfg

https://gerrit.wikimedia.org/r/370858

This looks good so far. The 4 ms-be10NN HP RAID checks that were in the "Service Check Timed Out" state before deploying are now showing a healthy status. Going to mark this as resolved.