Review check_ping settings
Closed, Declined (Public)

Assigned To

None

Authored By

	herron
	Aug 14 2017, 2:36 PM

Description

Potential icinga tunable - Number of packets sent by each call to check_ping. Currently host check sends 5 packets which takes ~4s (per host). The host check runs every 5 minutes.

einsteinium:~# time /usr/lib/nagios/plugins/check_ping -H lists.wikimedia.org -w 500,20% -c 2000,100% -p 5
PING OK - Packet loss = 0%, RTA = 0.42 ms|rta=0.420000ms;500.000000;2000.000000;0.000000 pl=0%;20;100;0

real	0m4.084s
user	0m0.000s
sys	0m0.000s

Lowering to 1 packet speeds the check up considerably:

einsteinium:~# time /usr/lib/nagios/plugins/check_ping -H lists.wikimedia.org -w 500,20% -c 2000,100% -p 1
PING OK - Packet loss = 0%, RTA = 0.47 ms|rta=0.474000ms;500.000000;2000.000000;0.000000 pl=0%;20;100;0

real	0m0.009s
user	0m0.000s
sys	0m0.004s

Related Objects
		Status	Subtype	Assigned	Task
		Resolved		herron	T173050 Investigate icinga (einsteinium) load
		Declined		None	T173315 Review check_ping settings

Event Timeline

herron created this task. Aug 14 2017, 2:36 PM

Restricted Application added a subscriber: Aklapper. Aug 14 2017, 2:36 PM

faidon moved this task from Inbox to Up next on the observability board. Aug 21 2017, 3:17 PM

faidon added a project: SRE. Aug 25 2017, 2:04 PM

1 ping is going to be too error-prone though :/ A single packet may be dropped for whatever reason on either side or in transport. Especially when talking about cross-DC checks, this isn't too uncommon. We monitor levels of packetloss with smokeping, but we wouldn't want to see a large amount of random hosts alerting when such events happen.

Perhaps we could lower the ping interval in order to shorten the time of the check, though? What problem have you noticed with the existing configuration?

The check_ping on our icinga hosts doesn't seem to have an option equivalent to the -i of the ping command. Reducing from 5 to 3 packets halves the time to 2s per check, but I'm not sure it's worth it given the increased risk of false positives (although with 3 packets the risk should be low enough inside our prod network).
I'm also curious what the current issue/bottleneck is.

I should have clarified in the description that the current host check has a max check of 2, so I was wondering about the pros and cons of the current 2 consecutive 4s, 5-packet check_pings versus, say, 3 or 5 consecutive 0.009s, 1-packet check_pings for the purpose of determining host up/down.
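To make that comparison a bit more concrete, here is a rough sketch of the false "down" probability for each layout. It assumes every packet is lost independently with the same probability and that a check only goes critical at 100% loss (as with -c 2000,100%); it also reads "max check of 2" as two consecutive attempts. The 5% loss rate and the function name are purely illustrative, and real loss tends to come in bursts, so treat the numbers as an upper bound on how much extra packets help.

```python
def false_down_probability(loss_rate: float, packets_per_check: int, attempts: int) -> float:
    """Probability that every packet in every consecutive attempt is lost.

    Assumption: packets are lost independently with probability loss_rate,
    and a check is only critical when all of its packets are lost.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if packets_per_check < 1 or attempts < 1:
        raise ValueError("packets_per_check and attempts must be at least 1")
    return loss_rate ** (packets_per_check * attempts)


# Hypothetical 5% packet loss during a network event.
LOSS_RATE = 0.05

# (packets per check, consecutive attempts) for the layouts discussed above.
layouts = {
    "2 attempts x 5 packets": (5, 2),
    "3 attempts x 1 packet": (1, 3),
    "5 attempts x 1 packet": (1, 5),
}

for label, (packets, attempts) in layouts.items():
    probability = false_down_probability(LOSS_RATE, packets, attempts)
    print(f"{label}: {probability:.3g}")
```

Under these assumptions only the total number of packets matters, so the current 2 x 5 layout is the most tolerant of random loss, while 5 x 1 gets closer to it than 3 x 1 does.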

It's not specifically a problem as it is, but icinga runs with a high CPU load, and tuning the checks to require less concurrency overall seems like it might help with that.

Also, FWIW, https://www.icinga.com/docs/icinga1/latest/en/tuning.html item 11 outlines a similar approach.

I read item 11 and noticed the very end of it: "Another option would be to use a faster plugin (i.e. check_fping) as the host_check_command instead of check_ping.". How about trying that, using check_fping or another plugin?

https://exchange.nagios.org/directory/Plugins/Network-Protocols/ICMP/check_fping/details

Features / Improvements 

- Includes the code from fping (in a modified state) instead of calling it and parsing the output. 

- Takes care of 'Unable to parse ping output' and such. 

- Fixes the 1 second maximum threshold value in check_fping, which was hardcoded in the fping source. 

- Removes the ridiculous appearance of precision down to a millionth of a millisecond (0.1000000 RTA), and instead gives proper and valuable output in a sensible manner. 

- Both check_ping and check_fping command line syntax works just fine, so it could be used to replace either one (-n and -p both denote number of packets to send). 

Drawbacks: 

Requires root privileges for raw sockets (if run setuid it drops privileges again after obtaining the socket). This is common to all ping programs though. 

ToDo: 
Parallelize packet sending. This requires a different packet identity encoding algorithm, as well as some manner of delay so that hosts don't think they're being flooded, so I'll wait a while with this.

Check_fping looks to be faster at multiple pings than check_ping when testing in a vagrant box. Nice!

jessie:~/check_fping# time ./check_fping -n 5 -H 10.0.0.1
OK - 10.0.0.1: loss 0%, rta 1.34 ms

real	0m0.510s
user	0m0.000s
sys	0m0.008s

On the other hand check_ping is faster when run with 1 ping.

einsteinium:~# time /usr/lib/nagios/plugins/check_ping -H lists.wikimedia.org -w 500,20% -c 2000,100% -p 1
PING OK - Packet loss = 0%, RTA = 0.47 ms|rta=0.474000ms;500.000000;2000.000000;0.000000 pl=0%;20;100;0

real	0m0.009s
user	0m0.000s
sys	0m0.004s

I guess it comes down to which is the lesser of two evils... Validating/compiling/maintaining check_fping or configuring check_ping for just one ping per check?

I think just a single ping is not enough and it would just be about reducing it to 2 or 3 packets instead of 5.

I'm not sure if a different implementation (like fping) is going to make a difference. check_ping is "slow" because it sends multiple packets over 1-second intervals -- that will always consume real time (but not CPU time) irrespective of implementation.
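As a back-of-the-envelope model of that point: if packets go out one interval apart and the check returns once the last reply arrives, wall time is roughly (packets - 1) x interval plus one round trip. The sketch below assumes a 1-second interval and a round trip of about 0.5 ms (in line with the RTA values pasted earlier); the function name is made up for illustration.

```python
def estimated_wall_time(packets: int, interval_s: float = 1.0, rtt_s: float = 0.0005) -> float:
    """Rough wall-clock duration of one ping-based check, in seconds.

    Assumption: packets are sent interval_s apart and the check finishes
    one round trip after the last packet is sent. Almost all of this is
    waiting, not CPU time.
    """
    if packets < 1:
        raise ValueError("packets must be at least 1")
    return (packets - 1) * interval_s + rtt_s


for packets in (5, 3, 1):
    print(f"{packets} packet(s): ~{estimated_wall_time(packets):.3f}s")
```

This lines up with the roughly 4s, 2s and near-zero timings reported in this task for 5, 3 and 1 packets, which suggests the duration is dominated by the inter-packet wait.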

I'd be hesitant to switch implementations just because of that. I'd also be hesitant to reduce from 5 to 2-3 packets if we're not under resource pressure. Thoughts?

We could leave it alone. My main concern was around reducing check concurrency after seeing the parent icinga process sitting at 100% cpu each time I looked, but I don't have a good trend to highlight this clearly. Are we gathering per-process statistics somewhere?
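For an ad-hoc look at a single process in the absence of a proper trend, one option is to sample /proc directly. This is only a sketch, assuming a Linux host; the function names are hypothetical and it is not something currently deployed. It reads utime and stime (fields 14 and 15 of /proc/<pid>/stat) twice and converts the difference to a CPU percentage.

```python
import os
import time


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime, in clock ticks, from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split only what follows the last ')'.
    fields = stat_line[stat_line.rindex(")") + 2:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def read_cpu_ticks(pid: int) -> int:
    """Read the current CPU tick count for a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return cpu_ticks(stat_file.read())


def cpu_percent(pid: int, window_s: float = 1.0) -> float:
    """Average CPU usage of pid over window_s seconds (100 = one full core)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    start_ticks, start = read_cpu_ticks(pid), time.monotonic()
    time.sleep(window_s)
    end_ticks, end = read_cpu_ticks(pid), time.monotonic()
    return 100.0 * (end_ticks - start_ticks) / ticks_per_second / (end - start)
```

Calling cpu_percent() with the pid of the parent icinga process at regular intervals and logging the result would give a crude trend to compare before and after any tuning.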

No, no per-process statistics that I know of :(

Declining this, for now at least, per feedback I've heard from you and others :)

Fair enough :)
