
Push check latency and check execution time to Prometheus
Closed, Resolved · Public

Description

Average check latency on icinga1001 [1] appears to swing between 6 and 15 seconds. Latency larger than 10 seconds is not ideal [2].

Daniel and I have implemented some tweaks to help alleviate the problem, but cannot definitively say whether they were successful. I think it would be worthwhile to gather these metrics into Grafana, as all the evidence so far is fairly anecdotal.

[1] https://icinga-stretch.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4
[2] https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html

The "max_concurrent_checks" setting is a major factor for performance. It is now configurable in Hiera and set to "0" (unlimited) on the new server, while it is set to 10000 on the old server [3].

Reference is made to section 7 in the tuning guide [4], which says "If you are seeing high latency values (> 10 or 15 seconds) for the majority of your service checks (via the extinfo CGI), you are probably starving Nagios of the checks it needs."

[3] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469253/
[4] https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html
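For context, a rough way to sanity-check a "max_concurrent_checks" value is a Little's-law estimate: checks started per second multiplied by average execution time. This is a back-of-the-envelope sketch with placeholder numbers, not measured values from icinga1001 and not the exact formula from the tuning guide:

```shell
# Little's-law style estimate of steady-state check concurrency.
# All three inputs are placeholders; substitute real numbers from Icinga.
services=50000      # total number of service checks (placeholder)
interval=300        # normal check interval in seconds (placeholder)
exec_time=2.5       # average check execution time in seconds (placeholder)

# concurrency ~= (checks started per second) * (average execution time)
awk -v s="$services" -v i="$interval" -v t="$exec_time" \
    'BEGIN { printf "~%d checks in flight\n", (s / i) * t }'
```

If the estimate approaches the configured limit, checks queue up and latency grows, which matches the "starving Nagios" symptom the guide describes.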

Event Timeline

Restricted Application removed a project: Patch-For-Review. · View Herald Transcript · Oct 26 2018, 4:08 PM
jijiki triaged this task as Normal priority. · Oct 26 2018, 4:09 PM
jijiki added a project: Icinga.
Dzahn added a comment. · Oct 26 2018, 5:14 PM

comparison between old and new:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4

vs

https://icinga-stretch.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4

Note: both machines have 32 cores, but the cores in the old server run at a higher clock frequency than the new ones.
The old machine accordingly reports more bogomips than the new one.

[einsteinium:~] $  grep bogomips /proc/cpuinfo  | wc -l
32
[einsteinium:~] $ grep bogomips /proc/cpuinfo  | tail
bogomips	: 5000.30
bogomips	: 5001.65
bogomips	: 5000.30
bogomips	: 5001.65
bogomips	: 5000.30
bogomips	: 5001.65
bogomips	: 5000.30
bogomips	: 5001.65
bogomips	: 5000.30
bogomips	: 5001.65

[icinga1001:~] $ grep bogomips /proc/cpuinfo  | wc -l
32
[icinga1001:~] $ grep bogomips /proc/cpuinfo  | tail
bogomips	: 4200.00
bogomips	: 4201.31
bogomips	: 4200.00
bogomips	: 4201.31
bogomips	: 4200.00
bogomips	: 4201.31
bogomips	: 4200.00
bogomips	: 4201.31
bogomips	: 4200.00
bogomips	: 4201.31
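The per-core lists above can be collapsed into a single average for easier comparison; a quick awk one-liner, assuming the standard /proc/cpuinfo layout shown above:

```shell
# Average bogomips across all cores; run on each host and compare.
awk '/^bogomips/ { sum += $3; n++ } END { printf "%.2f\n", sum / n }' /proc/cpuinfo
```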

[einsteinium:~] $ facter processors
{"models"=>["Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz", "Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz"], "count"=>32, "physicalcount"=>2}


[icinga1001:~] $ facter processors
{"models"=>["Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz", "Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz"], "count"=>32, "physicalcount"=>2}
Dzahn added a comment. · Oct 26 2018, 5:20 PM

Point 11 from the Tuning Guide, "Optimize Host Checks", says "Another option would be to use a faster plugin (i.e. check_fping) as the host_check_command instead of check_ping."

That looks like an easy one; we already have fping installed. Here's a patch for that:

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469333/
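For reference, switching the host check over amounts to a command definition along these lines. The plugin path and thresholds here are illustrative, not the values from the patch:

```
define command {
    command_name    check_fping
    command_line    $USER1$/check_fping -H $HOSTADDRESS$ -w 100,20% -c 500,60%
}
```

check_fping avoids forking a full ping process per host, which is why the guide recommends it for large installations.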

Dzahn added a comment (edited). · Oct 26 2018, 5:24 PM

What has already been done: the change below made "max_concurrent_checks" configurable via Hiera. It was previously hardcoded to 10000 in the template and is one of the major factors affecting performance:

The same patch also set it to "0" (unlimited) on icinga1001, while keeping the 10000 setting on einsteinium/tegmen unchanged.

Further comments and references to the Tuning Guide are in the commit message.

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469253/

See point 7, "Check service latencies to determine best value for maximum concurrent checks.", in https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html

Dzahn updated the task description. (Show Details) · Oct 26 2018, 5:29 PM
Dzahn added a comment. · Oct 26 2018, 5:37 PM

Other things I have tried but not puppetized so far:

(point 5 from the tuning guide, Max Reaper Time.)

21:30 mutante: icinga1001 - changing check_result_reaper_frequency from 10 to 3, trying to lower average check latency. "allow faster check result processing -> requires more CPU" (T202782)

Things we could consider but that take more work / haven't been tried yet:

  • use passive checks for more than just fundraising: let our hosts report to Icinga using NSCA (we already do this for frack) and replace some NRPE active checks (point 8 from the guide)
  • compile check_ssl and other Perl scripts to binaries (point 9, "Avoid using interpreted plugins.")

Things that are already enabled and done:

  • use_large_installation_tweaks=1
  • use_aggressive_host_checking=0
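Pulled together, the icinga.cfg knobs mentioned in this task look like the fragment below. Values are as discussed above; verify against the live config before relying on them:

```
# icinga.cfg tuning knobs discussed in this task
max_concurrent_checks=0            # unlimited on the new server; 10000 on the old
check_result_reaper_frequency=3    # lowered from 10
use_large_installation_tweaks=1
use_aggressive_host_checking=0
```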

Mentioned in SAL (#wikimedia-operations) [2018-10-26T20:03:36Z] <mutante> icinga1001 - disabled puppet, changed: check_result_reaper_frequency=2 ; max_check_result_reaper_time=10 to test if it lowers latency (T208066)

Change 470077 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: tune check_result_reaper_frequency and _time on stretch

https://gerrit.wikimedia.org/r/470077

Change 470077 merged by Dzahn:
[operations/puppet@production] icinga: tune check_result_reaper_frequency and _time on stretch

https://gerrit.wikimedia.org/r/470077

> Other things I have tried but not puppetized so far:
> (point 5 from the tuning guide, Max Reaper Time.)
> 21:30 mutante: icinga1001 - changing check_result_reaper_frequency from 10 to 3, trying to lower average check latency. "allow faster check result processing -> requires more CPU" (T202782)
> Things we could consider but that take more work / haven't been tried yet:
>
>   • use passive checks for more than just fundraising: let our hosts report to Icinga using NSCA (we already do this for frack) and replace some NRPE active checks (point 8 from the guide)
>   • compile check_ssl and other Perl scripts to binaries (point 9, "Avoid using interpreted plugins.")

Is there a way to know how long each check_command takes? That would give us some indication of where to start looking for low-hanging fruit, since AIUI Icinga is essentially limited by how fast a single core can process check results and by how long checks take to run.

Also, we have check_prometheus, which is Python 3 and thus might consume significant CPU if it runs frequently.
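One low-tech way to answer the per-check timing question, pending the exporter: Icinga already writes check_execution_time for every service into status.dat, so the slowest checks can be ranked directly. A sketch, assuming the stock status.dat block layout (the path may differ on our hosts):

```shell
# Print the ten slowest service checks by execution time from status.dat.
awk -F= '
    /^servicestatus/                      { in_svc = 1 }
    in_svc && $1 ~ /host_name/            { host = $2 }
    in_svc && $1 ~ /service_description/  { svc = $2 }
    in_svc && $1 ~ /check_execution_time/ { t = $2 }
    in_svc && /^[ \t]*}/                  { print t, host, svc; in_svc = 0 }
' /var/lib/icinga/status.dat | sort -rn | head
```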

Dzahn moved this task from Backlog to In progress on the observability board. · Oct 29 2018, 3:01 PM

Change 471298 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-icinga-exporter@master] initial commit

https://gerrit.wikimedia.org/r/471298

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4 shows an average latency of 7.153 sec, which, after our tuning efforts, is the same as or better than before. It was even lower, around 5 seconds compared to 7.5 at the time of the switch, so we had no concerns about moving on. Can we close this task, or did you want to keep it open to implement the graphs?

I'll re-title the task and claim it to implement the metrics collection.

colewhite renamed this task from "Concerns about icinga1001 check latency" to "Push check latency and check execution time to Prometheus". · Nov 17 2018, 12:08 AM
colewhite claimed this task.
colewhite lowered the priority of this task from Normal to Low.

Change 476431 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] role, profile: install, run, and collect icinga exporter metrics

https://gerrit.wikimedia.org/r/476431

Change 471298 merged by Cwhite:
[operations/debs/prometheus-icinga-exporter@master] initial commit

https://gerrit.wikimedia.org/r/471298
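With the exporter deployed, the Grafana panels this task asked for reduce to simple PromQL queries. The metric names below are assumptions for illustration; check the exporter's /metrics output for the actual names:

```
# Hypothetical metric names -- verify against the exporter's /metrics output
avg(icinga_service_check_latency_seconds)
avg(icinga_service_check_execution_time_seconds)
```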

Change 477360 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-icinga-exporter@master] incorporating first round of feedback

https://gerrit.wikimedia.org/r/477360

Change 477360 merged by Cwhite:
[operations/debs/prometheus-icinga-exporter@master] incorporating first round of feedback

https://gerrit.wikimedia.org/r/477360

Change 476431 merged by Cwhite:
[operations/puppet@production] role, profile: install, run, and collect icinga exporter metrics

https://gerrit.wikimedia.org/r/476431

Change 478803 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/debs/prometheus-icinga-exporter@master] decode status file as ISO-8859-1

https://gerrit.wikimedia.org/r/478803

Change 478803 merged by Cwhite:
[operations/debs/prometheus-icinga-exporter@master] decode status file as ISO-8859-1

https://gerrit.wikimedia.org/r/478803

Change 478999 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] add icinga_jobs definition

https://gerrit.wikimedia.org/r/478999

Change 478999 merged by Cwhite:
[operations/puppet@production] add icinga_jobs definition

https://gerrit.wikimedia.org/r/478999

colewhite closed this task as Resolved. · Dec 11 2018, 6:39 PM
Dzahn awarded a token. · Dec 11 2018, 7:22 PM