Average check latency on icinga1001 [1] appears to swing between 6 and 15 seconds. Latency larger than 10 seconds is not ideal [2].
Daniel and I have implemented some tweaks to help alleviate the problem, but we cannot say definitively whether they were successful. I think it would be worthwhile to gather these metrics into Grafana, as all the evidence so far is fairly anecdotal.
[1] https://icinga-stretch.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4
[2] https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html
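As a starting point for getting the latency numbers into Grafana, here is a minimal sketch that formats one latency sample as a Graphite plaintext line ("path value timestamp"). It assumes a Graphite-style backend behind Grafana, which this task does not confirm. The metric path and the helper names are hypothetical, and the 10 second threshold is taken from the tuning guide [2]. How the sample is read from Icinga is left out on purpose.

```python
import time

# Hypothetical metric path; the real naming scheme is not decided in this task.
METRIC_PATH = "icinga.icinga1001.avg_check_latency_seconds"


def graphite_line(path, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value:.3f} {int(timestamp)}\n"


def exceeds_threshold(latency_seconds, threshold=10.0):
    """Return True when latency is above the threshold (10 s per the tuning guide)."""
    return latency_seconds > threshold
```

For example, graphite_line(METRIC_PATH, 6.5, 1540000000) yields a single line that could be sent to a Graphite plaintext listener, and exceeds_threshold(15.0) flags the upper end of the observed swing.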
The "max_concurrent_checks" setting is a major factor for performance. It is now configurable in Hiera, and is set to "0" (unlimited) on the new server and to 10000 on the old server [3].
Section 7 of the tuning guide [4] says: "If you are seeing high latency values (> 10 or 15 seconds) for the majority of your service checks (via the extinfo CGI), you are probably starving Nagios of the checks it needs."
[3] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/469253/
[4] https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/tuning.html
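The heuristic quoted from the tuning guide ("high latency for the majority of your service checks") could be checked mechanically once per-service latencies are collected. A rough sketch, assuming the latencies are already available as a list of seconds; the function name is hypothetical, and the default threshold uses the lower bound (10 s) from the quote:

```python
def majority_high_latency(latencies, threshold=10.0):
    """Return True when more than half of the samples exceed the threshold."""
    if not latencies:
        # No data: nothing can be said about the majority.
        return False
    high = sum(1 for value in latencies if value > threshold)
    return high > len(latencies) / 2
```

With samples of 12, 15 and 6 seconds this returns True (two of three are above 10 s). With 6, 7 and 15 seconds it returns False.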