Page MenuHomePhabricator

Icinga latency is skyrocketing and commands ignored
Closed, ResolvedPublic

Description

The latency in Icinga is recently skyrocketing, see https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=4

We often have an average check latency over 60s and a max over 90.
In addition to this, sometimes Icinga stop processing the commands to the command file, silently ignoring new donwtimes and making all frack passive checks go AWOL.

As also the passive host shows the same behaviour, I'm starting to think that we're reaching some limitation of Icinga and/or the current setup and we should investigate it to find it and possibly resolve it as any eventual replacement of Icinga is far down the road and the current installation must work for now.

Event Timeline

Volans triaged this task as High priority.Mar 12 2020, 4:38 PM
Volans created this task.

The main icinga daemon process, which AIUI is basically a single-threaded event loop, is pretty constantly near-or-at 100% CPU utilization, btw.

Change 579329 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] icinga: switch check_ping packet count to 1

https://gerrit.wikimedia.org/r/579329

Mentioned in SAL (#wikimedia-operations) [2020-03-17T13:30:19Z] <godog> stop puppet and turn on debug on icinga2001 - T247538

Change 580327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: relax interval for selected checks

https://gerrit.wikimedia.org/r/580327

Change 580327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: relax interval for selected checks

https://gerrit.wikimedia.org/r/580327

See also this review for low-hanging fruits/checks that we can relax, the top 20 is roughly like this (debug file gets rotated)

# awk -F\' '/scheduled check/ {print $2}' /var/log/icinga/icinga.debug* | sort | uniq -c | sort -nr | head -20
   2367 SSH
   2054 Check systemd state
   1935 configured eth
   1916 DPKG
   1852 Check whether ferm is active by checking the default input chain
   1835 Check size of conntrack table
   1827 Disk space
   1792 dhclient process
   1457 MD RAID
    689 Check that envoy is running
    676 PHP7 rendering
    657 mcrouter process
    604 puppet last run
    604 nutcracker process
    567 MediaWiki EtcdConfig up-to-date
    381 nutcracker socket
    373 php7.2-fpm service
    365 PHP opcache health
    339 Nginx local proxy to apache
    326 Apache HTTP

Change 579329 merged by Herron:
[operations/puppet@production] icinga: switch check_ping packet count to 2

https://gerrit.wikimedia.org/r/579329

Change 580347 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] prometheus: add icinga average latency checks

https://gerrit.wikimedia.org/r/580347

Change 580327 merged by Filippo Giunchedi:
[operations/puppet@production] base: relax interval for selected checks

https://gerrit.wikimedia.org/r/580327

The new baseline in eqiad for average check latency is ~70s, which isn't great IMHO but certainly better. Short of deploying more powerful hardware I think we can continue looking for low hanging fruits and lessen the load in terms of checks that icinga needs to run.

Top 50 checks as of today, with a little longer time horizon than the previous audit

root@icinga2001:/var/log/icinga# awk -F\' '/scheduled check/ {print $2}' /var/log/icinga/icinga.debug* | sort | uniq -c | sort -nr | head -50
  30087 SSH
  29924 Disk space
  29695 Check systemd state
  27239 Check size of conntrack table
  20745 MD RAID
   9830 Check that envoy is running
   9345 puppet last run
   8807 nutcracker process
   8702 php7.2-fpm service
   8626 mcrouter process
   8571 PHP opcache health
   8441 PHP7 rendering
   8233 nutcracker socket
   7268 MediaWiki EtcdConfig up-to-date
   7076 Nginx local proxy to apache
   7035 Apache HTTP
   2655 confd service
   2639 Ensure local MW versions match expected deployment
   1978 dhclient process
   1903 DPKG
   1859 configured eth
   1852 Check the NTP synchronisation status of timesyncd
   1758 EDAC syslog messages
   1743 MariaDB disk space #page
   1731 Memory correctable errors -EDAC-
   1728 mysqld processes #page
   1677 mysqld processes
   1664 Device not healthy -SMART-
   1635 Check whether ferm is active by checking the default input chain
   1630 IPMI Sensor Status
   1623 MariaDB disk space
   1576 very high load average likely xfs
   1503 Check the last execution of php7.2-fpm_check_restart
   1488 check_trafficserver_backend_config_status
   1488 Freshness of OCSP Stapling files -ATS-TLS acme-chief-
   1488 Default ATS Lua configuration file
   1487 Varnish HTCP daemon
   1487 Ensure traffic_server is running for instance backend
   1486 ats-tls HTTPS en.wikipedia.org ECDSA
   1486 Ensure traffic_manager binds on 3128 and responds to HTTP requests
   1484 Ensure traffic_manager is running for instance backend
   1483 check_trafficserver_log_fifo_notpurge_backend
   1483 Logs skipped by trafficserver-tls
   1483 Ensure traffic_exporter binds on port 9122 and responds to HTTP requests
   1482 Freshness of OCSP Stapling files -ATS-TLS-
   1478 traffic_server tls process restarted
   1478 Ensure traffic_manager is running for instance tls
   1477 ats-tls HTTPS en.wikipedia.org RSA
   1477 TLS Lua configuration file
   1476 Ensure traffic_server is running for instance tls

Another low-hanging fruit is to reduce the SSH check for the mgmts I think: It currently runs every minute, but the non-avail of the mgmt sshd has no end-user impact, so checking them hourly should be good enough? That would slash the 30087 checks from above by a lot.

Change 580985 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] icinga: relax check interval for selected checks

https://gerrit.wikimedia.org/r/580985

Change 580989 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] icinga: change check/retry interval of mgmt host check to 10/15

https://gerrit.wikimedia.org/r/580989

Change 580347 merged by Filippo Giunchedi:
[operations/puppet@production] prometheus: add icinga average latency checks

https://gerrit.wikimedia.org/r/580347

Another low-hanging fruit is to reduce the SSH check for the mgmts I think: It currently runs every minute, but the non-avail of the mgmt sshd has no end-user impact, so checking them hourly should be good enough? That would slash the 30087 checks from above by a lot.

The mgmt ssh and dns checks look to have an interval/retry of 60/60 currently, but the mgmt host ping check doesn't yet have a check_interval set which looks like similarly low-hanging fruit. Uploaded https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/580989/

Change 580989 abandoned by Herron:
icinga: change check/retry interval of mgmt host check to 10/15

Reason:
tabling this for now

https://gerrit.wikimedia.org/r/580989

fgiunchedi lowered the priority of this task from High to Medium.Mar 26 2020, 1:58 PM

Lowering priority as things I believe are better now, pending https://gerrit.wikimedia.org/r/c/operations/puppet/+/580985 as the last attempt at lowering check latency further.

Change 580985 merged by Filippo Giunchedi:
[operations/puppet@production] icinga: relax check interval for selected checks

https://gerrit.wikimedia.org/r/580985

fgiunchedi claimed this task.

With https://gerrit.wikimedia.org/r/580985 merged I'm resolving this task since check latency is doing better now, and we're alerting on excessive latency.