Netbox reports Icinga checks timeout
Closed, Resolved · Public

Description

The Netbox report Icinga NRPE checks often time out (10s timeout). When a report's state is critical, this makes the check flap between states (critical -> unknown -> critical), which creates spam on the IRC channel. See [1] for the Icinga log of one of them (the others look the same).
According to [2], the puppetdb one is by itself by far the noisiest alert on IRC in the last month, even more so if we sum all the Netbox report ones.
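
As a rough illustration of why the timeouts are so noisy, the sketch below counts state changes in a sequence of check results. It assumes that every change of state can produce one IRC line; Icinga's real notification logic (soft/hard states, retry intervals) is not modelled, and the function name is made up for this example.

```python
def count_state_changes(states):
    """Count transitions between consecutive check results that differ."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# A report that stays failed, with every check answering in time.
steady = ["CRITICAL"] * 5

# The same failed report, with every other check hitting the NRPE timeout.
flapping = ["CRITICAL", "UNKNOWN", "CRITICAL", "UNKNOWN", "CRITICAL"]

print(count_state_changes(steady))    # no state changes
print(count_state_changes(flapping))  # one change per timeout and per recovery
```

A report that is simply failed changes state once, while the same report with intermittent timeouts changes state on every timeout and again on every successful check.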

check_netbox_report.py has a comment saying that it has to fetch all report objects each time, even when checking only one of them. However, it seems to me that a simple get() already includes the result.failed property that we read in the Icinga check. Is there anything else missing?

IMHO there are some major improvements that could be made here:

  1. The code could be vastly simplified by removing support for checking multiple reports at once, which AFAIK we are not using: we always call it with a single report as a parameter.
  2. Instead of getting all reports from the API, getting only the one we need to check should speed up the API call quite a bit.
  3. We are now in a weird situation in which a script called check_netbox_report.py is run both by Icinga via NRPE (as it should be) and via systemd timers with the --run option to actually run the report. I think at this point those two independent actions should be split into two different scripts, with the NRPE check only checking the status and the systemd timers only running the report.
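
A minimal sketch of what a status-only check for a single report could look like after points 1 to 3. This is not the actual check_netbox_report.py: the URL layout, the `NETBOX_TOKEN` environment variable, the 8 second timeout and all function names are assumptions made for illustration. Only the `result.failed` property comes from the discussion above.

```python
"""Sketch of a status-only Icinga/NRPE check for one Netbox report."""
import json
import os
import urllib.request

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def report_url(base_url, report_name):
    """Build the URL of a single report (assumed path layout)."""
    return "{}/api/extras/reports/{}/".format(base_url.rstrip("/"), report_name)


def evaluate(report):
    """Map one report's JSON (already parsed to a dict) to (exit_code, message)."""
    name = report.get("name", "unknown report")
    result = report.get("result")
    if not result:
        return UNKNOWN, "{} UNKNOWN: report has no result yet".format(name)
    if result.get("failed"):
        return CRITICAL, "{} CRITICAL".format(name)
    return OK, "{} OK".format(name)


def fetch_report(base_url, report_name, token, timeout=8.0):
    """Fetch only the requested report instead of listing all of them."""
    request = urllib.request.Request(
        report_url(base_url, report_name),
        headers={"Authorization": "Token " + token, "Accept": "application/json"},
    )
    # Keep the HTTP timeout below the 10s NRPE timeout so the plugin can
    # still print its own UNKNOWN message.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def main(argv):
    """Check exactly one report; running reports is left to another script."""
    if len(argv) != 3:
        print("usage: check BASE_URL REPORT_NAME")
        return UNKNOWN
    token = os.environ.get("NETBOX_TOKEN", "")  # hypothetical variable name
    try:
        report = fetch_report(argv[1], argv[2], token)
    except Exception as exc:  # network errors and timeouts end up here
        print("UNKNOWN: could not fetch report: {}".format(exc))
        return UNKNOWN
    code, message = evaluate(report)
    print(message)
    return code
```

A real plugin would finish with `sys.exit(main(sys.argv))`. Keeping `evaluate()` free of network access makes the state mapping easy to test, and the script has no `--run` option, so running the report stays with the systemd timers.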

[1] https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=netbox1001&service=Check+the+Netbox+report+puppetdb+for+fail+status.
[2] https://logstash.wikimedia.org/app/kibana#/dashboard/AWm67Kpk8aQffZ3HmRpW?_g=h@ef324c0&_a=h@9330484

Details

Related Objects

Status: Resolved · Assigned: crusnov

Event Timeline

Volans created this task. Nov 9 2019, 12:19 PM
Restricted Application added a subscriber: Aklapper. Nov 9 2019, 12:19 PM
Volans triaged this task as High priority. Nov 9 2019, 12:19 PM

This is an excerpt of the backlog overnight:

01:17 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
02:31 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
02:43 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
03:00 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
03:23 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
04:14 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
04:48 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
05:16 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
06:19 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
06:42 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
07:33 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
09:56 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence
11:11 < icinga-wm> PROBLEM - Netbox report coherence. on netbox1001 is CRITICAL: coherence.Coherence CRITICAL https://netbox.wikimedia.org/extras/reports/coherence.Coherence

@crusnov, what's the latest here? Let's please fix this ASAP.

Change 552154 had a related patch set uploaded (by CRusnov; owner: CRusnov):
[operations/puppet@production] netbox report alerting: Simplify icinga check and cleanup

https://gerrit.wikimedia.org/r/552154

The above patch should address these issues. It greatly simplifies the Nagios check script and uses the API more efficiently, so it shouldn't flap anymore on a failed report.

Change 552154 merged by CRusnov:
[operations/puppet@production] netbox report alerting: Simplify icinga check and cleanup

https://gerrit.wikimedia.org/r/552154

What's the status of this task?

crusnov closed this task as Resolved. Nov 25 2019, 4:05 PM

I executed the plan that Riccardo outlined: I removed the ability to run reports from the check and switched to running them from the management script, which has simplified the code a bit. However, the real cause of the timeouts was that Netbox initializes all of the report objects when you query .all for the reports list, which for accounting, librenms, and puppetdb involves actually accessing a remote service and takes an unpredictable amount of time. I switched the Icinga check to .get the report object instead, so we only eat the unpredictability of one report, which for the time being appears to be under the 10 second limit. I'm opening an additional ticket to try to defensively restructure the reports so that they don't access external services unless they are actually used, to reduce any possibility of this happening (and also to reduce the possibility of a broken external service preventing anyone from looking at the report list in the interface).

TL;DR The checks do not seem to time out anymore. Setting as resolved.
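
The defensive restructuring mentioned above could follow a lazy-initialization pattern along these lines. This is only a sketch of the general idea, not the code from the follow-up ticket: `LazyServiceReport`, `FakeClient` and `fetch_hosts` are hypothetical names, and the class does not derive from Netbox's own report base class. It assumes Python 3.8+ for `functools.cached_property`.

```python
from functools import cached_property


class LazyServiceReport:
    """Hypothetical report whose remote client is created on first use."""

    def __init__(self, client_factory):
        # Only store the factory: constructing the report stays cheap and
        # never touches the network, so listing reports cannot block on it.
        self._client_factory = client_factory

    @cached_property
    def client(self):
        # The first access calls the factory; the result is cached on the
        # instance and reused afterwards.
        return self._client_factory()

    def run(self):
        """Run the report; the only place where the remote service is used."""
        return self.client.fetch_hosts()


class FakeClient:
    """Stand-in for a remote service such as PuppetDB or LibreNMS."""

    created = 0  # counts how many times a client was constructed

    def __init__(self):
        FakeClient.created += 1

    def fetch_hosts(self):
        return ["host1", "host2"]
```

With this shape, instantiating every report to build the report list costs nothing remote, and a broken external service only affects the one report that depends on it, at the moment it is run.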