
Icinga passive checks go awol and downtime stops working
Open, MediumPublic

Description

In the past few days, there have been at least 2 occurrences of the following:

At some point, passive checks all start returning:

WARNING: passive check is awol

Downtime actions also no longer take effect (the command is sent and succeeds, but the service never actually gets its scheduled downtime).

Checks/alerting in general continue to work as intended.

The last time this happened was 2018-06-04T07:30. A restart seems to fix both issues, at least temporarily. I suspect some kind of malfunction that gets reset, but not fixed, by the restart, and builds back up until it happens again.
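
For context, this is roughly how a downtime ends up in Icinga: the wrapper writes a SCHEDULE_SVC_DOWNTIME external command into the daemon's command file (a FIFO), and Icinga acknowledges it before acting on it. Below is a minimal sketch of that path; the command-file location and the helper are illustrative assumptions, not the actual icinga-downtime implementation. It shows why the submission can look successful even when the downtime never takes effect: the write to the FIFO succeeds regardless of whether the daemon later processes the command.

#!/usr/bin/env python3
# Minimal sketch: schedule a fixed service downtime by writing an external
# command to Icinga's command FIFO. Path and helper name are assumptions.
import time

CMD_FILE = "/var/lib/icinga/rw/icinga.cmd"  # assumed command-file location

def schedule_service_downtime(host, service, duration_s, author, comment):
    now = int(time.time())
    line = (
        f"[{now}] SCHEDULE_SVC_DOWNTIME;{host};{service};"
        f"{now};{now + duration_s};1;0;{duration_s};{author};{comment}\n"
    )
    # The write succeeds as soon as the FIFO accepts the line, regardless of
    # whether the daemon ever acts on it.
    with open(CMD_FILE, "w") as cmd_file:
        cmd_file.write(line)

schedule_service_downtime("db1100", "check_load", 3600, "ops", "maintenance")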

Event Timeline


While investigating possible root causes for this, I discovered that some new frack hosts installed last week were already sending metrics even though they are not yet fully configured. In particular, they are not present in the Icinga host list, so Icinga discards those messages, which in theory shouldn't do any harm. But to avoid having too many variables in play, I asked @Jgreen whether it was possible to stop sending the metrics entirely until the hosts are fully configured; he very kindly agreed and has already implemented it.

I've confirmed that we no longer have any discarded external command entries in the Icinga logs. So if the awol issue happens again now, it is entirely unrelated to the above. In the meantime I will keep looking to see if anything else comes up.

This happened again today, unfortunately. And because I don't see any logs of spurious passive checks from other frack hosts, I think we have to discard the hypothesis that they were the cause of the issue.

I will start looking into other possible root causes in the next days.

@Volans I remember you giving an update on this? Is this still a thing? Has it stopped happening for a while, in which case it should be closed?

@jcrespo We haven't seen any re-occurrence of this recently, but it did happen once after the change in frack, so honestly we don't have a root cause. We could resolve it and re-open if it happens again. I'm open to suggestions.


Potential re-occurrence at around 2018-08-10 11:56:02 (passive checks go awol and downtime stops working).

Mentioned in SAL (#wikimedia-operations) [2018-08-10T12:18:19Z] <gehel> restarting icinga on einsteinium - T196336

The icinga restart above did fix both awol passive checks and downtiming. In the logs of icinga.service I see multiple instances of errors like Aug 10 11:32:13 einsteinium icinga[1251]: External command error: Malformed command, which might be related to this problem.
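
For reference, a well-formed external command looks like the EXTERNAL COMMAND lines Icinga itself logs (see the syslog excerpts further down); a line the daemon cannot parse into that shape is rejected with the "Malformed command" error above. Below is a hedged illustration of the wire format only; it says nothing about what is actually corrupting the lines.

# Illustration only: the shape of a passive check result as written into the
# Icinga command file. Which malformation triggers the error is unknown.
import time

def format_passive_result(host, service, return_code, plugin_output):
    # [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<rc>;<output>
    return (f"[{int(time.time())}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{plugin_output}\n")

print(format_passive_result("americium", "check_load", 0,
                            "OK - load average: 1.91, 1.64, 1.56"), end="")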

Mentioned in SAL (#wikimedia-operations) [2018-08-20T18:05:31Z] <volans> restarting icinga, re-occurrence of T196336

Mentioned in SAL (#wikimedia-operations) [2018-09-07T08:49:45Z] <ema> passive checks awol on einsteinium, restarting icinga -- T196336

Potential re-occurrence at around 2018-10-04 08:00 (passive checks go awol).

Leaving a note here that Icinga is now at 1.13.4 on stretch rather than 1.11.6 on jessie. The issue we recently saw in T209757 may sound similar but should be unrelated. AFAICT we have not seen passive checks going awol since the upgrade.

Dzahn claimed this task.

Based on further comments by both Jaime and Riccardo that this could be closed and reopened if needed, and given that we have a new version in the meantime and it's been a couple of months, I am being bold and closing it.

We believe we had a new case of this on 2019-02-08 :-(

Dzahn removed Dzahn as the assignee of this task. Edited Feb 8 2019, 4:42 PM

Was it really both "passive checks awol" and "downtime stopped working" at the same time or just one of them?

What we knew was "passive checks awol", then a restart, then it was gone. We didn't test downtiming.

Passive checks went awol again today (2019-02-09), starting around 16:25 UTC. I watched it for a few minutes and observed, on icinga1001:

  • loadavg overall on the server was in the 7-9 range
  • icinga process looked CPU bound, hovering around 99% for the main process
  • icinga process looked pretty busy running nrpe checks
  • I saved some strace output to /root/icinga.strace.20190209 in case that's interesting
  • nsca process looked active but not particularly busy judging from strace output

I didn't think to check re. downtime, sorry!

I restarted the icinga process and everything recovered quickly.

I also got an email alert from our external monitoring today, and upon checking icinga1001 I noticed that its uptime coincides with a reboot around that time. See T214760#4941030 for more details. Unclear if related at this point.

All the passive checks went awol just now.
I tested a downtime on db1100 and it didn't work (using either icinga-downtime or the Icinga web UI).
While tailing the logs during the downtime test I got:

[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command

I restarted Icinga; the passive checks are recovering and downtimes are working again.

CDanis renamed this task from Icinga passive checks go awal and downtime stops working to Icinga passive checks go awol and downtime stops working.Mar 22 2019, 9:12 PM
CDanis lowered the priority of this task from High to Medium.

Change 499028 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] sync_icinga_state: kludge nsca stuckness

https://gerrit.wikimedia.org/r/499028

Change 499028 merged by CDanis:
[operations/puppet@production] sync_icinga_state: kludge nsca stuckness

https://gerrit.wikimedia.org/r/499028

Yesterday I had a fun adventure through UNIX internals diagnosing why the secondary icinga host winds up with lots of stuck nsca processes and fixed that with r499028 above.

While that was fun and interesting (and I might do a writeup on it), I don't think it gets to the root cause of this issue.

What I fixed is some nsca subprocesses getting stuck when Icinga is restarted (in particular a stop followed by a start; a reload sends SIGHUP which I believe should be fine). But in some sense those don't matter; each subprocess is a single instance of an incoming check result, and the daemon that is accepting connections and forking those subprocesses happily continues its work. New check results continue to flow.
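
One plausible mechanism for that stuckness, assuming the nsca children deliver results by opening Icinga's command FIFO for writing: opening a FIFO write-only blocks until some process has it open for reading, so a child that forks while the daemon is stopped (rather than just SIGHUPed) can hang in that open. A minimal, generic POSIX sketch of the behaviour follows; this is not the actual nsca code.

#!/usr/bin/env python3
# Demonstrates why a writer on a command FIFO can hang while the reader
# (Icinga) is stopped. Generic POSIX illustration only.
import errno
import os
import tempfile

fifo = os.path.join(tempfile.mkdtemp(), "icinga.cmd")
os.mkfifo(fifo)

# Non-blocking probe: with no reader attached, the open fails with ENXIO.
try:
    fd = os.open(fifo, os.O_WRONLY | os.O_NONBLOCK)
    os.close(fd)
except OSError as e:
    assert e.errno == errno.ENXIO
    print("no reader on the FIFO: a plain blocking open() would hang here")

# A plain open(fifo, "w") at this point would block until a reader appears,
# which is the state a forked nsca child could be left in across a stop/start.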

However this doesn't really explain why Icinga sometimes stops processing external commands altogether. I think that is likely an entirely different problem with a different underlying cause.

More debugging is needed here.

Change 499215 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] nsca-fail: add debug task note

https://gerrit.wikimedia.org/r/499215

Change 499215 merged by CDanis:
[operations/puppet@production] nsca-fail: add debug task note

https://gerrit.wikimedia.org/r/499215

Mentioned in SAL (#wikimedia-operations) [2019-04-08T09:19:58Z] <volans> restarting icinga on icinga1001 - T196336

The log from today makes me think that there is some sort of race condition between reloading Icinga (usually triggered by Puppet) and the passive checks coming in from NSCA.

I found many NSCA processes that I had to "unstuck" with the same technique we use in the sync_icinga_state script.

This is syslog:

Apr  8 08:45:46 icinga1001 puppet-agent[64679]: (/Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_services.cfg]) Scheduling refresh of Service[icinga]
[... SNIP ...]
Apr  8 08:46:16 icinga1001 icinga[133044]: Reloading icinga monitoring daemon configuration files: icinga.
Apr  8 08:46:16 icinga1001 icinga: Caught SIGHUP, restarting...
Apr  8 08:46:16 icinga1001 systemd[1]: Reloaded LSB: icinga host/service/network monitoring and management system.
Apr  8 08:46:16 icinga1001 puppet-agent[64679]: (/Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]) Triggered 'refresh' from 1 events
Apr  8 08:46:16 icinga1001 icinga: Icinga 1.13.4 starting... (PID=139254)
Apr  8 08:46:16 icinga1001 icinga: Local time is Mon Apr 08 08:46:16 UTC 2019
Apr  8 08:46:16 icinga1001 icinga: LOG VERSION: 2.0
Apr  8 08:46:19 icinga1001 puppet-agent[64679]: Applied catalog in 52.08 seconds
Apr  8 08:46:34 icinga1001 icinga: Event loop started...
Apr  8 08:46:34 icinga1001 icinga: SERVICE ALERT: ms-be1032;HP RAID;UNKNOWN;SOFT;1;(Service Check Timed Out)
Apr  8 08:46:34 icinga1001 icinga: SERVICE EVENT HANDLER: ms-be1032;HP RAID;UNKNOWN;SOFT;1;raid_handler!hpssacli!eqiad
Apr  8 08:46:34 icinga1001 icinga: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;americium;check_load;0;OK - load average: 1.91, 1.64, 1.56|load1=1.910;15.000;30.000;0; load5=1.640;10.000;25.000;0; load15=1.560;5.000;20.000;0;
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command

One thing we could try is to make the Icinga refresh from Puppet a bit smarter and also reload, or apply the same trick to, NSCA: -STOP first, then reload Icinga, then -CONT. I'm not sure, since NSCA creates a new process for each incoming external check; maybe it should be stopped while Icinga is reloading?

Thoughts?
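
A rough sketch of what that could look like is below; the unit names, the MainPID lookup and the exact ordering are assumptions for discussion, not necessarily what a puppet change would implement.

#!/usr/bin/env python3
# Sketch of the proposed "pause NSCA around the Icinga reload" sequence:
# SIGSTOP the nsca daemon, reload Icinga, SIGCONT nsca.
import os
import signal
import subprocess

def main_pid(unit):
    out = subprocess.check_output(
        ["systemctl", "show", "-p", "MainPID", "--value", unit], text=True
    )
    return int(out.strip())

def reload_icinga_with_nsca_paused():
    nsca_pid = main_pid("nsca.service")
    os.kill(nsca_pid, signal.SIGSTOP)      # stop forking new result handlers
    try:
        subprocess.run(["systemctl", "reload", "icinga"], check=True)
    finally:
        os.kill(nsca_pid, signal.SIGCONT)  # always resume, even if the reload fails

reload_icinga_with_nsca_paused()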

Mentioned in SAL (#wikimedia-operations) [2019-04-09T18:38:01Z] <cdanis> T196336 cdanis@icinga1001$ sudo systemctl restart nsca

Mentioned in SAL (#wikimedia-operations) [2019-04-09T18:42:12Z] <volans> restart icinga on icinga1001 - T196336

Mentioned in SAL (#wikimedia-operations) [2019-04-18T06:58:47Z] <moritzm> restarting icinga on icinga1001 (T196336)

Similar again:

Apr 17 21:20:17 icinga1001 icinga[26588]: Reloading icinga monitoring daemon configuration files: icinga.
Apr 17 21:20:17 icinga1001 systemd[1]: Reloaded LSB: icinga host/service/network monitoring and management system.
Apr 17 21:20:17 icinga1001 icinga: Caught SIGHUP, restarting...
Apr 17 21:20:17 icinga1001 puppet-agent[228998]: (/Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]) Triggered 'refresh' from 1 events
Apr 17 21:20:18 icinga1001 icinga: Icinga 1.13.4 starting... (PID=43112)
Apr 17 21:20:18 icinga1001 icinga: Local time is Wed Apr 17 21:20:18 UTC 2019
Apr 17 21:20:18 icinga1001 icinga: LOG VERSION: 2.0
Apr 17 21:20:20 icinga1001 puppet-agent[228998]: Applied catalog in 47.22 seconds
Apr 17 21:20:36 icinga1001 icinga: Event loop started...
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdata1001;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:06 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: civi1001;check_raid;0;OK: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=2, wk=2, fail=0, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdev1001;check_ssl;0;SSL OK - Certificate *.frdev.wikimedia.org valid until 2020-04-12 04:13:19 +0000 (expires in 360 days)
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frbackup2001;check_swap;0;SWAP OK - 98% free (7476 MB out of 7628 MB)
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: alnitak;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:05 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: mintaka;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:04 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frbackup2001;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:08 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: civi1001;check_rsyslog_backlog;0;OK
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: mintaka;check_zombie;0;PROCS OK: 0 processes with STATE = Z
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdev1001;check_swap;0;SWAP OK - 100% free (7629 MB out of 7629 MB)
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdata1001;check_zombie;0;PROCS OK: 0 processes with STATE = Z
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: alnitak;check_zombie;0;PROCS OK: 0 processes with STATE = Z
Apr 17 21:20:36 icinga1001 icinga: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;frbackup2001;check_zombie;0;PROCS OK: 0 processes with STATE = Z | procs=0;5;50;0;
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command

Change 504898 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] icinga: pause nsca on reloads

https://gerrit.wikimedia.org/r/504898

Change 504898 merged by CDanis:
[operations/puppet@production] icinga: pause nsca on reloads

https://gerrit.wikimedia.org/r/504898

Volans claimed this task.

Resolving, as this is an old task and that issue has been fixed, even though we have a similar one right now.

Mentioned in SAL (#wikimedia-operations) [2022-08-01T17:06:08Z] <mutante> alert1001 - systemctl restart nsca - pinged by fundraising tech because fundraising hosts have the "passive check is awol" issue again (T196336)

Yes, just restarting the service fixes it. Passive checks are coming in and turning OK again.

Dwisehaupt reopened this task as Open. Edited Aug 1 2022, 5:50 PM
Dwisehaupt subscribed.

We had another instance of this today that wasn't solved by restarting the service. mutante was kind enough to restart and try to debug live with us. As we were observing the status of the checks, we would see the service restarting independently of our attempts. Relevant log lines should be between 2022-08-01 16:00:00 and 2022-08-01 18:00:00. The problem isn't resolved for today yet.

We did see these entries in our logs, but they are not unique and have happened in the past:

user.warning: 2022-08-01T15:57:12.813867+00:00 frdb2001 nagios_nsca[8385]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:13.889435+00:00 frdb2003 nagios_nsca[392116]: STDERR [208.80.154.88] Error: 
user.warning: 2022-08-01T15:57:14.073227+00:00 frpm2001 nagios_nsca[25048]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:14.850808+00:00 frdb1003 nagios_nsca[2327]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:15.056448+00:00 frban2001 nagios_nsca[22520]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:15.158616+00:00 civi2001 nagios_nsca[17866]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:16.014029+00:00 frbackup2002 nagios_nsca[272650]: STDERR [208.80.154.88] Error: 
user.warning: 2022-08-01T15:57:16.048736+00:00 pay-lvs1003 nagios_nsca[14973]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:16.079603+00:00 frdev1002 nagios_nsca[26368]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:16.161076+00:00 frqueue1004 nagios_nsca[9153]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:16.241847+00:00 frmx1001 nagios_nsca[28252]: [208.80.154.88] Error: Timeout after %d seconds
user.warning: 2022-08-01T15:57:16.635599+00:00 payments1005 nagios_nsca[30048]: [208.80.154.88] Error: Timeout after %d seconds

and:

user.warning: 2022-08-01T13:55:26.522219+00:00 frdb1005 nagios_nsca[381982]: STDERR [208.80.153.84] Error: 
user.warning: 2022-08-01T13:55:26.529264+00:00 fran2001 nagios_nsca[32763]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:26.530668+00:00 frauth2001 nagios_nsca[11261]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:26.531333+00:00 frlog2001 nagios_nsca[19886]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:27.427943+00:00 civi1002 nagios_nsca[3032231]: STDERR [208.80.153.84] Error: 
user.warning: 2022-08-01T13:55:28.430978+00:00 frdev1001 nagios_nsca[9701]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:29.448834+00:00 frmx2001 nagios_nsca[5544]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:30.435274+00:00 frdata1002 nagios_nsca[1881]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:30.436984+00:00 payments1007 nagios_nsca[27068]: [208.80.153.84] Error: Timeout after %d seconds
user.warning: 2022-08-01T13:55:31.437047+00:00 civi1001 nagios_nsca[20743]: [208.80.153.84] Error: Timeout after %d seconds

So, on alert1001.wikimedia.org, in tail -f /var/log/icinga/icinga.log you can see both. A lot of:

[1659380327] SERVICE ALERT: frdev1002;check_rsyslog_backlog;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frdev1002;check_timesync;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frdev1002;check_zombie;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frlog1001;check_disk;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frmon2001;check_disk;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frmon2001;check_load;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frmon2001;check_memory;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
[1659380327] SERVICE ALERT: frmon2001;check_procs;WARNING;SOFT;1;WARNING: passive check is awol - see T196336
..
[1659380311] Warning: The results of service 'check_disk' on host 'payments1008' are stale by 0d 0h 27m 21s (threshold=0d 0h 6m 0s).  I'm forcing an immediate check of the service.
[1659380311] Warning: The results of service 'check_endpoints' on host 'payments1008' are stale by 0d 0h 27m 20s (threshold=0d 0h 6m 0s).  I'm forcing an immediate check of the service.
[1659380311] Warning: The results of service 'check_load' on host 'payments1008' are stale by 0d 0h 27m 19s (threshold=0d 0h 6m 0s).  I'm forcing an immediate check of the service.
[1659380311] Warning: The results of service 'check_payments_wiki' on host 'payments2001' are stale by 0d 0h 29m 16s (threshold=0d 0h 6m 0s).  I'm forcing an immediate check of the service.

but also a lot of

[1659380303] PASSIVE SERVICE CHECK: civi1002;check_memory;0;OK Memory 1% used
[1659380303] SERVICE ALERT: civi1002;check_memory;OK;SOFT;2;OK Memory 1% used
[1659380303] PASSIVE SERVICE CHECK: frlog2001;check_rsyslog_backlog;0;OK
[1659380303] SERVICE ALERT: frlog2001;check_rsyslog_backlog;OK;SOFT;2;OK
[1659380303] PASSIVE SERVICE CHECK: frdb1005;check_memory;0;OK Memory 73% used
[1659380303] SERVICE ALERT: frdb1005;check_memory;OK;SOFT;2;OK Memory 73% used
[1659380303] PASSIVE SERVICE CHECK: frauth2001;check_procs;0;PROCS OK: 223 processes
[1659380303] SERVICE ALERT: frauth2001;check_procs;OK;SOFT;2;PROCS OK: 223 processes

when it works just fine.
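
For what it's worth, the two log patterns above are the two sides of Icinga's freshness checking: each of these passive services has a staleness threshold (0d 0h 6m 0s in the warnings), and once the last result is older than that, Icinga forces the configured active check, which for these services appears to just report the "passive check is awol" text. A small illustrative sketch of that decision, not Icinga's actual code:

# Sketch of the freshness decision behind the "stale by ..." warnings above.
# The 6-minute threshold comes from those log lines; the rest is illustrative.
import time

FRESHNESS_THRESHOLD_S = 6 * 60  # threshold=0d 0h 6m 0s

def is_stale(last_passive_result_ts, now=None):
    now = now if now is not None else time.time()
    return (now - last_passive_result_ts) > FRESHNESS_THRESHOLD_S

# When is_stale() is true Icinga forces an active check; for these frack
# services that check only emits "WARNING: passive check is awol - see
# T196336", producing the alert flood whenever results stop arriving.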

Doing a systemctl restart nsca (the daemon that passively accepts check results) makes it recover but only for a short time.

In general it feels like it's just constantly flapping between working just fine and being broken.

Since these are just UDP packets sent from frack over to alert1001, we don't really know if they arrive. It could have a bunch of causes, including networking.

Note that Icinga itself (not nsca) also gets restarted a lot by something, which I noticed in the logs:

[1659380345] Caught SIGHUP, restarting...
[1659380345] Icinga 1.14.2 starting... (PID=3942)

Finally we have these cases where a host IS sending packets but does not exist in Icinga (yet?).

[1659380312] Warning:  Passive check result was received for service 'check_timesync' on host 'fran2001', but the host could not be found!

P.S. If you are wondering why these are passive checks and why it's all a special case for fundraising hosts: this way we don't need to make connections from prod to frack and open firewall holes; instead prod can just sit there and wait for incoming "news" from frack.

This is good for PCI compliance, AFAIR, and it might also be a better way of monitoring other non-frack things. It is a way to scale Icinga and reduce the load on it.
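
For readers unfamiliar with the mechanics: each frack host pushes its results with the stock send_nsca client, which reads tab-separated "host<TAB>service<TAB>return code<TAB>plugin output" lines on stdin and ships them to the nsca daemon on the Icinga server. A hedged sketch of one submission; the binary path, port and config file are assumptions:

#!/usr/bin/env python3
# Sketch: push one passive result from a frack host to the nsca daemon on
# the Icinga server via the stock send_nsca client. Paths/port are assumed.
import subprocess

def send_passive_result(host, service, return_code, output,
                        nsca_server="alert1001.wikimedia.org",
                        send_nsca_cfg="/etc/send_nsca.cfg"):
    payload = f"{host}\t{service}\t{return_code}\t{output}\n"
    subprocess.run(
        ["/usr/sbin/send_nsca", "-H", nsca_server, "-p", "5667",
         "-c", send_nsca_cfg],
        input=payload.encode(),
        check=True,
    )

send_passive_result("frdev1002", "check_zombie", 0,
                    "PROCS OK: 0 processes with STATE = Z")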

Dzahn removed Volans as the assignee of this task.Aug 1 2022, 7:07 PM

Finally we have these cases where a host IS sending packets but does not exist in Icinga (yet?).

I found and removed 3 cases where a new host was sending packets before we got it into the Icinga config; they were fran2001, frlog1002, and frdb1006.

Sanity check, not related to T164206#3434924:

alert1001# df -h
Filesystem            Size  Used Avail Use% Mounted on
[...]
none                  1.0G   88M  937M   9% /var/icinga-tmpfs

Also see T314353#8122412 - it feels to me like both of these could come down to the general load and latency on the Icinga server.