Page MenuHomePhabricator

Icinga passive checks go awol and downtime stops working
Open, MediumPublic

Description

In the last past days, at least 2 occurrences of the following happened:

At some point, passive checks all start returning:

WARNING: passive check is awol

And downtime actions doesn't seem to no longer take effect (the command is sent and is successful, but the service doesn't effectively get its scheduled downtime).

Checks/alerting in general continue to work as intended.

Last time this happened was 2018-06-04T07:30. A restart seems to fix both issues, at least temporarily. Suspecting some kind of malfunction that gets reseted, but not fixed, on restart, but builds up until it happens again.

Details

Related Gerrit Patches:
operations/puppet : productionicinga: pause nsca on reloads
operations/puppet : productionnsca-fail: add debug task note
operations/puppet : productionsync_icinga_state: kludge nsca stuckness

Event Timeline

jcrespo created this task.Jun 4 2018, 8:20 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 4 2018, 8:20 AM
MoritzMuehlenhoff triaged this task as High priority.Jun 4 2018, 8:22 AM
Gehel added a subscriber: Gehel.Jun 4 2018, 9:04 AM
herron added a subscriber: herron.Jun 4 2018, 3:34 PM
Volans added a subscriber: Jgreen.Jun 5 2018, 7:58 AM

While investigating the possible root causes for this I discovered that we had some new frack hosts just installed last week that were sending metrics although not yet fully configured. In particular they are not present in Icinga hostlist, hence Icinga discards those messages and in theory that shouldn't harm. But to avoid to have too many variables in place I've asked @Jgreen if it was possible to avoid sending the metrics at all until they are fully configured and he very kindly accepted and already implemented.

I've confirmed that we don't have any more discarded external commands log in icinga logs. So if the awol re-happen now means that is totally unrelated to the above. In the meanwhile I will keep having a look to see if anything else comes up.

This happened again today unfortunately. And because I don't see any logs of spurious passive checks from other frack hosts, I guess we have to discard the hypothesis that it might have been that the cause of the issue.

I will start looking into other possible root causes in the next days.

Vvjjkkii renamed this task from Icinga passive checks go awal and downtime stops working to jpbaaaaaaa.Jul 1 2018, 1:05 AM
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from jpbaaaaaaa to Icinga passive checks go awal and downtime stops working.Jul 2 2018, 9:38 AM
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

@Volans I remember you giving and update on this? Is this still a thing? Could it have been not happening for a while- in which case, it should be closed?

@jcrespo We didn't see any re-occurrence of this recently but it did happened once after the change in frack, so we don't have a root cause honestly. We could resolve it and re-open if it re-happens. I'm open for suggestions.

Mentioned in SAL (#wikimedia-operations) [2018-08-10T12:18:19Z] <gehel> restarting icinga on einsteinium - T196336

Potential reocurrence at around 2018-08-10 11:56:02 (passive checks go awal and downtime stops working).

ema added a subscriber: ema.Aug 10 2018, 12:30 PM

Mentioned in SAL (#wikimedia-operations) [2018-08-10T12:18:19Z] <gehel> restarting icinga on einsteinium - T196336

The icinga restart above did fix both awol passive checks and downtiming. In the logs of icinga.service I see multiple instances of errors like Aug 10 11:32:13 einsteinium icinga[1251]: External command error: Malformed command, which might be related to this problem.

Mentioned in SAL (#wikimedia-operations) [2018-08-20T18:05:31Z] <volans> restarting icinga, re-occurrence of T196336

Mentioned in SAL (#wikimedia-operations) [2018-09-07T08:49:45Z] <ema> passive checks awol on einsteinium, restarting icinga -- T196336

Mentioned in SAL (#wikimedia-operations) [2018-09-12T17:38:55Z] <akosiaris> restart icinga T196336

Potential re-ocurrence at around 2018-10-04 08:00 (passive checks go awol)

Mentioned in SAL (#wikimedia-operations) [2018-10-04T08:09:16Z] <marostegui> Restart icinga T196336

Dzahn added a subscriber: Dzahn.Nov 21 2018, 12:16 AM

leaving a note here that Icinga is now on 1.13.4 on stretch rather than 1.11.6 on jessie. the issue we recently saw on T209757 may sound similar but should be unrelated. afaict we have not seen passive checks going awol since the upgrade.

Dzahn closed this task as Resolved.Dec 17 2018, 4:38 PM
Dzahn claimed this task.

Based on further comments by both Jaime and Riccardo that this could be closed and reopened if needed.. and that we have a new version meanwhile and it's been a couple months, i am being bold and closing it.

jcrespo reopened this task as Open.Feb 8 2019, 3:42 PM

We believe we had a new case of this on the 2018-02-08 :-(

Dzahn removed Dzahn as the assignee of this task.EditedFeb 8 2019, 4:42 PM

Was it really both "passive checks awol" and "downtime stopped working" at the same time or just one of them?

The thing that we knew is "passive checks awol", then restart, then gone. We didn't test downtiming.

Jgreen added a comment.Feb 9 2019, 5:01 PM

Passive checks went awol again today (2019-02-09) starting around 16:25 UTC. I watched it for a few minutes and observed, on icinga1001

  • loadavg overall on the server was in the 7-9 range
  • icinga process looked CPU bound, hovering around 99% for the main process
  • icinga process looked pretty busy running nrpe checks
  • I saved some strace output to /root/icinga.strace.20190209 in case that's interesting
  • nsca process looked active but not particularly busy judging from strace output

I didn't think to check re. downtime, sorry!

I restarted the icinga process and everything recovered quickly.

I got also an email alert from our external monitoring today and upon checking icinga1001 I noticed that the uptime coincide with a reboot around that time. See T214760#4941030 for more details. Unclear if related at this point.

All the passive checks went awol just now.
I tested a downtime to db1100 and it didn't work (either using icinga-downtime or the icinga web ui)
While tailing the logs during the downtimes test I got:

[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command
[1549868178] External command error: Malformed command

I restarted icinga and they are recovering and downtimes are working again

jijiki added a subscriber: jijiki.Feb 11 2019, 7:40 AM
CDanis renamed this task from Icinga passive checks go awal and downtime stops working to Icinga passive checks go awol and downtime stops working.Mar 22 2019, 9:12 PM
CDanis lowered the priority of this task from High to Medium.

Change 499028 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] sync_icinga_state: kludge nsca stuckness

https://gerrit.wikimedia.org/r/499028

Change 499028 merged by CDanis:
[operations/puppet@production] sync_icinga_state: kludge nsca stuckness

https://gerrit.wikimedia.org/r/499028

CDanis added a subscriber: CDanis.Mar 26 2019, 1:39 PM

Yesterday I had a fun adventure through UNIX internals diagnosing why the secondary icinga host winds up with lots of stuck nsca processes and fixed that with r499028 above.

While that was fun and interesting (and I might do a writeup on it), I don't think it gets to the root cause of this issue.

What I fixed is some nsca subprocesses getting stuck when Icinga is restarted (in particular a stop followed by a start; a reload sends SIGHUP which I believe should be fine). But in some sense those don't matter; each subprocess is a single instance of an incoming check result, and the daemon that is accepting connections and forking those subprocesses happily continues its work. New check results continue to flow.

However this doesn't really explain why Icinga sometimes stops processing external commands altogether. I think that is likely an entirely different problem with a different underlying cause.

More debugging is needed here.

Change 499215 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] nsca-fail: add debug task note

https://gerrit.wikimedia.org/r/499215

Change 499215 merged by CDanis:
[operations/puppet@production] nsca-fail: add debug task note

https://gerrit.wikimedia.org/r/499215

Mentioned in SAL (#wikimedia-operations) [2019-04-08T09:19:58Z] <volans> restarting icinga on icinga1001 - T196336

Volans added a comment.Apr 8 2019, 9:31 AM

The log from today makes me thing that there is some sort of race-condition when we reload icinga (triggered by puppet usually) and the passive checks coming in from NSCA.

I found many NSCA processes that I had to "unstuck" with the same technique we use in the sync_icinga_state script.

This is syslog:

Apr  8 08:45:46 icinga1001 puppet-agent[64679]: (/Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_services.cfg]) Scheduling refresh of Service[icinga]
[... SNIP ...]
Apr  8 08:46:16 icinga1001 icinga[133044]: Reloading icinga monitoring daemon configuration files: icinga.
Apr  8 08:46:16 icinga1001 icinga: Caught SIGHUP, restarting...
Apr  8 08:46:16 icinga1001 systemd[1]: Reloaded LSB: icinga host/service/network monitoring and management system.
Apr  8 08:46:16 icinga1001 puppet-agent[64679]: (/Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]) Triggered 'refresh' from 1 events
Apr  8 08:46:16 icinga1001 icinga: Icinga 1.13.4 starting... (PID=139254)
Apr  8 08:46:16 icinga1001 icinga: Local time is Mon Apr 08 08:46:16 UTC 2019
Apr  8 08:46:16 icinga1001 icinga: LOG VERSION: 2.0
Apr  8 08:46:19 icinga1001 puppet-agent[64679]: Applied catalog in 52.08 seconds
Apr  8 08:46:34 icinga1001 icinga: Event loop started...
Apr  8 08:46:34 icinga1001 icinga: SERVICE ALERT: ms-be1032;HP RAID;UNKNOWN;SOFT;1;(Service Check Timed Out)
Apr  8 08:46:34 icinga1001 icinga: SERVICE EVENT HANDLER: ms-be1032;HP RAID;UNKNOWN;SOFT;1;raid_handler!hpssacli!eqiad
Apr  8 08:46:34 icinga1001 icinga: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;americium;check_load;0;OK - load average: 1.91, 1.64, 1.56|load1=1.910;15.000;30.000;0; load5=1.640;10.000;25.000;0; load15=1.560;5.000;20.000;0;
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command
Apr  8 08:46:34 icinga1001 icinga: External command error: Malformed command

One thing we could try is to make the icinga refresh from Puppet a bit more smart and also reload or do the same trick to NCSA, like -STOP first, then reload icinga, then -CONT. Not sure as NCSA will create a new process for each incoming external check, maybe it should be stopped while Icinga is reloading?

Thoughts?

Mentioned in SAL (#wikimedia-operations) [2019-04-09T18:38:01Z] <cdanis> T196336 cdanis@icinga1001$ sudo systemctl restart nsca

Mentioned in SAL (#wikimedia-operations) [2019-04-09T18:42:12Z] <volans> restart icinga on icinga1001 - T196336

Mentioned in SAL (#wikimedia-operations) [2019-04-18T06:58:47Z] <moritzm> restarting icinga on icinga1001 (T196336)

similar again

Apr 17 21:20:17 icinga1001 icinga[26588]: Reloading icinga monitoring daemon configuration files: icinga.
Apr 17 21:20:17 icinga1001 systemd[1]: Reloaded LSB: icinga host/service/network monitoring and management system.
Apr 17 21:20:17 icinga1001 icinga: Caught SIGHUP, restarting...
Apr 17 21:20:17 icinga1001 puppet-agent[228998]: (/Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]) Triggered 'refresh' from 1 events
Apr 17 21:20:18 icinga1001 icinga: Icinga 1.13.4 starting... (PID=43112)
Apr 17 21:20:18 icinga1001 icinga: Local time is Wed Apr 17 21:20:18 UTC 2019
Apr 17 21:20:18 icinga1001 icinga: LOG VERSION: 2.0
Apr 17 21:20:20 icinga1001 puppet-agent[228998]: Applied catalog in 47.22 seconds
Apr 17 21:20:36 icinga1001 icinga: Event loop started...
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdata1001;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:06 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: civi1001;check_raid;0;OK: LinuxRAID /dev/md/0: act=2, wk=2, fail=0, sp=0: /dev/md/1: act=2, wk=2, fail=0, sp=0: /dev/md/2: act=2, wk
=2, fail=0, sp=0: /dev/md/3: act=2, wk=2, fail=0, sp=0
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdev1001;check_ssl;0;SSL OK - Certificate *.frdev.wikimedia.org valid until 2020-04-12 04:13:19 +0000 (expires in 360 days)
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frbackup2001;check_swap;0;SWAP OK - 98% free (7476 MB out of 7628 MB)
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: alnitak;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:05 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: mintaka;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:04 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frbackup2001;check_timesync;0;OK: synced at Wed 2019-04-17 21:20:08 UTC.
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: civi1001;check_rsyslog_backlog;0;OK
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: mintaka;check_zombie;0;PROCS OK: 0 processes with STATE = Z
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdev1001;check_swap;0;SWAP OK - 100% free (7629 MB out of 7629 MB)
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: frdata1001;check_zombie;0;PROCS OK: 0 processes with STATE = Z
Apr 17 21:20:36 icinga1001 icinga: PASSIVE SERVICE CHECK: alnitak;check_zombie;0;PROCS OK: 0 processes with STATE = Z
Apr 17 21:20:36 icinga1001 icinga: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;frbackup2001;check_zombie;0;PROCS OK: 0 processes with STATE = Z | procs=0;5;50;0;
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command
Apr 17 21:20:36 icinga1001 icinga: External command error: Malformed command

Change 504898 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] icinga: pause nsca on reloads

https://gerrit.wikimedia.org/r/504898

Change 504898 merged by CDanis:
[operations/puppet@production] icinga: pause nsca on reloads

https://gerrit.wikimedia.org/r/504898