
Icinga loses downtime entries, causing alert and page spam
Closed, Resolved · Public

Description

For example:

	May 01, 2017 16:00	

Service Ok[2017-05-01 16:27:32] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;OK;HARD;10;OK slave_sql_lag Replication lag: 0.00 seconds
Service Warning[2017-05-01 16:25:32] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;WARNING;HARD;10;WARNING slave_sql_lag Replication lag: 216.50 seconds
Program Start[2017-05-01 16:21:20] Icinga 1.11.6 starting... (PID=48297)
Program Restart[2017-05-01 16:21:19] Caught SIGHUP, restarting...
Service Ok[2017-05-01 16:15:19] SERVICE ALERT: db1092;MariaDB Slave IO: s5;OK;HARD;3;OK slave_io_state Slave_IO_Running: Yes

May 01, 2017 15:00	

Service Critical[2017-05-01 15:56:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;CRITICAL;HARD;10;CRITICAL slave_sql_lag Replication lag: 639.81 seconds
Service Critical[2017-05-01 15:55:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;CRITICAL;SOFT;9;CRITICAL slave_sql_lag Replication lag: 579.44 seconds
Service Critical[2017-05-01 15:54:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;CRITICAL;SOFT;8;CRITICAL slave_sql_lag Replication lag: 519.56 seconds
Service Critical[2017-05-01 15:53:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;CRITICAL;SOFT;7;CRITICAL slave_sql_lag Replication lag: 459.30 seconds
Service Critical[2017-05-01 15:52:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;CRITICAL;SOFT;6;CRITICAL slave_sql_lag Replication lag: 399.35 seconds
Service Critical[2017-05-01 15:51:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;CRITICAL;SOFT;5;CRITICAL slave_sql_lag Replication lag: 339.26 seconds
Service Warning[2017-05-01 15:50:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;WARNING;SOFT;4;WARNING slave_sql_lag Replication lag: 279.25 seconds
Service Warning[2017-05-01 15:49:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;WARNING;SOFT;3;WARNING slave_sql_lag Replication lag: 219.23 seconds
Service Critical[2017-05-01 15:49:19] SERVICE ALERT: db1092;MariaDB Slave IO: s5;CRITICAL;HARD;3;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 'repl@db1063.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1063.eqiad.wmnet' (111 "Connection refused")
Service Warning[2017-05-01 15:48:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;WARNING;SOFT;2;WARNING slave_sql_lag Replication lag: 159.44 seconds
Service Critical[2017-05-01 15:48:19] SERVICE ALERT: db1092;MariaDB Slave IO: s5;CRITICAL;SOFT;2;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 'repl@db1063.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1063.eqiad.wmnet' (111 "Connection refused")
Service Warning[2017-05-01 15:47:29] SERVICE ALERT: db1092;MariaDB Slave Lag: s5;WARNING;SOFT;1;WARNING slave_sql_lag Replication lag: 99.46 seconds
Service Critical[2017-05-01 15:47:19] SERVICE ALERT: db1092;MariaDB Slave IO: s5;CRITICAL;SOFT;1;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master 'repl@db1063.eqiad.wmnet:3306' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'db1063.eqiad.wmnet' (111 "Connection refused")
Service entered a period of scheduled downtime[2017-05-01 15:44:54] SERVICE DOWNTIME ALERT: db1092;MariaDB Slave SQL: s5;STARTED; Service has entered a period of scheduled downtime
Service entered a period of scheduled downtime[2017-05-01 15:44:54] SERVICE DOWNTIME ALERT: db1092;MariaDB Slave Lag: s5;STARTED; Service has entered a period of scheduled downtime
Service entered a period of scheduled downtime[2017-05-01 15:44:54] SERVICE DOWNTIME ALERT: db1092;MariaDB Slave IO: s5;STARTED; Service has entered a period of scheduled downtime

Despite the scheduled downtime (set until 2017-05-03), at 16:21:32 UTC we get:
<icinga-wm> PROBLEM - MariaDB Slave Lag: s5 on db1092 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 732.04 seconds

While not as bad as failing to send alerts when they should be sent, this is very annoying and can cause bigger issues due to the false positives. This has been happening for at least a week.

Event Timeline

The issue seems to coincide with restarts of the service.

It seems a service restart causes this, but we usually only do reloads on config changes (via puppet), not hard restarts.

This just happened again for the nth time.

We ruled out the sync process: this happened at 18:21 and the sync happens much later, at 18:33 (and in the right direction, tegmen -> einsteinium). Maybe it is not random after all, and it happens every time the server is restarted before the downtimes are dropped, because of a typical config change:

May  2 18:21:11 tegmen puppet-agent[49918]: (/Stage[main]/Icinga/Service[icinga]) Triggered 'refresh' from 2 events
May  2 18:21:11 tegmen icinga: Caught SIGHUP, restarting...
May  2 18:21:12 tegmen icinga: Icinga 1.11.6 starting... (PID=48297)
May  2 18:21:12 tegmen icinga: Local time is Tue May 02 18:21:12 UTC 2017
May  2 18:21:12 tegmen icinga: LOG VERSION: 2.0
May  2 18:21:12 tegmen puppet-agent[49918]: Finished catalog run in 40.70 seconds
May  2 18:21:17 tegmen icinga: Event loop started...
May  2 18:21:24 tegmen icinga: SERVICE EVENT HANDLER: ocg1003;MD RAID;OK;HARD;1;raid_handler!md!eqiad
May  2 18:21:24 tegmen icinga: SERVICE ALERT: ocg1003;Disk space;CRITICAL;SOFT;1;DISK CRITICAL - free space: /srv/deployment/ocg/output 9494 MB (3% inode=98%):
May  2 18:21:24 tegmen icinga: SERVICE EVENT HANDLER: mc1021;MD RAID;OK;HARD;1;raid_handler!md!eqiad
May  2 18:21:34 tegmen icinga: SERVICE NOTIFICATION: irc;db1015;MariaDB Slave Lag: s3;CRITICAL;notify-service-by-irc;CRITICAL slave_sql_lag Replication lag: 29393.05 seconds

I think this is almost an Unbreak Now!, if this keeps happening.

Hi, other users have this problem, see https://github.com/Icinga/icinga2/issues/4614 (note that the user is using Icinga 2 as the backend, but with the classic web UI). I haven't experienced this problem myself though.

I've had a look at this. I can't reliably reproduce anything yet. Things I've already completely ruled out:

  • Service restarts (service icinga restart). The process on tegmen has been running since Apr 20, so it can't be that.
  • The sync process between the active and the passive icinga servers. As noted above, that seems to work fine and can't be it.
  • Submitting a scheduled downtime, checking it was there, and a few seconds later reloading icinga. The downtime was still there.

One thing that did seem for a while to reproduce it (but might be a red herring) was:

  • Submitting the acknowledgement and then reloading icinga without waiting. That seems to have caused the loss of ALL scheduled downtimes for the hosts I was viewing in the icinga web interface at that point in time. Funnily enough, a refresh one second later did show all the downtimes (and the newly scheduled one) normally. It does seem like it's just a bug in the web interface though; I am basing this on the fact that, reloading the interface just a few seconds later, the correct information about downtimes is displayed. Reproducing this seems easy and does not even need the submission of an acknowledgement: just force-refreshing a page in icinga with downtimes twice, right after an icinga reload, exhibits the bug (but does not explain the alerts being sent).

Happened again today at 10:47.

I am thinking this was related to T166203#3294072. TL;DR: a daemonized puppet agent was running on tegmen. The timing of the killing of that daemonized agent and the issue reported by jynus above seem to line up.

@akosiaris actually this happened ~2h after I killed the daemonized puppet on tegmen... I'm not sure this explanation can still be valid, thoughts?

Hm, I did a mental -2h on Jaime's reported time above to reach UTC and it lined up. For some reason I still do that when I see no TZ information, even though I wish UTC were the default for all reported timestamps. That being said, it does line up with a regular puppet run from cron. Puppet on tegmen runs on the 16th and the 46th minute of the hour.

I reported UTC times; I always do.

Yes @akosiaris, all the times it happened were during a cron puppet run, and it seems to me only when there are changes in the generated puppet_hosts.cfg config file.

Change 356021 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Timestamp puppet-run logs

https://gerrit.wikimedia.org/r/356021

Thanks for setting an example on this one.

So the daemonized puppet agent theory is ruled out :-(.

Looking at the logs from the cron puppet runs, I can't say I see anything clear. While it was a tad difficult to figure out the correct run due to the log not being timestamped (fixed in https://gerrit.wikimedia.org/r/356021), I've tracked down the puppet run. There are a number of changes to the icinga configuration in that run, but nothing out of the ordinary that would justify this behavior.

Yes @akosiaris, all the times it happened were during a cron puppet run, and it seems to me only when there are changes in the generated puppet_hosts.cfg config file.

Actually, on this run I only see changes to /etc/icinga/puppet_services.cfg.

While it was a tad difficult to figure out the correct run due to the log not being timestamped

You can see the timestamps of puppet runs in the systemd journal, or I think in syslog, but at least somewhere on the hosts.

I think this happened again. It didn't page because now I disable alerts every time I reimage a host, but page spam will continue until this is fixed or worked around.

Change 358962 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Revert "Switch einsteinium and tegmen roles"

https://gerrit.wikimedia.org/r/358962

Up to now we have not been able to reproduce this in any way, only notice it when it happens. We have no clear reproduction case and we do not know what causes this.

We've left icinga on tegmen for this long in order to manage to reliably reproduce this issue, but that hasn't happened yet. In last week's meeting it was proposed to switch back to einsteinium and see whether this happens again or not, which might help attribute the problem to the current icinga host (tegmen).

Change 358962 merged by Alexandros Kosiaris:
[operations/puppet@production] Revert "Switch einsteinium and tegmen roles"

https://gerrit.wikimedia.org/r/358962

Change 358967 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Revert "Switchover icinga.wikimedia.org to tegmen"

https://gerrit.wikimedia.org/r/358967

Change 358967 merged by Alexandros Kosiaris:
[operations/dns@master] Revert "Switchover icinga.wikimedia.org to tegmen"

https://gerrit.wikimedia.org/r/358967

DNS switched over, hosts exchanged roles, failover is done. einsteinium is once again the icinga host. Now let's see if this issue shows up again.

I have not seen downtimes getting lost since the switch, but I confirm I have seen a probably related issue: services visually losing their downtime state (but not the matching comment), and then getting it back on reload. At first I thought it was a glitch, but it happened once on the web and once using the API, and now I am 100% sure nothing was done in between; this looks like a race condition and/or a software bug in how the downtime is handled.

This just happened.

db2054 just alerted. Looking at the icinga logs, a scheduled downtime for both the host and its services was set until June 28th:

[Wed Jun 21 13:22:33 2017] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;db2054;1498051352;1498656152;1;0;604800;marvin-bot;T166208
[Wed Jun 21 13:22:33 2017] EXTERNAL COMMAND: SCHEDULE_HOST_SVC_DOWNTIME;db2054;1498051352;1498656152;1;0;604800;marvin-bot;T166208
[Wed Jun 21 13:22:34 2017] HOST DOWNTIME ALERT: db2054;STARTED; Host has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;salt-minion processes;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;puppet last run;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;mysqld processes;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;dhclient process;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;configured eth;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;SSH;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;MariaDB disk space;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;MariaDB Slave SQL: s7;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;MariaDB Slave Lag: s7;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;MariaDB Slave IO: s7;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;IPMI Temperature;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;HP RAID;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;Disk space;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;DPKG;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;Check whether ferm is active by checking the default input chain;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;Check the NTP synchronisation status of timesyncd;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;Check systemd state;STARTED; Service has entered a period of scheduled downtime
[Wed Jun 21 13:22:34 2017] SERVICE DOWNTIME ALERT: db2054;Check size of conntrack table;STARTED; Service has entered a period of scheduled downtime
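
(For reference, in those SCHEDULE_HOST_DOWNTIME / SCHEDULE_HOST_SVC_DOWNTIME external commands, the epoch timestamps 1498051352 and 1498656152 correspond to Jun 21 13:22:32 and Jun 28 13:22:32 UTC, the following 1;0 are the fixed flag and trigger id, and 604800 is the seven-day duration in seconds; field layout as per the standard Nagios/Icinga external command syntax.)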

For a period of time the service would change state but no alerts would be created. That can be witnessed in the logs

[Wed Jun 21 13:43:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;1;WARNING slave_sql_lag Replication lag: 105.51 seconds
[Wed Jun 21 13:44:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;2;WARNING slave_sql_lag Replication lag: 165.81 seconds
[Wed Jun 21 13:45:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;3;WARNING slave_sql_lag Replication lag: 226.20 seconds
[Wed Jun 21 13:46:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;4;WARNING slave_sql_lag Replication lag: 286.35 seconds
[Wed Jun 21 13:47:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;5;CRITICAL slave_sql_lag Replication lag: 346.58 seconds
[Wed Jun 21 13:48:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;6;CRITICAL slave_sql_lag Replication lag: 406.88 seconds
[Wed Jun 21 13:49:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;7;CRITICAL slave_sql_lag Replication lag: 467.18 seconds
[Wed Jun 21 13:50:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;8;CRITICAL slave_sql_lag Replication lag: 527.35 seconds
[Wed Jun 21 13:51:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;9;CRITICAL slave_sql_lag Replication lag: 587.72 seconds
[Wed Jun 21 13:52:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;HARD;10;CRITICAL slave_sql_lag Replication lag: 647.91 seconds
[Wed Jun 21 13:54:35 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;OK;HARD;10;OK slave_sql_lag Replication lag: 24.27 seconds
[Wed Jun 21 14:20:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;1;WARNING slave_sql_lag Replication lag: 70.91 seconds
[Wed Jun 21 14:21:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;2;WARNING slave_sql_lag Replication lag: 131.08 seconds
[Wed Jun 21 14:22:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;3;WARNING slave_sql_lag Replication lag: 191.21 seconds
[Wed Jun 21 14:23:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;WARNING;SOFT;4;WARNING slave_sql_lag Replication lag: 251.50 seconds
[Wed Jun 21 14:24:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;5;CRITICAL slave_sql_lag Replication lag: 311.43 seconds
[Wed Jun 21 14:25:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;6;CRITICAL slave_sql_lag Replication lag: 371.66 seconds
[Wed Jun 21 14:26:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;7;CRITICAL slave_sql_lag Replication lag: 431.44 seconds
[Wed Jun 21 14:27:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;8;CRITICAL slave_sql_lag Replication lag: 491.49 seconds
[Wed Jun 21 14:28:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;9;CRITICAL slave_sql_lag Replication lag: 551.42 seconds
[Wed Jun 21 14:29:43 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;HARD;10;CRITICAL slave_sql_lag Replication lag: 611.44 seconds
[Wed Jun 21 14:46:39 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;OK;HARD;10;OK slave_sql_lag Replication lag: 0.00 seconds

2 CRITICAL HARD states without any alert

Then today

...
[Thu Jun 22 12:55:47 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;SOFT;9;CRITICAL slave_sql_lag Replication lag: 553.44 seconds
[Thu Jun 22 12:56:47 2017] SERVICE ALERT: db2054;MariaDB Slave Lag: s7;CRITICAL;HARD;10;CRITICAL slave_sql_lag Replication lag: 613.54 seconds
[Thu Jun 22 12:56:47 2017] SERVICE NOTIFICATION: irc;db2054;MariaDB Slave Lag: s7;CRITICAL;notify-service-by-irc;CRITICAL slave_sql_lag Replication lag: 613.54 seconds

The process has been running since Jun 14:

icinga   41052 81.1  0.0 166300 65252 ?        SNsl Jun14 9332:54 /usr/sbin/icinga -d /etc/icinga/icinga.cfg

The only related thing I've found up to now is http://tracker.nagios.org/view.php?id=338, but the patches in that bug do not apply to icinga 1.11.6 at all. The code seems to have diverged since then (pretty much expected). It also looks unrelated: the scheduled downtime was handled fine in our case, and in fact even worked fine for hours; it just was lost sometime later on.

For the record, this is what I ran to downtime the hosts above:

icinga-downtime -d 604800 -h $i -r 'T166208'

(I just downtimed them again)

I normally do not like to throw out random ideas, but I installed db2072 a few hours ago; could that have caused some kind of undesired restart of icinga (through salt, or puppet)?

Nope, no restart. icinga has been running since Jun 14. Reloads, I am pretty sure, happen all the time, so if a reload is involved it's just one of the factors.

AFAIK, icinga seems to be using /var/lib/icinga/retention.dat for persistence. There is no mention of db2054 there, except for the latest downtimes:

servicecomment {
host_name=db2054
service_description=salt-minion processes
entry_type=2
comment_id=2901219
source=0
persistent=0
entry_time=1498136451
expires=0
expire_time=0
author=marvin-bot
comment_data=This service has been scheduled for fixed downtime from 2017-06-22 13:00:51 to 2017-06-29 13:00:51.  Notifications for the service will not be sent out during that time period.
}
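
(For context: in the retention.dat format, the downtime entries themselves are, as far as I know, stored as separate hostdowntime { ... } / servicedowntime { ... } blocks; the block shown above is only the auto-generated comment that accompanies the downtime.)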

The only thing I can think of is to enable the debug log; what do you think?

The only thing I can think of is to enable the debug log; what do you think?

I am a bit wary because of the performance penalties this might introduce, but we can give it a try first on the passive host. I'd say let's start with a value of 512 for debug_level [1]?

[1] https://docs.icinga.com/latest/en/configmain.html#configmain-debug_file

512 sounds reasonable and seems the only pertinent one in that list, +1!

OK, I've disabled puppet on tegmen and set debug_level to 512. Let's wait a bit, see how this goes, and then puppetize the change.
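
For reference, the change presumably amounts to something like the following in icinga.cfg (directive names as per the Icinga 1.x documentation linked above; the debug_file path here is only an illustrative guess):

debug_level=512
debug_file=/var/log/icinga/icinga.debug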

Up to now the icinga.debug file is 139k, so I think it's safe to enable it on the active icinga server as well.

Change 361450 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Set debug_level on icinga

https://gerrit.wikimedia.org/r/361450

Change 361450 merged by Alexandros Kosiaris:
[operations/puppet@production] Set debug_level on icinga

https://gerrit.wikimedia.org/r/361450

Seems like this happened again on Fri Jul 7 (db1102, mysqld process); icinga lost the downtime. I'll search the debug logs more and post findings here.

jcrespo raised the priority of this task from High to Unbreak Now!. Jul 11 2017, 12:45 PM

Please challenge my assessment of this as an Unbreak Now!: pages are being sent randomly at any hour, making people ignore pages and potentially creating a larger outage in the future.

If this wasn't clear: this happened again at 12:41 UTC today.

Tracked dates: 1 May, 2 May, 26 May, 6 Jun, 22 Jun, 7 Jul, 11 Jul

Those are only the dates on which this was identified; a lost downtime does not always mean an alert, and an alert isn't always recognized as being caused by this.

Another one experienced today: 12 Jul
db1066

These are the downtimes logs:

[1499749949] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;db1066;MariaDB Slave IO: s1;1499749931;1499929931;1;0;7200;Marostegui;T166204
[1499749949] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;db1066;MariaDB Slave Lag: s1;1499749931;1499929931;1;0;7200;Marostegui;T166204
[1499749949] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;db1066;MariaDB Slave SQL: s1;1499749931;1499929931;1;0;7200;Marostegui;T166204
root@einsteinium:~# date -d @1499749931
Tue Jul 11 05:12:11 UTC 2017
root@einsteinium:~# date -d @1499929931
Thu Jul 13 07:12:11 UTC 2017

Another data point: alerts starting at 2017-07-12T12:44 for labtestpuppetmaster2001, which @Andrew had marked as downtimed until 2019.

And happened again

19:55 < icinga-wm> PROBLEM - mysqld processes on db1102 is CRITICAL: PROCS CRITICAL: 3 processes with command name mysqld
19:58 < icinga-wm> PROBLEM - MariaDB Slave Lag: s7 on db1041 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.60 seconds
akosiaris renamed this task from "Icinga randomly forgets downtimes, causing alert and page spam" to "Icinga loses downtime entries, causing alert and page spam". Jul 13 2017, 8:15 AM

I will log today's occurrence, just to keep track of them in case they are useful when checking the logs.
db1053 just paged:

˜/icinga-wm 10:41> PROBLEM - MariaDB Slave Lag: s4 on db1053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2484.24 seconds

And these are the logs from the downtime

[Thu Jul 13 07:44:35 2017] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;db1053;MariaDB Slave Lag: s4;1499931859;1500284659;1;0;7200;Marostegui;T168661
[Thu Jul 13 07:44:35 2017] SERVICE DOWNTIME ALERT: db1053;MariaDB Slave Lag: s4;STARTED; Service has entered a period of scheduled downtime
root@einsteinium:/var/log/icinga# date -d @1499931859
Thu Jul 13 07:44:19 UTC 2017
root@einsteinium:/var/log/icinga# date -d @1500284659
Mon Jul 17 09:44:19 UTC 2017

Change 364991 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] icinga: Bump icinga-tmpfs size to 1024m

https://gerrit.wikimedia.org/r/364991

Change 364991 merged by Alexandros Kosiaris:
[operations/puppet@production] icinga: Bump icinga-tmpfs size to 1024m

https://gerrit.wikimedia.org/r/364991

Volans lowered the priority of this task from Unbreak Now! to High. Jul 13 2017, 1:19 PM

So after a lot of digging and live debugging between @akosiaris and me, we *think* we have an explanation and have fixed the issue.

The TL;DR is that the tmpfs used by Icinga for some operations, /var/icinga-tmpfs, was too small and was getting full under specific circumstances. It has already been increased and, if all the reasoning behind it is correct, this should have solved the issue. (The use of the conditional is mandatory after so much trouble.)

During the debugging process a full restart of icinga was also tried; that unfortunately caused the loss of the previous state, a shower of notifications on IRC, and the loss of downtimes and acknowledgements. It also helped us find the root cause, hence definitely worth it! Thanks @akosiaris ;)

To make it short: after reading a lot of Icinga code and many unsuccessful attempts to reproduce it on the host, this is what was happening.

According to our configuration, the files that Icinga saves into the tmpfs directory are:

  • status.dat (~47MB)
  • temporary copy of status.dat (~47MB, we save the file every 10s)
  • check result files (5~10k files of ~500 bytes each)
  • temporary copy of the retention.dat file written when reloading/restarting (~47MB)

The existing tmpfs partition was 128MB... I leave the math to the reader ;) If all the writes lined up badly in time, the retention.dat file might not be written correctly.
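Doing that math with the numbers above: ~47MB for status.dat, ~47MB for its temporary copy, and ~47MB for the retention.dat copy is already ~141MB, over the 128MB limit even before counting the few MB of check result files.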
But why was there no error in the Icinga logs?

Well...
In xdata/xrddefault.c, xrddefault_save_state_information() creates the retention file by opening a temporary file with xrddefault_temp_file as prefix; that is defined by DEFAULT_TEMP_FILE, which in turn should come from the configuration parameter temp_file=/var/icinga-tmpfs/icinga.tmp.
It then writes all the content to the file with fprintf(fp, ...) without checking its return value, which makes it blindly ignore any write error caused by no space left on the device.
At the end of the writing, this is also executed:

fflush(fp);
fsync(fd);
result = fclose(fp);

An error is logged only if that result is different from zero.
So why were the downtimes lost?
Simply because they are the last entries written to the file, after the comments and the contact state information.

Also, when reading the file, the parser proceeds line by line; there is no end-of-file marker to check for, and it seems from the logs that a partially read block gets discarded silently. (I didn't dig deeper into this either.)
I'll probably open some bugs with Icinga upstream after checking whether the same applies to the current master's HEAD too.
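
To make the failure mode concrete, here is a minimal illustrative sketch (not the actual Icinga source; the function name and the fields written are made up) of the write pattern described above:

#include <stdio.h>
#include <unistd.h>

/* Sketch only: every write goes through fprintf(), whose return value is
 * ignored, so ENOSPC on a full tmpfs is silently swallowed; only fclose()
 * is checked, and it can still return 0, so nothing is ever logged. */
int save_state_sketch(const char *tmp_path) {
    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL)
        return -1;

    /* Imagine thousands of these calls; on a full filesystem they start
     * failing part-way through, truncating the file, and the downtime
     * blocks written last are the first to go missing. */
    fprintf(fp, "servicedowntime {\n");
    fprintf(fp, "host_name=db2054\n");
    fprintf(fp, "}\n");

    fflush(fp);              /* return value ignored */
    fsync(fileno(fp));       /* likewise unchecked */
    if (fclose(fp) != 0)     /* the only failure that would be logged */
        return -1;
    return 0;
}

Checking the return value of each fprintf()/fflush() call (or at least checking ferror(fp) before moving the temporary file into place) would surface the ENOSPC error and abort the save instead of silently truncating retention.dat.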

Nice troubleshooting @Volans and @akosiaris! Let's hope this is the root cause and we can move on :-)

That is a very satisfying explanation, and it would fit 3 things I suspected:

  • That it happened on both servers (so it was not a spurious file corruption)
  • That it was load-related as in "too many things to do" (which would explain why it happens randomly, but more often lately)
  • It could maybe create phantom results at times (maybe?)

Do we need some monitoring until a proper code fix is eventually applied, even if it is just a cron sending an email?

Thank you, thank you very much

As a comment, the "phantom results" issue still happens, but it was not part of the scope of this, nor is it that impactful. By phantom results I mean that, for a few seconds, alerts are shown on the webpage/API as unhandled when in reality they are in downtime. On refresh, they are shown as downtimed again.

Volans claimed this task.

After a few days without incident it seems that we can call it resolved! \o/

Change 356021 abandoned by Alexandros Kosiaris:
Timestamp puppet-run logs

Reason:
Abandoning in favor of https://gerrit.wikimedia.org/r/#/c/425538/.

https://gerrit.wikimedia.org/r/356021