
Notification disablement via Puppet not working on Icinga
Closed, Resolved · Public

Description

Due to db1078's crash (T209754) I had to disable notifications for that host.
I did it via puppet: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474447/

I ran puppet on db1078 and after that I ran puppet on icinga1001:

Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for icinga1001.wikimedia.org
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1542437455'
Notice: /Stage[main]/Icinga/File[/var/lib/icinga/retention.dat]/group: group changed 'nagios' to 'www-data'
Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_hosts.cfg]/content:
--- /etc/icinga/objects/puppet_hosts.cfg	2018-11-16 21:45:49.741852752 +0000
+++ /tmp/puppet-file20181117-48036-zhxmnc	2018-11-17 06:52:09.977243985 +0000
@@ -8377,7 +8377,7 @@
 	notification_interval          0
 	notification_options           d,u,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	parents                        asw2-c-eqiad
 	statusmap_image                vendors/debian.gd2


Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_hosts.cfg]/content: content changed '{md5}e0cf8894edd8e2d2374ff536fc2d255d' to '{md5}457fc8ebdad17df6e9308dd06bd5e82e'
Info: /Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_hosts.cfg]: Scheduling refresh of Service[icinga]
Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_services.cfg]/content:
--- /etc/icinga/objects/puppet_services.cfg	2018-11-16 21:45:50.173853762 +0000
+++ /tmp/puppet-file20181117-48036-1283is9	2018-11-17 06:52:10.237244572 +0000
@@ -159730,7 +159730,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            dhclient process
@@ -159751,7 +159751,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            configured eth
@@ -159772,7 +159772,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 10
 	service_description            IPMI Sensor Status
@@ -159793,7 +159793,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            Check systemd state
@@ -159814,7 +159814,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            Check size of conntrack table
@@ -159835,7 +159835,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            Disk space
@@ -159877,7 +159877,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            DPKG
@@ -159899,7 +159899,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 5
 	service_description            Memory correctable errors -EDAC-
@@ -159920,7 +159920,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            Check whether ferm is active by checking the default input chain
@@ -159942,7 +159942,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 5
 	service_description            Filesystem available is greater than filesystem size
@@ -159963,7 +159963,7 @@
 	notification_interval          240
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            MariaDB disk space
@@ -159984,7 +159984,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            MariaDB read only s3
@@ -160005,7 +160005,7 @@
 	notification_interval          240
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            MariaDB Slave IO: s3
@@ -160026,7 +160026,7 @@
 	notification_interval          240
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            MariaDB Slave Lag: s3
@@ -160047,7 +160047,7 @@
 	notification_interval          240
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            MariaDB Slave SQL: s3
@@ -160068,7 +160068,7 @@
 	notification_interval          240
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            mysqld processes
@@ -160089,7 +160089,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            puppet last run
@@ -160111,7 +160111,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 10
 	service_description            HP RAID
@@ -160133,7 +160133,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 5
 	service_description            Device not healthy -SMART-
@@ -160154,7 +160154,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            SSH
@@ -160196,7 +160196,7 @@
 	notification_interval          0
 	notification_options           c,r,f
 	notification_period            24x7
-	notifications_enabled          1
+	notifications_enabled          0
 	passive_checks_enabled         1
 	retry_interval                 1
 	service_description            Check the NTP synchronisation status of timesyncd

Notice: /Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_services.cfg]/content: content changed '{md5}9a3a82240e9d4e60cee10a1deb89a6e6' to '{md5}ab420d7c1a60aa1966f2ab12d2e9b403'
Info: /Stage[main]/Icinga::Naggen/File[/etc/icinga/objects/puppet_services.cfg]: Scheduling refresh of Service[icinga]
Notice: /Stage[main]/Icinga::Web/Letsencrypt::Cert::Integrated[icinga]/Exec[acme-setup-acme-icinga]/returns: executed successfully
Notice: /Stage[main]/Icinga/Systemd::Service[icinga]/Service[icinga]: Triggered 'refresh' from 2 events
Notice: Applied catalog in 65.78 seconds
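
Before checking the UI, the rendered value can be verified directly on disk. A minimal sketch for the host object (the define-block layout is assumed from the diff above; the same approach works on puppet_services.cfg):

$ sudo awk '/^define host/ { h="" }
            /^[ \t]*host_name[ \t]/ { h=$2 }
            h=="db1078" && /^[ \t]*notifications_enabled[ \t]/ { print $1, $2 }
           ' /etc/icinga/objects/puppet_hosts.cfg
notifications_enabled 0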

However, on Icinga's UI I only saw _some_ services being disabled.
This is the IRC conversation that happened after I asked about it:

06:56 < marostegui> shdubsh: I pushed: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474447/ but I am only seeing a few services disabled, not all of them, is that a known icinga issue?
06:58 < volans|off> marostegui: did puppet run first on db1078 and then on icinga? I'm checking anyway
06:59 < marostegui> I did manually first on db1078 and after that on icinga                                                                                                                                                     
06:59 < volans|off> which icinga?                                                                                                                                                                                               
06:59 < volans|off> icinga1001, not einsteinium                                                                                                                                                                                 
06:59 < marostegui> volans|off: 1001                                                                                                                                                                                            
06:59 < volans|off> interesting... checking
07:00 < volans|off> I can see some of them have a local modification, let me reset all the local states for all services on that host                                                                                           
07:03 < shdubsh> marostegui: not a known issue, no. That code path should be intact as we tried to avoid touching base as much as possible                                                                                      
07:03 < volans|off> the notifications_enabled seems to be 0 in the config as expected                                                                                                                                            
07:04 < marostegui> volans|off: yeah, I saw that on puppet
07:06 < volans|off> and /var/cache/icinga/objects.cache too have the correct value (0)                                                                                                                                          
07:07 < volans|off> shdubsh: btw that file was -rw-r--r--  1 nagios nagios, after your patch is -rw-r--r-- 1 nagios www-data                                                                                                    
07:07 < volans|off> [totally unrelated, just noticed]                                                                                                                                                                           
07:10 < shdubsh> That's expected.  The permissions were copied from a clean install                                                                                                                                             
07:10 < volans|off> shdubsh: both status.dat and retention.dat have notifications_enabled=1
07:13 < volans|off> sorry I correct myself
07:13 < volans|off> retention.dat has notifications_enabled=1 also for those that appear disabled in the UI
07:13 < volans|off> status.dat reflects what's on the UI

Event Timeline

Restricted Application added a subscriber: Aklapper. Nov 17 2018, 7:12 AM

When I started to run Puppet on the new Parsercache hosts the other day I disabled notifications in the same way, but the hosts were still reporting errors.

Correct me if I am wrong, but I believe that error had nothing to do with Icinga (when I checked, it was the pt-heartbeat error).

Volans updated the task description. Nov 17 2018, 7:17 AM

There was indeed an error with pt-heartbeat, but that error was reported to IRC, which shouldn't have happened if disabling notifications worked.
(Or maybe I'm missing some key part here.)

For the record, I have downtimed db1078 (without touching notifications any further, to avoid messing with any investigation).

Volans triaged this task as High priority. Nov 17 2018, 8:41 AM
Volans added a subscriber: faidon.

Things I've found so far; some may be unrelated, but they still need a fix anyway.

Permissions

It seems that https://gerrit.wikimedia.org/r/c/operations/puppet/+/473789/4/modules/icinga/manifests/init.pp doesn't set the default permissions, given what we're seeing more or less at each Puppet run (I guess each time after an Icinga restart):

Notice: /Stage[main]/Icinga/File[/var/lib/icinga/retention.dat]/group: group changed 'nagios' to 'www-data'

That's why I was, and still am, against that patch: we shouldn't manage those permissions in Puppet IMHO.

Init files

Those two init files seem to still have the old jessie paths and are not compatible with stretch:

  • modules/icinga/files/icinga-init.sh
  • modules/icinga/files/default_icinga.sh

This seems, at least, to cause the failure of modules/icinga/files/purge-nagios-resources.py:

Nov 17 06:52:54 icinga1001 systemd[1]: Starting LSB: icinga host/service/network monitoring and management system...
Nov 17 06:52:54 icinga1001 icinga[123849]: Traceback (most recent call last):
Nov 17 06:52:54 icinga1001 icinga[123849]:   File "/usr/local/sbin/purge-nagios-resources.py", line 62, in <module>
Nov 17 06:52:54 icinga1001 icinga[123849]:     hosts = readHostsFile(sys.argv[1])
Nov 17 06:52:54 icinga1001 icinga[123849]:   File "/usr/local/sbin/purge-nagios-resources.py", line 17, in readHostsFile
Nov 17 06:52:54 icinga1001 icinga[123849]:     for line in file(path, 'r'):
Nov 17 06:52:54 icinga1001 icinga[123849]: IOError: [Errno 2] No such file or directory: '/etc/icinga/puppet_hosts.cfg'
Nov 17 06:52:59 icinga1001 icinga[123849]: Starting icinga monitoring daemon: icinga.

The notifications_enabled attribute

From a quick ballpark check, these are the counts of notifications_enabled set to 0 in the various files (sum puppet_hosts.cfg and puppet_services.cfg to compare them with the others):

$ sudo egrep -c 'notifications_enabled(=|\s+)0' /etc/icinga/objects/puppet* /var/icinga-tmpfs/status.dat /var/cache/icinga/objects.cache /var/lib/icinga/retention.dat
/etc/icinga/objects/puppet_hosts.cfg:80
/etc/icinga/objects/puppet_services.cfg:1387  # +80 above = 1467
/var/icinga-tmpfs/status.dat:1516
/var/cache/icinga/objects.cache:1467
/var/lib/icinga/retention.dat:1516  # In a previous grep it was returning 1511 though, don't know why

To the best of my knowledge, status.dat is the one that affects the UI. The higher number on the generated files is expected.
I've ballpark-compared status.dat and retention.dat, looking only at the service checks and the notifications_enabled parameter, with:

$ egrep -A59 '^service {' /var/lib/icinga/retention.dat | egrep '(^service|host_name|service_description|notifications_enabled)' | sed 's/^\s\+//' > services.retention.extract.txt
$ egrep -A59 '^servicestatus' /var/icinga-tmpfs/status.dat | egrep '(servicestatus|host_name|service_description|notifications_enabled)' | sed 's/^\s\+//' | sed 's/^servicestatus {/service {/' > services.status.extract.txt
$ diff services.retention.extract.txt services.status.extract.txt
$ # no diff, files are identical

And there is no difference there. The diff against the Puppet-generated files is a bit more complex due to the different ordering, but it should be investigated to see what's changing there; a sketch for that follows.
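
One way to take the ordering out of the picture is to normalize both files into sorted "host|service|notifications_enabled" tuples and diff those. A minimal sketch, assuming the block and field layouts shown in the excerpts above:

# compare effective UI state (status.dat) with the Puppet-generated config
norm_cfg() {    # puppet_services.cfg: whitespace-separated "key value" pairs
  awk '/^define service/ { inblk=1; hn=""; sd=""; ne=""; next }
       inblk && /^[ \t]*host_name[ \t]/ { hn=$2 }
       inblk && /^[ \t]*service_description[ \t]/ { sub(/^[ \t]*service_description[ \t]+/, ""); sd=$0 }
       inblk && /^[ \t]*notifications_enabled[ \t]/ { ne=$2 }
       inblk && /^[ \t]*[}]/ { print hn "|" sd "|" ne; inblk=0 }' "$1" | sort
}
norm_status() { # status.dat: "key=value" pairs inside servicestatus blocks
  awk -F= '/^servicestatus [{]/ { inblk=1; hn=""; sd=""; ne=""; next }
           inblk && /^[ \t]*host_name=/ { hn=$2 }
           inblk && /^[ \t]*service_description=/ { sd=$2 }
           inblk && /^[ \t]*notifications_enabled=/ { ne=$2 }
           inblk && /^[ \t]*[}]/ { print hn "|" sd "|" ne; inblk=0 }' "$1" | sort
}
diff <(norm_cfg /etc/icinga/objects/puppet_services.cfg) \
     <(norm_status /var/icinga-tmpfs/status.dat)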

I guess the investigation can continue tomorrow in normal business hours, but feel free to ping/page me if needed.

jijiki added a subscriber: jijiki. Nov 17 2018, 9:41 AM
Dzahn added a comment. Nov 17 2018, 2:26 PM

Init files

Those two init files seem to still have the old jessie paths and are not compatible with stretch:

  • modules/icinga/files/icinga-init.sh

This one is not installed; it's inside "if os_version('debian == jessie')", and a pending patch to remove jessie support will remove the code entirely. (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/473276/)

  • modules/icinga/files/default_icinga.sh

This one only carries the diff between the upstream init file and our config: ICINGACFG="/etc/icinga/icinga.cfg", CGICFG="/etc/icinga/cgi.cfg", NICENESS=0; but it also calls purge-nagios-resources.py, correct.
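
For reference, that diff amounts to roughly the following (a sketch based on the description above, not a verbatim copy of the file):

# modules/icinga/files/default_icinga.sh, as described; sketch only
ICINGACFG="/etc/icinga/icinga.cfg"
CGICFG="/etc/icinga/cgi.cfg"
NICENESS=0
# ...plus the invocation of /usr/local/sbin/purge-nagios-resources.py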

This seems, at least, to cause the failure of modules/icinga/files/purge-nagios-resources.py:

Yes, the path /etc/icinga/ needs to be /etc/icinga/objects/ for puppet_hosts.cfg and puppet_services.cfg. Patch incoming!
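
Illustratively, the needed change is of this shape (the exact invocation in default_icinga.sh is an assumption; the traceback only shows that argv[1] is the hosts file, and the real diff is in the change below):

-/usr/local/sbin/purge-nagios-resources.py /etc/icinga/puppet_hosts.cfg /etc/icinga/puppet_services.cfg
+/usr/local/sbin/purge-nagios-resources.py /etc/icinga/objects/puppet_hosts.cfg /etc/icinga/objects/puppet_services.cfg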

Change 474463 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] fix path to puppet_hosts/services in default_icinga.sh

https://gerrit.wikimedia.org/r/474463

Change 474464 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] icinga: do not manage retention.dat in puppet

https://gerrit.wikimedia.org/r/474464

Change 474463 merged by Dzahn:
[operations/puppet@production] icinga: fix path to puppet_hosts/services in default_icinga.sh

https://gerrit.wikimedia.org/r/474463

Change 474464 merged by Dzahn:
[operations/puppet@production] icinga: do not manage retention.dat in puppet

https://gerrit.wikimedia.org/r/474464

Dzahn added a comment (edited). Nov 17 2018, 2:57 PM

Notice: /Stage[main]/Icinga/File[/var/lib/icinga/retention.dat]/group: group changed 'nagios' to 'www-data'
That's why I was, and still am, against that patch: we shouldn't manage those permissions in Puppet IMHO.

I removed retention.dat management from puppet. https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474464/

Dzahn added a comment. Nov 17 2018, 3:17 PM

To summarize:

  • permissions on retention.dat: They are now:

    56M -rw-r--r-- 1 nagios nagios 56M Nov 17 14:53 retention.dat

and they are not managed in Puppet anymore. (https://gerrit.wikimedia.org/r/474464)


I then tested disabling notifications on another host in the same way it was done for the db host, using ununpentium (RT):

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474465/

I ran Puppet on the host and on icinga1001, and it worked as expected. All services showed "notifications disabled" in the web UI.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ununpentium

And db1078 picked it up too and has all notifications disabled.
I guess this is fixed then, or is there any other follow-up needed?

Dzahn added a comment. Nov 17 2018, 3:33 PM

Also, db1078 specifically is now fixed without further steps. Before the changes above only some services had notifications disabled in the web UI, but now (after the last Icinga restart, triggered by my test) ALL services have notifications disabled in the web UI.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=db1078

Therefore I think I can claim it as resolved.


P.S. Of course this had to be fixed, but let me say, unrelatedly, that I think we should normally not use this method (disable notifications); "schedule downtime" would have been the better option in the first place. It has the same desired effect (no more notifications) but with additional benefits: the host is not shown in the "unhandled" column of the web UI, so it is directly obvious to other users that this is handled/known; it does not require a restart of Icinga and a manual Puppet run; and it automatically expires after a while, whereas disabled notifications are easily forgotten and can stay disabled without meaning to.
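
For an existing host this can even be scripted without touching Puppet, through Icinga's external command pipe. A minimal sketch, assuming the Debian-default command-file path (/var/lib/icinga/rw/icinga.cmd) and a free-form author field:

# schedule 4 hours of fixed downtime for db1078 and all of its services
now=$(date +%s); dur=$((4 * 3600))
for cmd in SCHEDULE_HOST_DOWNTIME SCHEDULE_HOST_SVC_DOWNTIME; do
  printf '[%d] %s;db1078;%d;%d;1;0;%d;ops;db1078 crash T209754\n' \
    "$now" "$cmd" "$now" "$((now + dur))" "$dur"
done > /var/lib/icinga/rw/icinga.cmd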

Dzahn closed this task as Resolved. Nov 17 2018, 3:41 PM
Dzahn claimed this task.

And db1078 picked it up too and has all notifications disabled.
I guess this is fixed then, or is there any other follow-up needed?

Yes, it should be fixed. I just wanted to make sure I replied to all the things Volans listed above. Which I did for 2 out of 3.

The last one was comparing the numbers in these files:

sudo egrep -c 'notifications_enabled(=|\s+)0' /etc/icinga/objects/puppet* /var/icinga-tmpfs/status.dat /var/cache/icinga/objects.cache /var/lib/icinga/retention.dat 
/etc/icinga/objects/puppet_hosts.cfg:81
/etc/icinga/objects/puppet_services.cfg:1400
/var/icinga-tmpfs/status.dat:1547
/var/cache/icinga/objects.cache:1481
/var/lib/icinga/retention.dat:1516

objects.cache is the sum of puppet_hosts and puppet_services (81 + 1400 = 1481)

and, quote ". The higher number on the generated files is expected"

So yeah, I'm calling this resolved. Please reopen it if you think otherwise.

Dzahn added a comment. Nov 17 2018, 3:51 PM

I reverted my test patch on ununpentium and all notifications are enabled again, while db1078 still has all of them disabled, so that worked too. It just causes an Icinga restart each time, during which the web UI isn't available for a couple of seconds.

I think we should normally not use this method (disable notifications); "schedule downtime" would have been the better option in the first place

Please tell me how to downtime new hosts and services that are not yet added to Icinga (e.g. to downtime a new host being set up). When new databases are being set up they will page by default, as there is a long time between running Puppet and finishing productionizing them (mostly due to data provisioning). Also, downtiming a host frequently fails until Icinga is restarted. And this method is controlled by Puppet, so it cannot be forgotten, because it is in the code (vs. doing it in the web UI); e.g. it is done when a host is set up as a spare or as a test mysql host. I don't think that clicking buttons rather than writing code is cleaner.

@Volans I reported this very issue, firmly believing it was a bug in our Icinga installation, and it was discarded very dismissively. I make mistakes and I am not always right, but when I claim there is a bug I normally don't do it lightly, especially because this is not the first nor the second time such a bad state of the alerting system has happened to us. And given we are such heavy users of it, we notice every small quirk quickly.

Dzahn added a comment. Nov 19 2018, 4:50 PM

I think we should normally not use this method (disable notifications); "schedule downtime" would have been the better option in the first place

Please tell me how to downtime new hosts and services that are not yet added to Icinga

Yep, you are right, it makes sense for newly added hosts. My comment only applies in the context of existing hosts that exhibit a failure.

Volans added a comment (edited). Nov 19 2018, 8:18 PM

@Volans I reported this very issue, firmly believing it was a bug in our Icinga installation, and it was discarded very dismissively. I make mistakes and I am not always right, but when I claim there is a bug I normally don't do it lightly, especially because this is not the first nor the second time such a bad state of the alerting system has happened to us. And given we are such heavy users of it, we notice every small quirk quickly.

@jcrespo
Please correct me if I'm wrong, but IIRC all the other times we had these kinds of issues they were related to the current state of Icinga (downtimes, disabled notifications overridden from the UI, etc.) and not to an on-disk configuration that was not reflected in the UI. But we had many of those, so I might be misremembering.
I think we've always tried to dig into Icinga issues when there was some evidence, either in the UI or the logs, or a way to reproduce it. Unfortunately, if a manual action was taken more than a week earlier, we sometimes didn't have the logs and so couldn't check the chain of events.
But if I gave a different impression I'm sorry; it was not intentional. There might also be some misunderstanding due to different meanings. For example, for me an 'Icinga bug' is a bug in the Icinga code base that requires a patch to be fixed; a misconfiguration, wrong permissions on a file, or a memory partition that is too small (just to mention real issues we had) are just operational/configuration issues, and I wouldn't call those 'Icinga bugs'.

@Dzahn @colewhite
As a side note, what struck me with this issue is that I find it very strange that the UI was showing notifications disabled for only some of the checks on that host, even though the on-disk configuration correctly had all of them disabled. Do we have an explanation for this behaviour? Otherwise I'm not sure we can say it's solved.

@Volans I do not think we have a good explanation for that behavior. It's deprecated software, and we are transferring state files between hosts and Icinga versions. It's not surprising to me that we run into some weird problems that we must hunt down.

We believe the fixes that have been implemented here should resolve this issue, but if it helps we could reopen it.