Review Icinga alarms with disabled notifications
Open, Medium, Public
Assigned To

None

Authored By

	Volans
	Oct 31 2016, 11:22 PM

Description

We have 231 services and 10 hosts with disabled notifications on Icinga as of now.

I'm also adding DBA because quite a few of them are DB-related, but not only.
In most cases the host itself and other checks have notifications enabled; just a few checks have them disabled, without any related message that explains why.

Please review the list of them available here and re-enable notifications when possible, or add a permanent message that explains why they are disabled. It helps when debugging issues/pages.

Related Objects

Mentioned In: T252002: How to handle Icinga disabled notifications?
Mentioned Here: T147309: Decommission db1019

Event Timeline

Volans created this task. Oct 31 2016, 11:22 PM

Restricted Application added a subscriber: Aklapper. Oct 31 2016, 11:22 PM

Volans updated the task description. Oct 31 2016, 11:22 PM

@Volans: Can you send a mail to the ops@ list? Otherwise people will miss the task.

@MoritzMuehlenhoff: done 😉

A number of hosts have been decommissioned. As per new instructions, they are not removed from icinga (puppet) nor stopped, and role::spare doesn't do that either.

Leaving the host on, but with role::spare, will allow it to receive security updates.

Disabling notifications is the only way to not alert.

In T149643#2762562, @jcrespo wrote:

Disabling notifications is the only way to not alert.

@jcrespo: I know, and there are also other reasons why it makes sense to disable notifications, that's why I added:

or add a permanent message that explains why it is disabled

I think we should not have notifications disabled without any comment on Icinga that links it to a Phab task or explains why.
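The check asked for here could be scripted. What follows is only a minimal sketch, assuming the block layout of a Nagios/Icinga 1.x style status.dat (`servicestatus` and `servicecomment` blocks of `key=value` lines). The sample data is made up for illustration (the host names come from this thread, the states do not), and `uncommented_disabled_services` is a hypothetical helper name, not an existing tool.

```python
# Hypothetical sample in the Nagios/Icinga 1.x status.dat style.
# Host names are borrowed from this thread; states and comments are invented.
SAMPLE_STATUS = """
servicestatus {
    host_name=db1042
    service_description=MariaDB Slave Lag
    notifications_enabled=0
    }

servicestatus {
    host_name=db1019
    service_description=MariaDB Slave Lag
    notifications_enabled=0
    }

servicestatus {
    host_name=db1065
    service_description=MariaDB Slave Lag
    notifications_enabled=1
    }

servicecomment {
    host_name=db1019
    service_description=MariaDB Slave Lag
    author=Marostegui
    comment_data=This server is going to be decommissioned - T146265
    }
"""


def parse_blocks(text):
    """Yield (block_type, fields) for every 'type { key=value ... }' block."""
    block_type, fields = None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            block_type, fields = line[:-1].strip(), {}
        elif line == "}" and block_type is not None:
            yield block_type, fields
            block_type = None
        elif block_type is not None and "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


def uncommented_disabled_services(text):
    """Return (host, service) pairs with notifications off and no comment."""
    disabled, commented = [], set()
    for block_type, fields in parse_blocks(text):
        key = (fields.get("host_name"), fields.get("service_description"))
        if block_type == "servicestatus" and fields.get("notifications_enabled") == "0":
            disabled.append(key)
        elif block_type == "servicecomment":
            commented.add(key)
    return [key for key in disabled if key not in commented]


if __name__ == "__main__":
    for host, service in uncommented_disabled_services(SAMPLE_STATUS):
        print(f"{host}: {service} has notifications disabled and no comment")
```

Host-level comments and scheduled downtimes are ignored in this sketch; a real report would probably want to treat those as explanations too.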

But they have it, see:

2016-09-21 09:01:06	Marostegui	This server is going to be decommissioned - T146265

This is for db1019. It is just not shown on the link you shared as a pop-up, you have to enter on the host and read the comment there (which is an Icinga usability issue).

In T149643#2762654, @jcrespo wrote:
But they have it, see:
2016-09-21 09:01:06	Marostegui	This server is going to be decommissioned - T146265
This is for db1019. It is just not shown on the link you shared as a pop-up, you have to enter on the host and read the comment there (which is an Icinga usability issue).

And that is: T147309

@jcrespo @Marostegui: db1019 is fine, it has a scheduled downtime with a related comment, and you can see directly from the link I put in the description that it has one.

While for example db1042, db1047, dbstore1002, es2019 and labsdb10[08-11] have no comment, neither on the host nor on the specific services that have notifications disabled. And so was db1065 when I checked it the other day.
Not every host/service in that list needs an action; that is the list of all the disabled ones.

@Volans db1019's scheduled downtime (where the comment is) will eventually expire; the comment I was referring to is not shown on the link.

If you assign role::spare to a server and run puppet on the host and the icinga host, you should remove all alerts there.

If that's not the case, it is a bug in puppet and we should fix it.

@Joe Apparently it removes them; if it didn't, it would show mysql alerts. But it keeps the common ones, which we do not want to show (plus potentially any running process that could be a security threat, like mysql running or its data). I think it removes comments or icinga-only changes on reload? Not sure about that.

Anyway, I would like to shut a host down completely until it has been formatted; *then* it can be added to role::spare and be up.

In T149643#2762717, @jcrespo wrote:

@Volans db1019's scheduled downtime (where the comment is) will eventually expire; the comment I was referring to is not shown on the link.

@jcrespo just for clarification, the comment you quoted on db1019 is exactly the downtime comment, the one you enter when setting a downtime, and it will expire with the downtime on 2018-09-21 11:00:13 (see End Time).
Icinga then adds an automatic host comment of type Scheduled downtime, which will also be cleared at the end of the downtime.

In T149643#2762748, @Volans wrote:

@jcrespo just for clarification, the comment you quoted on db1019 is exactly the downtime comment, the one you enter when setting a downtime, and it will expire with the downtime on 2018-09-21 11:00:13 (see End Time).
Icinga then adds an automatic host comment of type Scheduled downtime, which will also be cleared at the end of the downtime.

So actually, that is a problem too, and the cause of this ticket: when the downtime expires, the comment will be gone. However, if we add a manual comment and then remove the downtime or enable notifications, the comment will be wrong. It is a catch-22. This actually has nothing to do with the ticket and I am not arguing anything, just explaining its origin.

The other part is that when a new host gets reimaged for the first time, it will start paging, even with the new script (because provisioning it plus replication catchup can take days), so disabling notifications is the way to avoid that, and then those hosts get forgotten.

In T149643#2762731, @jcrespo wrote:

Anyway, I would like to shut a host down completely until it has been formatted; *then* it can be added to role::spare and be up.

I very much agree with that. Having the hosts shut down only has advantages IMO (less maintenance overhead, less energy consumption etc.). If the process ensures that these hosts are only reclaimed/decommissioned after wiping, there are no disadvantages I can see. Maybe let's move that discussion to the ops list?

I agree that we should not have disabled notifications _without_ a comment on them, ideally a reference to a ticket every time. But it's ok to have them if they have a comment AND they have been ACKed in Icinga, or alternatively put in a scheduled downtime.

We need this in-between state for some hosts, typically during setup or decom. For example, we often want to keep a server around for a little while "just in case" after a migration, but want to stop services/puppet until shutdown. We just have to avoid forgetting to re-enable notifications once appropriate.

There are also 2 ways to add a comment. When you acknowledge a problem and leave a comment while doing that, it will stay _until the next status change_ and then disappear. Sometimes this is great, if you know that once this service turns OK again it should be back to normal. Sometimes you have a flapping service while working on something though; in that case you should disable notifications and leave a "sticky" comment on it. This type will not disappear by itself, but the price is that you have to remember to remove it.
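For anyone who prefers scripting these two approaches over clicking through the web UI, here is a rough sketch that only builds the command lines. It assumes the Nagios/Icinga 1.x external command syntax (`ACKNOWLEDGE_SVC_PROBLEM`, `DISABLE_SVC_NOTIFICATIONS`, `ADD_SVC_COMMENT`) with the argument order as documented upstream; please double check against the docs for the installed version. The helper names, the `now` parameter and the mapping of "sticky comment" to a persistent `ADD_SVC_COMMENT` are assumptions of this sketch.

```python
import time


def external_command(name, *args, now=None):
    """Format one external command line: '[timestamp] NAME;arg1;arg2;...'."""
    ts = int(time.time() if now is None else now)
    return "[{}] {}".format(ts, ";".join([name, *map(str, args)]))


def ack_with_comment(host, service, author, comment, now=None):
    """Acknowledge a problem; the ack goes away on the next state change.

    Arguments after the service are sticky, notify, persistent.
    sticky=1 is a normal (non-sticky) acknowledgement, 2 would be sticky.
    """
    return external_command(
        "ACKNOWLEDGE_SVC_PROBLEM", host, service, 1, 0, 0, author, comment, now=now
    )


def silence_with_sticky_comment(host, service, author, comment, now=None):
    """Disable notifications and leave a persistent comment explaining why."""
    return [
        external_command("DISABLE_SVC_NOTIFICATIONS", host, service, now=now),
        # persistent=1 so the comment survives restarts until someone removes it
        external_command("ADD_SVC_COMMENT", host, service, 1, author, comment, now=now),
    ]


if __name__ == "__main__":
    # Host, service and author below are placeholders for illustration.
    print(ack_with_comment("db1042", "MariaDB Slave Lag", "example-user", "see T149643"))
    for line in silence_with_sticky_comment(
        "db1042", "MariaDB Slave Lag", "example-user", "flapping during maintenance, see T149643"
    ):
        print(line)
```

The lines would then be written to the Icinga command pipe, whose path depends on the installation. Comments must not contain a semicolon, since that is the field separator.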

Overall, we should aim to always keep the number of unhandled checks to a minimum (the first number when you look at the Icinga web UI and it says something like "6 / 34 / 118 CRITICAL"). The other 2 numbers, 'acknowledged' and 'handled', are ok to have, and when checking Icinga we shouldn't have to worry about them, since they tell us somebody already looked (and linked to some ticket where it makes sense).

I also agree that we should shut down hosts completely while they are not used. Less energy wasted, environmental impact lowered.

Having unused hosts idle for longer than necessary always makes me think of 2 things: either that I want to install the BOINC client to donate spare CPU cycles to a good cause (would you let me? :) (https://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing), or to shut them down early.

When decom'ing hosts I usually do ~ "schedule downtime for a few days, link to decom ticket" -> "stop services" -> "remove from puppet/icinga" -> "confirm bacula backup" -> "wait a week" -> "shutdown -h now the host by myself" -> "remove DNS entries, keep mgmt DNS" -> "create ticket asking dc-ops for disk wipe and to decide if it goes back in spare or not"
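The first step of that sequence (downtime for a few days with a link to the decom ticket) can be sketched as follows. This assumes the Nagios/Icinga 1.x external commands `SCHEDULE_HOST_DOWNTIME` and `SCHEDULE_HOST_SVC_DOWNTIME` with the argument order start, end, fixed, trigger_id, duration, author, comment; `schedule_decom_downtime`, the comment wording and the author value are made up for the example.

```python
import time

SECONDS_PER_DAY = 86400


def schedule_decom_downtime(host, ticket, author, days=3, now=None):
    """Build downtime commands for a host and all its services.

    The comment carries the decom ticket so that anyone looking at the
    host in Icinga can see why it is silent.
    """
    start = int(time.time() if now is None else now)
    end = start + days * SECONDS_PER_DAY
    comment = "Decommissioning - {}".format(ticket)
    lines = []
    for name in ("SCHEDULE_HOST_DOWNTIME", "SCHEDULE_HOST_SVC_DOWNTIME"):
        # fixed=1 (runs exactly from start to end), trigger_id=0 (not triggered)
        fields = [name, host, start, end, 1, 0, end - start, author, comment]
        lines.append("[{}] {}".format(start, ";".join(map(str, fields))))
    return lines


if __name__ == "__main__":
    # "example-user" is a placeholder author name.
    for line in schedule_decom_downtime("db1019", "T147309", "example-user"):
        print(line)
```

Because the downtime comment disappears when the downtime ends, the remaining steps (removing the host from puppet/icinga, shutting it down) need to happen within that window.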

I've done a bit of cleanup, re-enabling some of them that were ok and left over from other maintenance. maps-test* is being worked on by @Gehel for a proper fix.
All the others at this time are in scheduled downtime or have a comment.

re-enabled notifications on some install1001/2001 services

Volans triaged this task as Medium priority. Nov 23 2016, 9:02 AM

Gehel unsubscribed. Jun 20 2017, 1:33 PM

I have reviewed and added the ones for the DBs that could already be enabled back. As soon as puppet starts running they should be picked up.
Thanks for the report!

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. Jan 26 2019, 8:54 PM

fgiunchedi added a project: observability. May 5 2020, 3:03 PM

ayounsi mentioned this in T252002: How to handle Icinga disabled notifications?. May 6 2020, 11:09 AM

fgiunchedi moved this task from Inbox to Backlog on the observability board. Jul 6 2020, 11:33 AM

lmata edited projects, added SRE Observability; removed observability. Jul 12 2021, 2:22 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. Jul 15 2021, 4:09 AM

lmata edited projects, added Observability-Alerting; removed SRE Observability. Aug 9 2021, 3:33 AM

lmata moved this task from Inbox to Backlog on the Observability-Alerting board. Aug 10 2021, 3:18 PM