
How to handle Icinga disabled notifications?
Open, Medium, Public

Description

Trying to gather the concerns I hear about Icinga disabled notifications.

  1. They clutter the active alert page
  2. For production hosts, they are being forgotten, causing legitimate alerts to go unnoticed (see T221282 and T149643)
  3. For hosts with notifications disabled permanently, they use Icinga's already limited resources
  #1 has been solved with https://gerrit.wikimedia.org/r/c/operations/puppet/+/594441
  #2, which seems to be the most common issue, can be solved with:
  #3, which seems to be the most polemic issue, can be solved with:
    • Dedicated monitoring (other than prod Icinga)
    • Bigger server?

But this would require more investigation into the exact use case, scale and impact. Maybe solving #2 would be enough.

Event Timeline

Policy to not disable notification on new server install, but instead follow Icinga#Avoid_Icinga_spam_on_new_server_installs

I'm not sure about the references in this link; it doesn't seem to refer to new installs at all.

New installs are covered by the reimage script, within the downtime set at the start, which is 2 hours IIRC. If a new install still has alerting checks after 2h because it needs some setup that requires more time (a data import, for example), ideally only those specific checks should be downtimed for a longer period to prevent spam.
The other issue might be a limited race condition when multiple reimages are run at the same time (due to the slowness of puppet runs on the Icinga host).
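
To illustrate downtiming only specific checks for a longer period, here is a minimal sketch that builds a `SCHEDULE_SVC_DOWNTIME` external command line in the Nagios/Icinga 1 format. The host, service, author and comment values are hypothetical, and this is not how the reimage script itself sets downtime; it only shows the shape of a per-service downtime.

```python
import time


def svc_downtime_command(host, service, duration_s, author, comment, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME external command line.

    Field order follows the Nagios/Icinga 1 external command format:
    host;service;start;end;fixed;trigger_id;duration;author;comment
    """
    now = int(time.time()) if now is None else int(now)
    start, end = now, now + duration_s
    fields = [
        "SCHEDULE_SVC_DOWNTIME",
        host,
        service,
        start,
        end,
        1,           # fixed downtime: active from start to end
        0,           # trigger_id 0: not triggered by another downtime
        duration_s,  # only meaningful for flexible downtime, kept for completeness
        author,
        comment,
    ]
    return "[{}] {}".format(now, ";".join(str(f) for f in fields))


# Hypothetical example: downtime a single slow check for 24 hours.
print(svc_downtime_command("db1001", "MariaDB replication lag",
                           24 * 3600, "alice", "initial data import"))
```

The resulting line would be written to the Icinga command file, whose path depends on the installation.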

Please let me know if there is any case not yet covered/known regarding the reimage part.

re: point 2, my take is that only test/dev hosts should have notifications disabled (via puppet), and disabling notifications via the Icinga UI should be discouraged. Definitely yes to reviewing disabled notifications at the alert review (or periodically anyway), since it should be straightforward to decide whether a host should have notifications disabled or not.
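
A periodic review like this could start from a list of objects with notifications disabled. The sketch below assumes an Icinga 1 style `status.dat` made of `hoststatus { ... }` and `servicestatus { ... }` blocks with `key=value` lines; the function name and the sample host names are hypothetical.

```python
def disabled_notifications(status_text):
    """Return (host, service_or_None) for every status block
    that has notifications_enabled=0."""
    results = []
    block_type, attrs = None, {}
    for raw in status_text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            # Start of a new block, e.g. "hoststatus {"
            block_type, attrs = line[:-1].strip(), {}
        elif line == "}":
            # End of block: record it if notifications are disabled
            if (block_type in ("hoststatus", "servicestatus")
                    and attrs.get("notifications_enabled") == "0"):
                results.append((attrs.get("host_name"),
                                attrs.get("service_description")))
            block_type = None
        elif block_type and "=" in line:
            key, _, value = line.partition("=")
            attrs[key] = value
    return results
```

Host-level entries come back with `None` as the service, so they are easy to separate from per-service entries during the review.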

re: point 3, Icinga hw is coming! T251644: Icinga refresh hardware selection (2020). I definitely see the dissonance in having hosts checked but notifications disabled forever, but I'm not sure disabled notifications make a significant dent in the grand scheme of things.

colewhite triaged this task as Medium priority. May 6 2020, 3:24 PM
colewhite subscribed.