
remove cloud "dev" hosts from Icinga?
Closed, Resolved · Public

Description

When looking at Icinga, it's noticeable that we often have alerts on machines called "cloudvirt*-dev".

These checks often have notifications disabled, and when you search SAL or Phabricator you can't find a reference to ongoing work.

The "dev" part and the disabled notifications make me wonder if it makes sense to have them in Icinga in the first place.

In the past we put some work into adding parameters to the monitoring classes that make it possible to disable base checks for certain hosts. For example, this was done via a regex in Hiera for hosts with 'test' in their name.
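As a rough illustration of that regex approach, here is a minimal Python sketch. The pattern, the function name `base_checks_enabled`, and the example hostnames are all hypothetical and not taken from the actual Hiera data or Puppet code; it only shows how a hostname regex could decide whether base checks apply to a host.

```python
import re

# Hypothetical pattern: treat short hostnames containing "test" or "-dev"
# as non-production. The real Hiera regex may look different.
NON_PRODUCTION = re.compile(r"(test|-dev)")


def base_checks_enabled(fqdn: str) -> bool:
    """Return False when the short hostname matches the non-production pattern."""
    # Match only against the short name, so a domain such as
    # "dev.example.org" does not accidentally disable checks.
    short_name = fqdn.split(".", 1)[0]
    return NON_PRODUCTION.search(short_name) is None


for host in ("cloudvirt2001-dev.example.org", "cloudvirt1001.example.org"):
    print(host, base_checks_enabled(host))
```

Matching on the short hostname rather than the full name is a design choice made for this sketch, to keep the rule from depending on the domain.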

Wouldn't it make sense to remove these altogether? Hosts that have "dev" in their name seem to be for development / testing and, by definition, are not production hosts.

Should we really add those checks (to an already overloaded Icinga) only to then regularly disable notifications? It seems a bit of a waste of resources, both for the machine and for the humans checking the Icinga web UI.

If we do want to keep them, could we please ACK or downtime these hosts instead of disabling notifications? That way they would not show up as "unhandled". A downtime with a link to a ticket would be ideal.

Event Timeline

Dzahn created this task. Apr 21 2020, 10:50 AM
Restricted Application edited projects, added cloud-services-team (Kanban); removed cloud-services-team. Apr 21 2020, 10:50 AM
Restricted Application added a subscriber: Aklapper.
Dzahn updated the task description. Apr 21 2020, 10:51 AM
JHedden triaged this task as Medium priority. Apr 21 2020, 4:06 PM

The hosts in codfw are used for platform testing and staging. It's useful to have these in Icinga, but we don't need email notifications for them, and they don't need to appear on the alerts sub-page dashboard. Potentially we can add a host and service downtime for a _very_ long time.

Agreed, the hosts might as well not show up on https://icinga.wikimedia.org/alerts. It seems to me we can extend what we do for test hosts to these dev hosts?

We can simply set profile::base::notifications=disabled in Hiera for these?


I didn't check the code, but disabling notifications might still leave them showing on /alerts.

Dzahn added a comment. Apr 27 2020, 1:12 PM

The goal is to clear the "unacknowledged services CRITICAL" column in Icinga. Disabling notifications is not a form of acknowledging; scheduling a downtime or ACKing is.
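To make the distinction concrete, here is a small Python sketch of the rule as described in this thread. It is a simplified model, not Icinga's actual implementation; the `ServiceStatus` class and `is_unhandled` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Simplified, hypothetical view of a service's state in the monitoring UI."""
    state: str                          # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    acknowledged: bool = False
    in_downtime: bool = False
    notifications_enabled: bool = True


def is_unhandled(svc: ServiceStatus) -> bool:
    """A problem counts as unhandled unless it is ACKed or in a scheduled downtime."""
    if svc.state == "OK":
        return False
    # notifications_enabled is deliberately ignored: turning notifications
    # off silences the alert but does not mark the problem as handled.
    return not (svc.acknowledged or svc.in_downtime)


print(is_unhandled(ServiceStatus("CRITICAL", notifications_enabled=False)))
print(is_unhandled(ServiceStatus("CRITICAL", in_downtime=True)))
```

Under this model, a CRITICAL service with notifications disabled still counts as unhandled, while the same service in a downtime does not.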

Andrew claimed this task. May 5 2020, 4:41 PM
Andrew added a subscriber: Andrew.

It's useful to have these visible in Icinga, but they ought to all be downtimed (pretty much forever). I'll double-check that this is currently the case.

Andrew closed this task as Resolved. May 5 2020, 6:05 PM

I've downtimed all cloud*-dev.* servers until 2030.
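The thread does not say how the downtime was scheduled. One possible way, sketched below in Python, is to build classic Nagios/Icinga 1 external commands (`SCHEDULE_HOST_DOWNTIME` and `SCHEDULE_HOST_SVC_DOWNTIME`) and write them to the command pipe. The helper name `downtime_commands`, the host name, the author, and the comment text are assumptions made for illustration.

```python
from datetime import datetime, timezone


def downtime_commands(host, end, author, comment, now=None):
    """Build external-command lines that downtime a host and all of its services.

    Field order after the command name:
    host;start;end;fixed;trigger_id;duration;author;comment
    """
    if ";" in comment or ";" in author:
        # Semicolons are the field separator, so they cannot appear in free text.
        raise ValueError("author and comment must not contain ';'")
    now = now or datetime.now(timezone.utc)
    start_ts = int(now.timestamp())
    end_ts = int(end.timestamp())
    # fixed=1 (starts and ends at the given times), trigger_id=0 (not triggered).
    fields = f"{host};{start_ts};{end_ts};1;0;{end_ts - start_ts};{author};{comment}"
    return [
        f"[{start_ts}] SCHEDULE_HOST_DOWNTIME;{fields}",
        f"[{start_ts}] SCHEDULE_HOST_SVC_DOWNTIME;{fields}",
    ]


# Hypothetical usage: downtime one dev host until the start of 2030.
end_of_downtime = datetime(2030, 1, 1, tzinfo=timezone.utc)
for line in downtime_commands("cloudvirt2001-dev", end_of_downtime,
                              "example-user", "dev host, see task"):
    print(line)
```

Putting a ticket reference in the comment field would match the request in the description for a downtime that links to a ticket.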