
How to page when a host is down?
Closed, ResolvedPublic

Description

The nova-network service on labnet1002 is critical and needs to page. As of today, it is configured to do so.

Alas, if the whole system goes down, the icinga PING test will suppress the nova-network service test. And the PING test does not, itself, page. So we'll get pages if the service dies but not if the whole box goes down.

Is there a way to mark a particular host as critical so that we get paged if the PING test fails? If not, we need a way.

Event Timeline

Andrew set the priority of this task to Needs Triage.
Andrew updated the task description.
Andrew added a project: acl*sre-team.
Andrew added a subscriber: Andrew.

The monitoring::host class has a contact_group parameter, like the monitoring::service class does.

monitoring::host { $::hostname:
    contact_group => $contact_group

and this is used in base and defaults to "admins".

class base::monitoring::host(
     $contact_group = 'admins',

You would have to override that and set it to "admins,sms" to get paging for the host.
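
As a minimal sketch of that override, assuming base::monitoring::host is declared with resource-like syntax in the node or role manifest (the placement is an assumption; only the class name, the parameter, and the "admins,sms" value come from the discussion above):

```puppet
# Sketch only: declare the class with a non-default contact group so that
# host-level (PING) failures notify the "sms" group as well as "admins".
class { 'base::monitoring::host':
    contact_group => 'admins,sms',
}
```

If the class is already included elsewhere for the host, the same value would more likely be set through Hiera under the key base::monitoring::host::contact_group, since a class can only be declared once with explicit parameters.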

But... is that really needed?

You say: "if the whole system goes down, the icinga PING test will suppress the nova-network service test." but I'm not so sure about that.

I think as long as you do _not_ define any service dependencies [http://docs.icinga.org/latest/en/dependencies.html] nothing will be suppressed by another check. So, if the host is down, naturally the service on it will also be down, and page.
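
For reference, an explicit dependency of the kind the linked page describes might look like the following sketch. It uses Puppet's nagios_servicedependency resource type (part of core Puppet in older releases, and not necessarily what this repository uses); the resource title and both service descriptions are assumptions for illustration.

```puppet
# Hypothetical example: suppress nova-network notifications while the
# PING check on the same host is CRITICAL or UNKNOWN.
nagios_servicedependency { 'nova-network-depends-on-ping':
    host_name                     => 'labnet1002',
    service_description           => 'PING',          # assumed name of the master check
    dependent_host_name           => 'labnet1002',
    dependent_service_description => 'nova-network',  # assumed name of the dependent check
    notification_failure_criteria => 'c,u',
}
```

Without an object like this, the argument here is that each service check notifies on its own.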

In other cases this is what annoys us, right? That there are no dependencies defined and if one thing goes down we get all the notifications for all the services on it.

But let's test if we can?

I agree with @Dzahn. It seems what we need is higher-level checks that Labs networking is functioning, and to page on those. In general I think we're better off having pages for high-level service status, in other words for things as perceived by users.

Agreed, define a high level test which checks Labs networking and is not dependent on a single node being up.
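
A rough sketch of such a check, assuming monitoring::service accepts description and check_command parameters alongside the contact_group mentioned earlier, and assuming a check command named check_labs_network exists (it is hypothetical, as is the resource title):

```puppet
# Sketch only: a service-level check that pages, attached to the monitoring
# setup rather than to the health of one particular labnet host.
monitoring::service { 'labs-network-reachability':
    description   => 'Labs instance networking',  # hypothetical description
    check_command => 'check_labs_network',        # hypothetical check command
    contact_group => 'admins,sms',                # page, as discussed for the host check
}
```

The point of the design is that the check probes something users depend on, for example reaching an instance through the network node, so it fails whether the service or the whole box is down.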

I think as long as you do _not_ define any service dependencies [http://docs.icinga.org/latest/en/dependencies.html] nothing will be suppressed by another check. So, if the host is down, naturally the service on it will also be down, and page.

This goes to the root of my question. I can see from the IRC logs that when labnet1001 went down there was only one alert: host down. None of the other services local to that box said anything on IRC. Does that mean that those other alerts didn't fire, or is it just that the IRC system was somehow smart about it?

Does that mean that those other alerts didn't fire, or is it just that the IRC system was somehow smart about it?

Since history rotates so quickly on icinga it's not easy to tell from notification logs anymore, but "Histogram" still has a spike for a CRIT that happened at the same time, one each for:

You meant labnet1002, right? The labnet1001 downtime was a while longer ago, per Icinga.

@Andrew Did that answer the question or should the ticket stay open?

Dzahn triaged this task as Medium priority.

ok, everyone seems to think that this is not a problem, so I will close and reopen if it turns out to be a problem at some point.