
Investigate usage of service dependencies in icinga
Closed, Declined (Public)

Description

Icinga/Nagios supports the concept of service dependencies. The idea is to make some services depend on others, so that when a parent service fails, alerts for its dependent services are suppressed instead of spamming mailboxes, channels, and paging devices. The most obvious first case would be to stop the flood of alerts we receive when a host goes down and Icinga alerts for every service on that host.

Event Timeline

faidon triaged this task as Medium priority. Jul 20 2017, 1:20 PM

With regard to stopping the flood of alerts during a host down condition, my understanding is that service -> host dependencies are automatic in nagios/icinga on the condition that host down happens before service down. However, it looks like our current check intervals may be transitioning services down before host down.

Here are intervals pulled from a few hosts/services at random (from “view config” in the icinga web interface):

Host check - 6 min total until “down”:

check_interval     5m
retry_interval     1m
max_check_attempts  2

Service check - 3 min total until “down”:

check_interval     1m
retry_interval     1m
max_check_attempts  3

Maybe we could try aligning the check_interval of host and service checks, see if it helps?
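The "total until down" figures above can be reproduced with a little arithmetic. The sketch below assumes the worst case, where a failure starts right after a passing check, so the first failing check comes one check_interval later and is followed by (max_check_attempts - 1) retries. The `CheckTiming` class and `minutes_until_hard_down` function are hypothetical names used only for illustration; they are not part of icinga.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckTiming:
    """Timing directives for one host or service check, in minutes."""
    check_interval: int      # time between checks while the state is OK
    retry_interval: int      # time between rechecks after a first failure
    max_check_attempts: int  # failed attempts needed to reach a hard state


def minutes_until_hard_down(t: CheckTiming) -> int:
    # Assumed worst case: wait one full check_interval for the first failing
    # check, then one retry_interval for each remaining attempt.
    return t.check_interval + (t.max_check_attempts - 1) * t.retry_interval


# Values taken from the "view config" excerpts in this comment.
host = CheckTiming(check_interval=5, retry_interval=1, max_check_attempts=2)
service = CheckTiming(check_interval=1, retry_interval=1, max_check_attempts=3)

print("host:", minutes_until_hard_down(host), "min")        # host: 6 min
print("service:", minutes_until_hard_down(service), "min")  # service: 3 min
```

Under this model, setting the host check_interval to 1m would bring the host total down to 2 minutes, below the 3 minutes for services, provided host checks actually run on that schedule.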

With regard to stopping the flood of alerts during a host down condition, my understanding is that service -> host dependencies are automatic in nagios/icinga on the condition that host down happens before service down.

That understanding is correct, but the condition mentioned is not. Host checking in nagios/icinga happens on demand, which effectively means that a failed service check schedules a host check. Host checks DO NOT happen on the standard schedule. This is documented in https://www.icinga.com/docs/icinga1/latest/en/checkscheduling.html#hostcheckscheduling. The check_interval directive in the host stanza is used only when the Obsess Over Hosts option is set to 1.

This means that quite often a service check fails and the host check is scheduled, but in the meantime more service checks fail and alert. This is exacerbated by our rather large check queue (we have a large installation).

However, it looks like our current check intervals may be transitioning services down before host down.

Here are intervals pulled from a few hosts/services at random (from “view config” in the icinga web interface):

Host check - 6 min total until “down”:

check_interval     5m
retry_interval     1m
max_check_attempts  2

Service check - 3 min total until “down”:

check_interval     1m
retry_interval     1m
max_check_attempts  3

Maybe we could try aligning the check_interval of host and service checks, see if it helps?

It's the retry_interval and max_check_attempts that are important here, and those are aligned already. But we can set check_interval to 1m as well for consistency.

fgiunchedi subscribed.

We're moving away from Icinga, so I'm declining the task. The issue does remain valid in an Alertmanager world, though: there we can (and do) use the "alert inhibition" feature (e.g. to silence warnings if a critical of the same alert is already firing), and we can certainly extend those rules to cover more use cases.
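To make the inhibition idea concrete, here is a simplified model of how such a rule decides which alerts to mute. It is a sketch of the concept only, not Alertmanager's implementation: the function names, the label names (`alertname`, `instance`, `severity`) and the sample alerts are all assumptions chosen for illustration. A target alert is muted when some other firing alert matches the source side and agrees with it on every label listed in `equal`.

```python
from typing import Mapping, Sequence

Labels = Mapping[str, str]


def matches(labels: Labels, matchers: Labels) -> bool:
    """True if the alert carries every label=value pair in matchers."""
    return all(labels.get(k) == v for k, v in matchers.items())


def is_inhibited(alert: Labels, firing: Sequence[Labels],
                 source_match: Labels, target_match: Labels,
                 equal: Sequence[str]) -> bool:
    # Only alerts matching the target side can be muted by this rule.
    if not matches(alert, target_match):
        return False
    # Muted if another firing alert matches the source side and has the
    # same value for every label listed in `equal`.
    return any(
        other is not alert
        and matches(other, source_match)
        and all(other.get(name) == alert.get(name) for name in equal)
        for other in firing
    )


# Hypothetical alerts: db1 fires both severities, db2 only a warning.
firing = [
    {"alertname": "DiskSpace", "instance": "db1", "severity": "critical"},
    {"alertname": "DiskSpace", "instance": "db1", "severity": "warning"},
    {"alertname": "DiskSpace", "instance": "db2", "severity": "warning"},
]

to_notify = [
    a for a in firing
    if not is_inhibited(a, firing,
                        source_match={"severity": "critical"},
                        target_match={"severity": "warning"},
                        equal=["alertname", "instance"])
]

for a in to_notify:
    print(a["instance"], a["severity"])
```

The host-down case from the task description would fit the same shape: a hypothetical host-level alert as the source, every per-service alert as the target, and the host label in `equal`.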