
Investigate usage of service dependencies in icinga
Open, Medium, Public

Description

Icinga/Nagios support the concept of service dependencies. The idea is to make services dependent on other services, so that alerts for the dependent services are suppressed instead of spamming mailboxes/channels/paging devices. The most obvious first case would be to stop the flood of alerts we receive when a host goes down and Icinga alerts for every service on that host.
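For illustration, here is a minimal sketch of what a dependency definition could look like, written as a small Python helper that renders a Nagios/Icinga 1 style `servicedependency` stanza. The host and service names are hypothetical placeholders, the helper itself is not part of any existing tooling, and the choice of `notification_failure_criteria` is an assumption rather than a recommendation from this task.

```python
def render_service_dependency(host, master_service, dependent_service,
                              notification_failure_criteria="w,u,c"):
    """Render a Nagios/Icinga 1 style servicedependency stanza as text.

    Notifications for `dependent_service` are suppressed while
    `master_service` is in one of the states listed in
    `notification_failure_criteria` (w=warning, u=unknown, c=critical).
    """
    directives = [
        ("host_name", host),
        ("service_description", master_service),
        ("dependent_host_name", host),
        ("dependent_service_description", dependent_service),
        ("notification_failure_criteria", notification_failure_criteria),
    ]
    body = "\n".join(f"    {key:<32}{value}" for key, value in directives)
    return "define servicedependency {\n" + body + "\n}"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(render_service_dependency("example-host", "ssh", "disk space"))
```

In this sketch both services live on the same host; a dependency across hosts would simply use a different `dependent_host_name`.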

Event Timeline

Restricted Application added a subscriber: Aklapper. Dec 12 2016, 4:00 PM
faidon triaged this task as Medium priority. Jul 20 2017, 1:20 PM
herron added a subscriber: herron. Jul 28 2017, 8:44 PM

With regard to stopping the flood of alerts during a host down condition, my understanding is that service -> host dependencies are automatic in nagios/icinga on the condition that host down happens before service down. However, it looks like our current check intervals may be transitioning services down before host down.

Here are intervals pulled from a few hosts/services at random (from icinga “view config” in web)

Host check - 6 min total until “down”:

check_interval     5m
retry_interval     1m
max_check_attempts  2

Service check - 3 min total until “down”:

check_interval     1m
retry_interval     1m
max_check_attempts  3

Maybe we could try aligning the check_interval of host and service checks, see if it helps?
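The "6 min" and "3 min" totals above appear to come from adding one full check_interval to the remaining retries. A minimal sketch of that arithmetic follows, assuming a worst case in which the failure starts right after a passing check; the function name is made up for illustration.

```python
def minutes_until_hard_down(check_interval, retry_interval, max_check_attempts):
    """Worst-case minutes from the start of a failure to the hard "down" state.

    Assumes the failure begins just after a successful check, so a full
    check_interval passes before the first failed check, followed by
    (max_check_attempts - 1) rechecks spaced retry_interval apart.
    """
    return check_interval + (max_check_attempts - 1) * retry_interval


if __name__ == "__main__":
    print("host:", minutes_until_hard_down(5, 1, 2), "min")
    print("service:", minutes_until_hard_down(1, 1, 3), "min")
```

With the values listed above this gives 6 minutes for the host check and 3 minutes for the service check.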

> With regard to stopping the flood of alerts during a host down condition, my understanding is that service -> host dependencies are automatic in nagios/icinga on the condition that host down happens before service down.

That understanding is correct; the mentioned condition, though, is not. Host checking in nagios/icinga happens on demand, which effectively means that a failed service check schedules a host check. Host checks DO NOT happen on the standard schedule. This is documented in https://www.icinga.com/docs/icinga1/latest/en/checkscheduling.html#hostcheckscheduling. The check_interval directive on the host stanza is used only when the Obsess Over Hosts Option is set to 1.

This means that quite often a service check fails and the host check is scheduled, but in the meantime more service checks have failed and alerted. This is exacerbated by our rather large queue (as we have a large installation).
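To make the race described above concrete, here is a toy model in Python. It is only a sketch of the behaviour as described in this comment, not of Icinga's actual scheduler: the queue delay, the function name, and the assumption that the first failing service check is what triggers the host check are all simplifications introduced for illustration.

```python
def alerts_before_host_down(service_first_failures, queue_delay,
                            host_retry=1.0, host_attempts=2,
                            service_retry=1.0, service_attempts=3):
    """Count services that reach a hard state before the host does.

    service_first_failures: minutes at which each service first fails.
    queue_delay: assumed minutes the on-demand host check waits in the queue.
    """
    if not service_first_failures:
        return 0
    # The earliest failing service check is what schedules the host check.
    first_failure = min(service_first_failures)
    host_down_at = first_failure + queue_delay + (host_attempts - 1) * host_retry
    # Time a service needs after its first failure to exhaust its retries.
    service_window = (service_attempts - 1) * service_retry
    return sum(1 for t in service_first_failures
               if t + service_window < host_down_at)


if __name__ == "__main__":
    failures = [0.0, 0.2, 0.5, 0.9]  # hypothetical first-failure times
    for delay in (0.0, 3.0):
        count = alerts_before_host_down(failures, delay)
        print(f"queue delay {delay} min -> {count} service alerts")
```

In this model, with no queue delay the host goes down before any service exhausts its retries, while a queue delay of a few minutes lets every service alert first.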

> However, it looks like our current check intervals may be transitioning services down before host down.
> Here are intervals pulled from a few hosts/services at random (from icinga “view config” in web)
> Host check - 6 min total until “down”:
>
> check_interval     5m
> retry_interval     1m
> max_check_attempts  2
>
> Service check - 3 min total until “down”:
>
> check_interval     1m
> retry_interval     1m
> max_check_attempts  3
>
> Maybe we could try aligning the check_interval of host and service checks, see if it helps?

It's the retry_interval and max_check_attempts that are important here, and those are aligned already. But we can set check_interval to 1m as well for consistency.

fgiunchedi moved this task from Inbox to Backlog on the observability board. Dec 10 2019, 2:17 PM