Investigate usage of service dependencies in icinga
Closed, Declined · Public

Assigned To

None

Authored By

	akosiaris
	Dec 12 2016, 4:00 PM

Description

Icinga/Nagios support the concept of service dependencies. The idea is to make services dependent on other services, so that redundant alerts for the dependent services are suppressed instead of spamming mailboxes/channels/paging devices. The most obvious first case would be to stop the flood of alerts we receive when a host goes down and Icinga alerts for every service on that host.

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T152967 Investigate usage of service dependencies in icinga
		Resolved		herron	T172131 Investigate check_nrpe -u option to reduce critical alerts

Event Timeline

akosiaris created this task. · Dec 12 2016, 4:00 PM

Restricted Application added a subscriber: Aklapper. · Dec 12 2016, 4:00 PM

faidon triaged this task as Medium priority. · Jul 20 2017, 1:20 PM

With regard to stopping the flood of alerts during a host down condition: my understanding is that service -> host dependencies are automatic in Nagios/Icinga, on the condition that host down happens before service down. However, it looks like our current check intervals may be transitioning services to down before the host is marked down.

Here are intervals pulled from a few hosts/services at random (from icinga “view config” in web)

Host check - 6 min total until “down”:

check_interval     5m
retry_interval     1m
max_check_attempts  2

Service check - 3 min total until “down”:

check_interval     1m
retry_interval     1m
max_check_attempts  3

Maybe we could try aligning the check_interval of host and service checks, see if it helps?
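
The "total until down" figures above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the worst case where the failure happens right after a successful check, so one full check_interval elapses before the first failed attempt and the remaining attempts are spaced retry_interval apart. The function name is made up for illustration; only the three directive names come from the config shown above.

```python
def minutes_until_hard_down(check_interval, retry_interval, max_check_attempts):
    """Worst-case minutes from an outage to a hard DOWN/CRITICAL state.

    Assumption: the outage starts just after a successful check, so the first
    failed attempt happens one check_interval later, followed by
    (max_check_attempts - 1) retries spaced retry_interval apart.
    """
    return check_interval + retry_interval * (max_check_attempts - 1)


# Values taken from the host and service settings quoted above (in minutes).
host = minutes_until_hard_down(check_interval=5, retry_interval=1, max_check_attempts=2)
service = minutes_until_hard_down(check_interval=1, retry_interval=1, max_check_attempts=3)
print(f"host: {host} min, service: {service} min")  # host: 6 min, service: 3 min
```

Under this model, setting the host check_interval to 1m would bring the host total down to 2 minutes, below the 3 minutes for services.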

herron added a subtask: T172131: Investigate check_nrpe -u option to reduce critical alerts. · Jul 31 2017, 3:54 PM

In T152967#3482611, @herron wrote:

With regard to stopping the flood of alerts during a host down condition: my understanding is that service -> host dependencies are automatic in Nagios/Icinga, on the condition that host down happens before service down.

That understanding is correct, but the stated condition is not. Host checking in Nagios/Icinga happens on demand, which effectively means that a failed service check schedules a host check. Host checks DO NOT happen on the standard schedule. This is documented in https://www.icinga.com/docs/icinga1/latest/en/checkscheduling.html#hostcheckscheduling. The check_interval directive on the host stanza is used only when the Obsess Over Hosts option is set to 1.

This means that, quite often, a service check fails and a host check is scheduled, but in the meantime more service checks have failed and alerted. This is exacerbated by our rather large check queue (as we have a large installation).
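
To make that race concrete, here is a toy model of it. It is only a sketch under stated assumptions, not Icinga's actual scheduler: the first failing service check triggers an on-demand host check, every host check attempt waits a fixed number of minutes in the queue, and host retries are spaced retry_interval apart. All function and parameter names are hypothetical, and the sample times are invented for illustration.

```python
def host_down_time(first_service_failure, queue_latency, retry_interval, max_check_attempts):
    """Minute at which the host reaches a hard DOWN state in this toy model.

    Assumptions (not Icinga's real scheduler): the first attempt is queued at
    first_service_failure, each attempt waits queue_latency minutes, and each
    retry is queued retry_interval minutes after the previous attempt finishes.
    """
    first_attempt_done = first_service_failure + queue_latency
    retries = max_check_attempts - 1
    return first_attempt_done + retries * (retry_interval + queue_latency)


def alerts_before_host_down(service_alert_times, host_down_at):
    """Count service alerts that fire before the host is known to be down."""
    return sum(1 for t in service_alert_times if t < host_down_at)


# Invented example: services reach a hard CRITICAL state at these minutes.
service_alert_times = [2, 3, 4, 6, 7]

slow_queue = host_down_time(0, queue_latency=2, retry_interval=1, max_check_attempts=2)
fast_queue = host_down_time(0, queue_latency=0, retry_interval=1, max_check_attempts=2)

print(slow_queue, alerts_before_host_down(service_alert_times, slow_queue))  # 5 3
print(fast_queue, alerts_before_host_down(service_alert_times, fast_queue))  # 1 0
```

In this model the number of service alerts that escape suppression grows with queue latency, which is the effect described above for a large installation.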

However, it looks like our current check intervals may be transitioning services down before host down.

Here are intervals pulled from a few hosts/services at random (from icinga “view config” in web)

Host check - 6 min total until “down”:

check_interval     5m
retry_interval     1m
max_check_attempts  2

Service check - 3 min total until “down”:

check_interval     1m
retry_interval     1m
max_check_attempts  3

Maybe we could try aligning the check_interval of host and service checks, see if it helps?

It's the retry_interval and max_check_attempts that are important here, and those are aligned already. But we can set check_interval to 1m as well for consistency.

herron closed subtask T172131: Investigate check_nrpe -u option to reduce critical alerts as Resolved. · Sep 1 2017, 2:56 PM

fgiunchedi moved this task from Inbox to Backlog on the observability board. · Dec 10 2019, 2:17 PM

lmata edited projects, added SRE Observability; removed observability. · Jul 12 2021, 2:22 AM

lmata moved this task from Inbox to Backlog on the SRE Observability board. · Jul 15 2021, 4:09 AM

lmata edited projects, added Observability-Alerting; removed SRE Observability. · Jan 31 2022, 1:36 PM

We're moving away from Icinga, so I'm declining the task. The issue does remain valid in an Alertmanager world, though, where we can (and do) use the "alert inhibition" feature (e.g. to silence warnings if a critical of the same alert is already firing), and we can certainly extend those rules to cover more use cases.
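
For readers unfamiliar with inhibition, the warning/critical case mentioned above can be sketched roughly as follows. This is a simplified model of how an inhibit rule is evaluated, not Alertmanager's implementation or our production config: a target alert is muted when some other firing alert matches the source side of the rule and both alerts agree on every label listed under "equal". The alert names, instance names, and the dictionary layout of the rule are made up for illustration.

```python
def matches(alert, matchers):
    """True if the alert carries every label/value pair in matchers."""
    return all(alert.get(label) == value for label, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Simplified inhibition check (a sketch, not Alertmanager's code).

    The target is muted if it matches rule["target_match"] and another firing
    alert matches rule["source_match"] with identical values for all labels
    in rule["equal"].
    """
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, rule["source_match"]):
            continue
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


# Hypothetical rule: a critical alert mutes the warning of the same alert.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}

# Hypothetical alerts, represented as plain label dictionaries.
critical = {"alertname": "DiskFull", "instance": "db1", "severity": "critical"}
warning = {"alertname": "DiskFull", "instance": "db1", "severity": "warning"}
other = {"alertname": "DiskFull", "instance": "db2", "severity": "warning"}
firing = [critical, warning, other]

print(is_inhibited(warning, firing, rule))   # True
print(is_inhibited(other, firing, rule))     # False
print(is_inhibited(critical, firing, rule))  # False
```

Extending this to the host-down case from the task description would amount to a rule whose source side matches a host-level alert and whose "equal" list contains the label identifying the host.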

lmata moved this task from Inbox to Done on the Observability-Alerting board. · Jan 16 2023, 5:58 PM
