alerting_host: Reduced availability for job icinga-am after failover event
Closed, Declined (Public)

Description

After moving the active alerting host from alert1001 to alert2001 this alert fired:

(JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable

However, I don't think anything is actually wrong: icinga is not currently active on alert1001, and puppet ensures the related service is stopped on the non-active host:

(alert1001)
Notice: /Stage[main]/Prometheus::Icinga_exporter/Systemd::Service[prometheus-icinga-am]/Service[prometheus-icinga-am]/ensure: ensure changed 'running' to 'stopped' (corrective)

Still, it'd be good to find a way to prevent this alert from firing during future alerting host failovers.

Event Timeline

I wonder if simply leaving prometheus-icinga-am enabled on both alerting hosts would solve the issue? @fgiunchedi what do you think?

herron triaged this task as Medium priority. Apr 3 2023, 2:45 PM

Change 905244 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] icinga_exporter: run service on both active and standby hosts

https://gerrit.wikimedia.org/r/905244

If I recall correctly, the idea is to send only the active icinga's alerts to AM to reflect "reality", in the sense that we get notifications only from the active icinga, not both. In other words, the alerts users see on icinga.w.o should match those on alerts.w.o (with source=icinga).
AFAICS the alert does resolve itself once puppet has run on the alert hosts and the prometheus hosts (codfw/eqiad) and things converge, HTH!

Thank you! Thinking out loud we could potentially take the integration a step further and deduplicate alerts from multiple concurrent icingas. As in icinga and the prometheus icinga alertmanager exporter would run active/active, but we configure the prometheus side to display/send alerts once.

In theory that'd also allow us to (eventually) transition icinga notifications to alertmanager and move away from the notion of an "active host" which would eliminate our current icinga spof, and need for puppet based icinga failover too.

I was thinking enabling the exporter on both nodes would be a step in that direction, although the related patch may come with side-effects as-is. Off hand do you know if enabling prometheus-icinga-am on both alert hosts would result in duplicate alerts appearing in karma?

> Thank you! Thinking out loud we could potentially take the integration a step further and deduplicate alerts from multiple concurrent icingas. As in icinga and the prometheus icinga alertmanager exporter would run active/active, but we configure the prometheus side to display/send alerts once.

I'm not sure I understand the last part: "we configure the prometheus side to display/send alerts once".

> In theory that'd also allow us to (eventually) transition icinga notifications to alertmanager and move away from the notion of an "active host" which would eliminate our current icinga spof, and need for puppet based icinga failover too.
>
> I was thinking enabling the exporter on both nodes would be a step in that direction, although the related patch may come with side-effects as-is. Off hand do you know if enabling prometheus-icinga-am on both alert hosts would result in duplicate alerts appearing in karma?

Yes, I think alertmanager would already de-duplicate the alerts if we were to run icinga-am active/active.
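
To make the de-duplication point concrete, here is a minimal Python sketch. It is not Alertmanager's actual code. It relies on the fact that Alertmanager identifies an alert by its label set, and it assumes (not verified in this task) that prometheus-icinga-am on both hosts would attach identical labels to the same Icinga problem. All label names and values below, including `exporter_host`, are hypothetical.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def fingerprint(labels: Labels) -> Tuple[Tuple[str, str], ...]:
    """Order-independent identity for an alert: its sorted label pairs."""
    return tuple(sorted(labels.items()))


def deduplicate(alerts: List[Labels]) -> List[Labels]:
    """Keep one alert per distinct label set, preserving first-seen order."""
    seen: Dict[Tuple[Tuple[str, str], ...], Labels] = {}
    for labels in alerts:
        seen.setdefault(fingerprint(labels), labels)
    return list(seen.values())


# Hypothetical: the same Icinga problem reported by both exporters
# with identical labels collapses into a single alert.
from_alert1001 = [{"alertname": "example check", "instance": "host1", "source": "icinga"}]
from_alert2001 = [{"alertname": "example check", "instance": "host1", "source": "icinga"}]
print(len(deduplicate(from_alert1001 + from_alert2001)))  # 1

# Hypothetical: if each exporter added a label naming itself, the label
# sets would differ and both copies would survive as separate alerts.
tagged = [
    {**from_alert1001[0], "exporter_host": "alert1001"},
    {**from_alert2001[0], "exporter_host": "alert2001"},
]
print(len(deduplicate(tagged)))  # 2
```

So whether active/active would show duplicates in karma comes down to whether the two exporters emit exactly the same labels for the same problem.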

My concern is icinga "split brain", in which the passive host has firing alerts and the active host doesn't. Those alerts reach alertmanager on alerts.w.o but don't show up in icinga.w.o. Maybe it isn't a real issue though! That said, personally I think moving off Icinga should take priority over "active/active icinga", especially given how infrequently we do icinga/alert failovers.
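
The "split brain" case above can be sketched as a set difference. This is only an illustration under assumed data: the alert names are hypothetical, and the model simply treats everything either host exports as reaching alertmanager, while icinga.w.o shows only the active host's alerts.

```python
from typing import Set


def visible_only_in_alertmanager(active: Set[str], passive: Set[str]) -> Set[str]:
    """Alerts that reach Alertmanager but are absent from the active Icinga."""
    reaching_alertmanager = active | passive  # both exporters running
    return reaching_alertmanager - active     # icinga.w.o shows only `active`


# Hypothetical alert names for illustration.
active_alerts = {"host1/disk space"}
passive_alerts = {"host1/disk space", "host2/puppet last run"}

print(sorted(visible_only_in_alertmanager(active_alerts, passive_alerts)))
# ['host2/puppet last run']
```

When both hosts agree the result is empty; any non-empty result is exactly the mismatch between alerts.w.o and icinga.w.o described above.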

Change 905244 abandoned by Herron:

[operations/puppet@production] icinga_exporter: run service on both active and standby hosts

Reason:

https://gerrit.wikimedia.org/r/905244