alerting_host: Reduced availability for job icinga-am after failover event
Closed, Declined, Public

Assigned To

Authored By

	herron
	Apr 3 2023, 2:43 PM

Description

After moving the active alerting host from alert1001 to alert2001, this alert fired:

(JobUnavailable) firing: (2) Reduced availability for job icinga-am in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable

However, I don't think anything is actually wrong: icinga is not currently active on alert1001, and puppet is ensuring the related service (prometheus-icinga-am) is stopped on the non-active host:

(alert1001)
Notice: /Stage[main]/Prometheus::Icinga_exporter/Systemd::Service[prometheus-icinga-am]/Service[prometheus-icinga-am]/ensure: ensure changed 'running' to 'stopped' (corrective)

Still, it'd be good to find a way to keep this alert from firing during future alerting host failovers.

Details

	Subject	Repo	Branch	Lines +/-
	icinga_exporter: run service on both active and standby hosts	operations/puppet	production	+1 -1


Related Objects

Status	Assigned	Task
Open	None	T253824 planned upstream deprecation of the ssh-rsa signing algorithm (RSA with SHA-1)
Resolved	ayounsi	T254013 all network devices must run OpenSSH >= 7.2p1 but != 7.4p1
Resolved	ayounsi	T317175 Junos: resolve DNS through mgmt_junos
Resolved	ayounsi	T327862 Use mgmt_junos on all network devices
		Restricted Task
Open	None	T316539 Upgrade network devices to Junos 20+
Resolved	ayounsi	T327248 eqiad/codfw virtual-chassis upgrades
Resolved	Clement_Goubert	T327920 March 2023 Datacenter Switchover
Resolved	ayounsi	T331882 eqiad row C switches upgrade
Resolved	herron	T333478 failover alert1001 to alert2001
Declined	herron	T333838 alerting_host: Reduced availability for job icinga-am after failover event

Event Timeline

I wonder if simply leaving prometheus-icinga-am enabled on both alerting hosts would solve the issue? @fgiunchedi what do you think?

herron triaged this task as Medium priority. Apr 3 2023, 2:45 PM

Change 905244 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] icinga_exporter: run service on both active and standby hosts

https://gerrit.wikimedia.org/r/905244

gerritbot added a project: Patch-For-Review. Apr 3 2023, 3:27 PM

If I recall correctly, the idea is to send only the active icinga's alerts to AM to reflect "reality", in the sense that we get notifications only from the active icinga, not both. In other words, the alerts users see on alerts.w.o (with source=icinga) should match what they see on icinga.w.o.
AFAICS the alert does resolve itself once puppet has run on the alert hosts and the prometheus hosts (codfw/eqiad) and things converge. HTH!

Thank you! Thinking out loud we could potentially take the integration a step further and deduplicate alerts from multiple concurrent icingas. As in icinga and the prometheus icinga alertmanager exporter would run active/active, but we configure the prometheus side to display/send alerts once.

In theory that'd also allow us to (eventually) transition icinga notifications to alertmanager and move away from the notion of an "active host" which would eliminate our current icinga spof, and need for puppet based icinga failover too.

I was thinking enabling the exporter on both nodes would be a step in that direction, although the related patch may come with side-effects as-is. Off hand do you know if enabling prometheus-icinga-am on both alert hosts would result in duplicate alerts appearing in karma?

In T333838#8754983, @herron wrote:

> Thank you! Thinking out loud we could potentially take the integration a step further and deduplicate alerts from multiple concurrent icingas. As in icinga and the prometheus icinga alertmanager exporter would run active/active, but we configure the prometheus side to display/send alerts once.

I'm not sure I understand the last part: "we configure the prometheus side to display/send alerts once".

> In theory that'd also allow us to (eventually) transition icinga notifications to alertmanager and move away from the notion of an "active host" which would eliminate our current icinga spof, and need for puppet based icinga failover too.
>
> I was thinking enabling the exporter on both nodes would be a step in that direction, although the related patch may come with side-effects as-is. Off hand do you know if enabling prometheus-icinga-am on both alert hosts would result in duplicate alerts appearing in karma?

Yes, I think alertmanager would already de-duplicate the alerts if we were to run icinga-am active/active.
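To illustrate the de-duplication point: Alertmanager identifies an alert by its label set, so two senders posting alerts with identical labels end up as a single alert. The sketch below only models that idea in plain Python; it is not Alertmanager's code, and the label names and values (including the host `db1001`) are made up for illustration. Whether both icinga-am exporters would actually emit identical label sets is an assumption that would need checking.

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]


def fingerprint(labels: Labels) -> Tuple[Tuple[str, str], ...]:
    """Order-independent identity for an alert: its sorted label pairs."""
    return tuple(sorted(labels.items()))


def deduplicate(alerts: Iterable[Labels]) -> List[Labels]:
    """Keep one alert per distinct label set, preserving first-seen order."""
    seen: Dict[Tuple[Tuple[str, str], ...], Labels] = {}
    for labels in alerts:
        seen.setdefault(fingerprint(labels), labels)
    return list(seen.values())


# Hypothetical alerts as each exporter might send them (assumed identical labels).
from_alert1001 = [{"alertname": "Disk space", "instance": "db1001", "source": "icinga"}]
from_alert2001 = [{"source": "icinga", "alertname": "Disk space", "instance": "db1001"}]

merged = deduplicate(from_alert1001 + from_alert2001)
print(len(merged))
```

If the two exporters attached any distinguishing label (for example, one naming the sending host), the label sets would differ and both copies would show up.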

My concern is an icinga "split brain", in which the passive host has firing alerts and the active host doesn't. Those alerts would reach alertmanager on alerts.w.o but not show up in icinga.w.o. Maybe it isn't a real issue, though! With that said, I personally think moving off Icinga should take priority over "active/active icinga", especially given how infrequently we do icinga/alert failovers.
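The "split brain" case can be pictured as a set difference: alerts firing on the passive host that the active host does not have. This is a minimal sketch under that reading, using made-up alert data; it does not reflect how icinga or the icinga-am exporter represent alerts internally.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

AlertKey = FrozenSet[Tuple[str, str]]


def alert_keys(alerts: Iterable[Dict[str, str]]) -> Set[AlertKey]:
    """Turn each alert's labels into a hashable, order-independent key."""
    return {frozenset(a.items()) for a in alerts}


def only_on_passive(
    active: Iterable[Dict[str, str]], passive: Iterable[Dict[str, str]]
) -> Set[AlertKey]:
    """Alerts the passive host reports that the active host does not."""
    return alert_keys(passive) - alert_keys(active)


# Hypothetical state: one shared alert, plus one firing only on the passive host.
shared = {"alertname": "Disk space", "instance": "db1001"}
stray = {"alertname": "puppet last run", "instance": "mw1234"}

divergent = only_on_passive(active=[shared], passive=[shared, stray])
print(len(divergent))
```

With both exporters running, an alert like `stray` would appear on alerts.w.o while icinga.w.o, which shows the active host, would not list it.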

Change 905244 abandoned by Herron:

[operations/puppet@production] icinga_exporter: run service on both active and standby hosts

Reason:

https://gerrit.wikimedia.org/r/905244

herron closed this task as Declined. May 2 2023, 6:42 PM

Maintenance_bot removed a project: Patch-For-Review. May 2 2023, 7:11 PM
