SystemdUnitFailed alerts are too noisy for data-persistence
Open, Needs Triage, Public

Assigned To

None

Authored By

	MatthewVernon
	Feb 12 2024, 5:24 PM

Description

Hi,

Over the weekend, we had a prometheus-mysqld-exporter failure that caused SystemdUnitFailed alerts to fire. Each alert was posted to our IRC channel #wikimedia-data-persistence and emailed to the team every four hours (until silenced).

This is far too noisy for our purposes; we would instead like SystemdUnitFailed alerts to email the team once and that's all. Our expectation is that anything that requires urgent action should have a more specific alert attached, and that a failed systemd unit generally requires attention, but not urgently. We are keen to avoid alerting overload.

It's not clear to me how this can be achieved, since AFAICT SystemdUnitFailed is set to critical severity site-wide (in team-sre/systemd.yaml), hence this ticket asking for your assistance please :)

Details

	Subject	Repo	Branch	Lines +/-
	alertmanager: re-notify for SystemdUnitFailed after 24h	operations/puppet	production	+5 -0


Event Timeline

MatthewVernon created this task. Feb 12 2024, 5:24 PM

Restricted Application added a subscriber: Aklapper. Feb 12 2024, 5:24 PM

lmata subscribed. Feb 12 2024, 6:17 PM

Thank you for reaching out; I generally agree with the rationale, and I'm ok to try a larger repeat_interval for SystemdUnitFailed. I'll send a patch to implement that for any SystemdUnitFailed alert regardless of team, though we can tune it as needed.
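
For readers unfamiliar with Alertmanager routing, a larger repeat_interval for a single alert name can be expressed roughly as below. This is only a sketch and not the contents of the patch: the receiver names `default` and `team-email` are hypothetical, and the top-level 4h value is an assumption based on the notification frequency reported in this task.

```yaml
# Illustrative Alertmanager routing fragment (assumed layout, not the real config).
route:
  receiver: default            # hypothetical fallback receiver
  repeat_interval: 4h          # assumed site-wide default, per the 4-hourly notifications
  routes:
    # Match SystemdUnitFailed alerts regardless of team.
    - matchers:
        - alertname="SystemdUnitFailed"
      receiver: team-email     # hypothetical receiver name
      # Re-notify for a still-firing alert only after 24h.
      repeat_interval: 24h
```

Note that repeat_interval only spaces out re-notifications for an alert group that is still firing; it does not by itself reduce notifications to a single email.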

Change 1003009 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: re-notify for SystemdUnitFailed after 24h

https://gerrit.wikimedia.org/r/1003009

gerritbot added a project: Patch-For-Review. Feb 13 2024, 3:34 PM

Change 1003009 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: re-notify for SystemdUnitFailed after 24h

https://gerrit.wikimedia.org/r/1003009

Maintenance_bot removed a project: Patch-For-Review. Feb 14 2024, 9:30 AM

Thanks, this is definitely a step in the right direction :)

@fgiunchedi did this change get undeployed somehow? We've had alerts every 4 hours about SystemdUnitFailed on db2202:9100 (since 19:32 UTC yesterday)...

The configuration hasn't changed, though we did upgrade to Bookworm, and with that came a new version of Alertmanager, so it might be a regression.

I can confirm that we're definitely getting alerts by email and IRC every 4 hours again now :(
(for wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100; this should be evident in the channel logs for #wikimedia-data-persistence)
