
SystemdUnitFailed alerts are too noisy for data-persistence
Open, Needs Triage, Public

Description

Hi,

Over the weekend, we had a prometheus-mysqld-exporter failure that caused SystemdUnitFailed alerts to fire. Each alert produced a post to our IRC channel #wikimedia-data-persistence and an email to the team every four hours (until silenced).

This is too noisy by far for our purposes; we would like instead for SystemdUnitFailed alerts to email the team once and that's all - our expectation is that anything that requires urgent action should have a more specific alert attached, and that generally a failed Systemd Unit is something that requires attention but not urgently. We are keen to avoid alerting overload.

It's not clear to me how this can be achieved, since AFAICT SystemdUnitFailed is set to critical severity site-wide (in team-sre/systemd.yaml), hence this ticket asking for your assistance please :)

Event Timeline

Thank you for reaching out; I generally agree with the rationale, and I'm ok to try a larger repeat_interval for SystemdUnitFailed. I'll send a patch to implement that for any SystemdUnitFailed alert regardless of team, though we can tune it as needed.
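For illustration, a longer repeat_interval scoped to a single alert name could look roughly like the Alertmanager routing fragment below. This is a minimal sketch and not the contents of the actual patch: the `default` receiver name and the flat route layout are assumptions, and the 4h top-level value is inferred from the "every four hours" behaviour described above (it is also Alertmanager's documented default).

```yaml
# Hypothetical, simplified routing tree; the real configuration is
# structured differently and lives in operations/puppet.
route:
  receiver: default            # assumed receiver name
  repeat_interval: 4h          # re-notify interval for everything else
  routes:
    # Child route matching only SystemdUnitFailed, for any team.
    - matchers:
        - alertname="SystemdUnitFailed"
      repeat_interval: 24h     # re-notify at most once a day while firing

receivers:
  - name: default
```

A child route inherits any setting it does not override from its parent, so only repeat_interval changes for the matched alerts.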

Change 1003009 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] alertmanager: re-notify for SystemdUnitFailed after 24h

https://gerrit.wikimedia.org/r/1003009

Change 1003009 merged by Filippo Giunchedi:

[operations/puppet@production] alertmanager: re-notify for SystemdUnitFailed after 24h

https://gerrit.wikimedia.org/r/1003009

Thanks, this is definitely a step in the right direction :)

@fgiunchedi did this change get undeployed somehow? We've had alerts every 4 hours about SystemdUnitFailed on db2202:9100 (since 19:32 UTC yesterday)...

The configuration hasn't changed, though we did upgrade to Bookworm, and with that came a new version of Alertmanager, so it might be a regression.

I can confirm that we're definitely getting alerts by email and IRC every 4 hours again now :(
(for wmf_auto_restart_prometheus-mysqld-exporter@x1.service on db2101:9100 - should be evident in the channel logs for #wikimedia-data-persistence)