Hi,
Over the weekend, we had a prometheus-mysqld-exporter failure that caused SystemdUnitFailed alerts to fire. This produced a post to our IRC channel #wikimedia-data-persistence and an email to the team every four hours (until silenced).
This is far too noisy for our purposes; we would instead like SystemdUnitFailed alerts to email the team once and no more. Our expectation is that anything requiring urgent action should have a more specific alert attached, and that a failed systemd unit generally needs attention, but not urgently. We are keen to avoid alert overload.
It's not clear to me how this can be achieved, since AFAICT SystemdUnitFailed is set to critical severity site-wide (in team-sre/systemd.yaml), hence this ticket asking for your assistance please :)