
SystemdUnitFailed alert aggregation issues
Open, Needs Triage, Public

Description

I think that the current rules for aggregating the SystemdUnitFailed alerts are hiding a lot of information. For example, the I/F team recently received this alert on IRC:

Wed 09:48:02   jinxer-wm| (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

This is what's hiding behind this alert:

  • There are only 22 alerts, but 44 are reported because of T353457
  • 21 of the 22 alerts are on pki1001 and only 1 on netbox1002, yet netbox1002 is the host shown in the alert
  • All alerts are for different unit names

Some concerns about the current behaviour:

  • It doesn't say what it is aggregating on, i.e. which parts of the alerts are common and which are not
  • Reporting one random host/unit_name combination doesn't give the user any useful information: the aggregated alerts could all indicate different problems. What do they have in common? What's the rationale for aggregating them this way?
  • I could see some usefulness in aggregating the same unit name across multiple hosts, provided the message clarifies that, so it's clear which unit is failing and on how many hosts, with one hostname given as an example
  • I'm not sure there is any value in aggregating multiple failing units on the same host: while it could indicate multiple failures on that host, it might also just lump together unrelated problems

Event Timeline

Thank you for the report. In general I agree we should be aggregating on the unit name itself; that would make the alert clearer. To achieve this we can change the grouping logic when routing alerts. I'll take a stab at it next week.
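
A minimal sketch of what the grouping change could look like in the Alertmanager routing config (the receiver name is hypothetical and the actual label carrying the unit name may differ; this assumes the node_exporter systemd collector's `name` label):

```yaml
# Hypothetical Alertmanager route sketch: group SystemdUnitFailed alerts by
# the failing unit name rather than by instance, so a single notification
# reads as "unit X failing on N hosts" instead of one arbitrary host/unit pair.
route:
  routes:
    - match:
        alertname: SystemdUnitFailed
      # 'name' is assumed to be the label holding the systemd unit name
      # (as exposed by node_exporter's systemd collector).
      group_by: ['alertname', 'name']
      receiver: irc-infrastructure   # hypothetical receiver
```

With this grouping, the 21 distinct failing units on pki1001 from the example above would produce separate notifications per unit, each listing every host where that unit is failing, rather than one opaque count of 44.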