
SystemdUnitFailed alert aggregation issues
Open, Needs Triage, Public

Description

I think that the current rules for aggregating the SystemdUnitFailed alerts are hiding a lot of information. For example, the I/F team recently received this alert on IRC:

Wed 09:48:02   jinxer-wm| (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

This is what's hiding behind this alert:

  • There are only 22 alerts, but 44 are reported because of T353457
  • 21 of the 22 alerts are on pki1001 and only 1 on netbox1002, yet netbox1002 is the host shown in the alert
  • All alerts are for different unit names

Some concerns about the current behaviour:

  • It doesn't say what it is aggregating on, i.e. which parts of the alerts are common and which are not
  • Reporting one random host/unit_name combination doesn't give the user any useful information: the aggregated alerts could all indicate different problems. What do they have in common? What's the rationale for aggregating them this way?
  • I could see some usefulness in aggregating the same unit name across multiple hosts, provided the message clarifies that, so it's clear which unit is failing and on how many hosts, with one hostname given as an example
  • I'm not sure there is any value in aggregating multiple failing units on the same host: while it could indicate multiple failures on that host, it might also just lump together unrelated problems

Event Timeline

Thank you for the report. In general I agree we should be aggregating on the unit name itself; that would make the alert clearer. To achieve this we can change the grouping logic when routing alerts. I'll take a stab at it next week.
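
A minimal sketch of what the grouping change could look like in the Alertmanager routing config (the receiver name is hypothetical and the actual label carrying the unit name may differ; this assumes the node_exporter systemd collector's `name` label):

```yaml
# Hypothetical Alertmanager route sketch: group SystemdUnitFailed alerts by
# the failing unit name rather than by instance, so a single notification
# reads as "unit X failing on N hosts" instead of one arbitrary host/unit pair.
route:
  routes:
    - match:
        alertname: SystemdUnitFailed
      # 'name' is assumed to be the label holding the systemd unit name
      # (as exposed by node_exporter's systemd collector).
      group_by: ['alertname', 'name']
      receiver: irc-infrastructure   # hypothetical receiver
```

With this grouping, the 21 distinct failing units on pki1001 from the example above would produce separate notifications per unit, each listing every host where that unit is failing, rather than one opaque count of 44.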