
De-noise systemd alerts (Reduce Icinga alert noise goal)
Open, Stalled, Needs Triage, Public

Description

Currently systemd is one of the top alert producers. This is a tracking task to improve the SNR of systemd monitoring/alerting.


Event Timeline

herron created this task. Aug 15 2019, 7:15 PM

Change 530442 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] check_systemd_state: downgrade 'degraded' status to warning

https://gerrit.wikimedia.org/r/530442

Since the systemd check is a secondary monitor (important services are monitored via dedicated service-specific checks) I think we can reduce the severity of the generic systemd alerts to warning and have them display in the Icinga UI, without alerting on IRC.
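To illustrate the proposed downgrade, here is a minimal sketch of the idea (this is not the actual check_systemd_state plugin; the function name and state mapping are illustrative):

```python
# Illustrative sketch only: map systemd's overall state, as reported by
# `systemctl is-system-running`, to a Nagios/Icinga exit code, with
# "degraded" downgraded from CRITICAL to WARNING.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def systemd_state_to_nagios(state):
    """Map `systemctl is-system-running` output to (exit code, label)."""
    mapping = {
        "running": (OK, "OK"),
        "starting": (OK, "OK"),
        "degraded": (WARNING, "WARNING"),     # previously treated as CRITICAL
        "maintenance": (CRITICAL, "CRITICAL"),
    }
    return mapping.get(state.strip(), (UNKNOWN, "UNKNOWN"))
```

With this mapping a degraded host still shows up in the Icinga UI as WARNING, but no longer pages or spams IRC.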

fgiunchedi moved this task from Inbox to In progress on the observability board. Aug 19 2019, 12:52 PM

Change 533282 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: aggregate systemd failed metrics

https://gerrit.wikimedia.org/r/533282

Change 533563 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: aggregate ipsec_status and add alert

https://gerrit.wikimedia.org/r/533563

Change 533282 merged by Herron:
[operations/puppet@production] prometheus: aggregate systemd failed metrics

https://gerrit.wikimedia.org/r/533282

Change 535697 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add alert for widespread systemd failed units

https://gerrit.wikimedia.org/r/535697

Joe added a subscriber: Joe. Sep 13 2019, 6:28 AM

It's not true that "important services are monitored via dedicated service-specific checks"; quite the contrary on a lot of systems. I would rather improve the systemd alert instead of silencing it, and maybe finally be done with those hacky checks for the number of running processes.

Any systemd unit failing is an issue that needs to either be managed or be well known.

Also, this is proposing to aggregate alerts across different clusters, which is even more worrisome.

So the way I'd go to make the systemd failed alerts not be noise is:

  • Report in the alert which units have failed
  • *maybe* aggregate similar alerts for the same cluster. But even that's debatable.
  • Allow a blacklist of units we don't want to alert about, that should be able to change per-server.

Thanks for this feedback @Joe it's quite helpful! Sorry to bottom quote so much!

It's not true that "important services are monitored via dedicated service-specific checks"; quite the contrary on a lot of systems. I would rather improve the systemd alert instead of silencing it, and maybe finally be done with those hacky checks for the number of running processes.

Maybe 's/are monitored/should be monitored/'? IMO systemd is a useful generic check to catch things that slip through the cracks, but it doesn't give a full picture of service health, only that a unit isn't failed.

Any systemd unit failing is an issue that needs to either be managed or be well known.

OK, if this is the case, our normal state would presumably be 0 failed units? We could set an alert threshold of >= 1 failed unit anywhere in the fleet, per site, or per cluster.

FWIW at the present time we have 13 failed units, which can be seen at a glance here https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1

Doable, but it would mean being more diligent about clearing failed units than at present.
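The aggregation could be sketched as Prometheus recording rules over node_exporter's node_systemd_unit_state metric; the rule names and the site/cluster labels (assumed to come from relabeling) are illustrative, not the actual production config:

```yaml
# Sketch only: aggregate failed-unit counts for alerting/dashboards.
groups:
  - name: systemd_aggregation
    rules:
      # failed units per host
      - record: instance:node_systemd_unit_state_failed:sum
        expr: sum by (instance) (node_systemd_unit_state{state="failed"})
      # failed units per site and cluster
      - record: site_cluster:node_systemd_unit_state_failed:sum
        expr: sum by (site, cluster) (node_systemd_unit_state{state="failed"})
```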

Also, this is proposing to aggregate alerts across different clusters, which is even more worrisome.
So the way I'd go to make the systemd failed alerts not be noise is:

  • Report in the alert which units have failed
  • *maybe* aggregate similar alerts for the same cluster. But even that's debatable.

Offhand I can think of a couple of possible approaches:

  • Per-site checks that aggregate clusters, with alert output to the effect of: (using codfw as an example)
cluster={labtest,misc,puppet} instance={labtestpuppetmaster2001:9100,netbox2001:9100,netflow2001:9100,puppetdb2002:9100} name={kafkatee.service,netbox_dump_run.service,postgresql@11-main.service,puppet-master.service} site=codfw
  • Per-cluster checks that aggregate units, alert output to the effect of (using labtest as an example):
cluster=labtest instance=labtestpuppetmaster2001:9100 name=puppet-master.service site=codfw

I'd be more inclined to start with something like the first, since that would translate to 5 icinga checks and aligns with the model used for other checks e.g. widespread puppet failures.

The second would be 65 checks which seems unwieldy.
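The per-site approach could be sketched as a Prometheus alerting rule along these lines; the rule name, threshold, and severity label here are assumptions, not the actual production rule:

```yaml
# Sketch only: one alert per site, firing when any unit in that site
# has been failed for a while.
- alert: SystemdFailedUnitsPerSite
  expr: sum by (site) (node_systemd_unit_state{state="failed"}) >= 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $value }} failed systemd unit(s) in {{ $labels.site }}"
```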

  • Allow a blacklist of units we don't want to alert about, that should be able to change per-server.

This I'd prefer to keep out of scope. To the earlier point of addressing or knowing of each failed unit, it seems we should then avoid situations where we are ignoring units.

With all that said it may make sense to use a layered approach here. Something like:

  • Grafana dashboard with a bird's-eye view
  • Per-host checks - UI only
  • Per-site checks as described above - alerting
  • Fleet-wide check with a higher threshold, to catch a significant increase in failures - alerting
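The fleet-wide layer could look something like the rule below; the threshold of 10 is an illustrative assumption, not a proposed value:

```yaml
# Sketch only: fleet-wide catch-all with a higher threshold, meant to
# catch a significant simultaneous increase in failed units.
- alert: WidespreadSystemdFailedUnits
  expr: sum(node_systemd_unit_state{state="failed"}) > 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $value }} systemd units failed fleet-wide"
```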

Change 536642 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add per-site systemd failed unit checks

https://gerrit.wikimedia.org/r/536642

Joe added a comment. Sep 25 2019, 10:48 AM
  • Allow a blacklist of units we don't want to alert about, that should be able to change per-server.

This I'd prefer to keep out of scope. To the earlier point of addressing or knowing of each failed unit, it seems we should then avoid situations where we are ignoring units.

Actually, there are alerts coming from units that fail for known reasons and that will self-heal without causing issues for the system. These fail because of bugs we know about but don't have the priority to fix, or that upstream hasn't solved yet, or that will only be solved in the next Debian release. I think this is important to avoid noise creeping up.

With all that said it may make sense to use a layered approach here. Something like:

  • Grafana dashboard with a bird's-eye view
  • Per-host checks - UI only
  • Per-site checks as described above - alerting

I think this approach is wrong. Per-site checks on systemd unit failures tell us nothing useful. I think we should keep per-host checks alerting as they are now, but specify which units are failing, and not alert if the unit is in a blacklist, which can for now be global.

Why do I think this? My approach to choosing whether to alert or not is "what will alert me about something that needs to be taken care of?".

So, say mcrouter crashes on mw12345; do I want to be alerted, or do I want to find out only when I go look at the Icinga UI? I would say the former.

Hence, I don't think in this case cluster-wide alerts make sense either.

  • Fleet-wide check with a higher threshold, to catch a significant increase in failures - alerting

Again, I don't think this would give me any actionable information, and would basically be noise.
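The per-host proposal above could be sketched as a single Prometheus rule: one alert per failed unit, with the unit name in the alert text and a global blacklist expressed as a negative regex match. The excluded unit names here are placeholders:

```yaml
# Sketch only: per-host, per-unit alert with a global unit blacklist.
- alert: SystemdUnitFailed
  expr: node_systemd_unit_state{state="failed",name!~"(apt-daily|example-flaky)\\.service"} == 1
  annotations:
    summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
```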

Joe added a comment. Sep 25 2019, 10:52 AM

Also:

It's not true that "important services are monitored via dedicated service specific checks", quite the contrary on a lot of systems, I would rather improve the systemd alert instead of silencing it, and maybe be finally done with using those hacky checks for the number of running processes.

Maybe 's/are monitored/should be monitored'? IMO systemd is a useful generic check to catch things that slip through the cracks, but doesn't give a full picture of service health, only that a unit isn't failed.

The way we monitor that services which are not public-facing are running is via check_procs, which is a really hacky way to do it.

To know if a service is running, or even if a timer has run correctly, I want to ask systemd; after all, we have an init system that stores the state of the individual units. IMHO it's a much better API for inspecting whether a unit is running as expected than grepping the process table by hand.
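The "ask systemd instead of grepping ps" idea can be sketched as parsing `systemctl show` output; the function names and the health rule here are illustrative, not the actual production checks:

```python
# Sketch only: decide unit health from the output of
#   systemctl show -p ActiveState -p SubState -p Result <unit>
# instead of grepping the process table with check_procs.

def parse_show(output):
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    return dict(line.split("=", 1) for line in output.strip().splitlines())

def unit_healthy(props):
    """A unit is healthy if it is active, or if it is a oneshot/timer
    job that is now inactive after a successful run."""
    if props.get("ActiveState") == "active":
        return True
    # oneshot units (e.g. timer jobs) end up inactive after success
    return props.get("ActiveState") == "inactive" and props.get("Result") == "success"
```

This handles timers naturally: a timer job that has finished successfully is `inactive` with `Result=success`, which check_procs would misreport as a missing process.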

herron changed the task status from Open to Stalled. Sep 25 2019, 2:19 PM

Considering this stalled for now. Let's revisit after some progress has been made with alertmanager aggregation. Better options for handling these types of alerts should be available to us then.

Joe added a comment. Sep 26 2019, 6:53 AM

Considering this stalled for now. Let's revisit after some progress has been made with alertmanager aggregation. Better options for handling these types of alerts should be available to us then.

I am sorry, but my proposal can be implemented, in the context of this task, without involving any form of aggregation.

As I said, the important parts of what would reduce noise can be implemented quite easily without the need for aggregation.

I think we should keep talking about this before moving to implement something. The more I think about it, failed units are a type of problem with an urgency that sits in between always and never alerting, and our current tooling doesn't handle this well.

Also, since we have a single check today covering any unit on any host, it's hard to assign a severity, so we have to assume the worst and alert for any failed unit. Adding service-specific checks in places where we are aware that the systemd unit check is the only monitoring could help, in that this would allow us to assign a lower severity to systemd alerts.

Regarding a blacklist to silence certain alerts -- this could help, but also is a need not limited to systemd checks, and is one of the features that an aggregation layer will provide in a generalized way. So best to come back to this in the future as well, IMO.

Change 536642 abandoned by Herron:
prometheus: add per-site systemd failed unit checks

Reason:
tabling for now

https://gerrit.wikimedia.org/r/536642

Change 530442 abandoned by Herron:
check_systemd_state: downgrade 'degraded' status to warning

Reason:
holding off on this for now

https://gerrit.wikimedia.org/r/530442

fgiunchedi moved this task from In progress to Inbox on the observability board. Nov 25 2019, 4:11 PM
fgiunchedi moved this task from Inbox to Backlog on the observability board. Dec 10 2019, 2:01 PM