
De-noise systemd alerts (Reduce Icinga alert noise goal)
Open, Stalled, Needs Triage, Public

Description

Currently systemd is one of the top alert producers. This is a tracking task to improve the SNR of systemd monitoring/alerting.


Event Timeline

herron created this task. Aug 15 2019, 7:15 PM

Change 530442 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] check_systemd_state: downgrade 'degraded' status to warning

https://gerrit.wikimedia.org/r/530442

Since the systemd check is a secondary monitor (important services are monitored via dedicated service-specific checks) I think we can reduce the severity of the generic systemd alerts to warning and have them display in the Icinga UI, without alerting on IRC.
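To illustrate the proposed downgrade, here is a minimal sketch of the idea (this is not the actual check_systemd_state plugin; the function name and state mapping are illustrative):

```python
# Illustrative sketch only: map systemd's overall state, as reported by
# `systemctl is-system-running`, to a Nagios/Icinga exit code, with
# "degraded" downgraded from CRITICAL to WARNING.

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

def systemd_state_to_nagios(state):
    """Map `systemctl is-system-running` output to (exit code, label)."""
    mapping = {
        "running": (OK, "OK"),
        "starting": (OK, "OK"),
        "degraded": (WARNING, "WARNING"),     # previously treated as CRITICAL
        "maintenance": (CRITICAL, "CRITICAL"),
    }
    return mapping.get(state.strip(), (UNKNOWN, "UNKNOWN"))
```

With this mapping a degraded host still shows up in the Icinga UI as WARNING, but no longer pages or spams IRC.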

fgiunchedi moved this task from Inbox to In progress on the observability board. Aug 19 2019, 12:52 PM

Change 533282 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: aggregate systemd failed metrics

https://gerrit.wikimedia.org/r/533282

Change 533563 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: aggregate ipsec_status and add alert

https://gerrit.wikimedia.org/r/533563

Change 533282 merged by Herron:
[operations/puppet@production] prometheus: aggregate systemd failed metrics

https://gerrit.wikimedia.org/r/533282

Change 535697 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add alert for widespread systemd failed units

https://gerrit.wikimedia.org/r/535697

Joe added a subscriber: Joe. Sep 13 2019, 6:28 AM

It's not true that "important services are monitored via dedicated service-specific checks"; quite the contrary on a lot of systems. I would rather improve the systemd alert instead of silencing it, and maybe finally be done with those hacky checks for the number of running processes.

Any systemd unit failing is an issue that needs to either be managed or be well known.

Also, this is proposing to aggregate alerts across different clusters, which is even more worrisome.

So the way I'd go to make the systemd failed alerts not be noise is:

  • Report in the alert which units have failed
  • *maybe* aggregate similar alerts for the same cluster. But even that's debatable.
  • Allow a blacklist of units we don't want to alert about, that should be able to change per-server.

Thanks for this feedback @Joe it's quite helpful! Sorry to bottom quote so much!

It's not true that "important services are monitored via dedicated service-specific checks"; quite the contrary on a lot of systems. I would rather improve the systemd alert instead of silencing it, and maybe finally be done with those hacky checks for the number of running processes.

Maybe 's/are monitored/should be monitored/'? IMO systemd is a useful generic check to catch things that slip through the cracks, but it doesn't give a full picture of service health, only that a unit isn't failed.

Any systemd unit failing is an issue that needs to either be managed or be well known.

OK, if this is the case, our normal state would presumably be 0 failed units? We could set an alert threshold of >= 1 failed unit anywhere in the fleet, per site, or per cluster.

FWIW at the present time we have 13 failed units, which can be seen at a glance here https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1

Doable, but it would mean being more diligent about clearing failed units than at present.
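The aggregation could be sketched as Prometheus recording rules over node_exporter's node_systemd_unit_state metric; the rule names and the site/cluster labels (assumed to come from relabeling) are illustrative, not the actual production config:

```yaml
# Sketch only: aggregate failed-unit counts for alerting/dashboards.
groups:
  - name: systemd_aggregation
    rules:
      # failed units per host
      - record: instance:node_systemd_unit_state_failed:sum
        expr: sum by (instance) (node_systemd_unit_state{state="failed"})
      # failed units per site and cluster
      - record: site_cluster:node_systemd_unit_state_failed:sum
        expr: sum by (site, cluster) (node_systemd_unit_state{state="failed"})
```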

Also, this is proposing to aggregate alerts across different clusters, which is even more worrisome.
So the way I'd go to make the systemd failed alerts not be noise is:

  • Report in the alert which units have failed
  • *maybe* aggregate similar alerts for the same cluster. But even that's debatable.

Offhand I can think of a couple of possible approaches:

  • Per-site checks that aggregate clusters, with alert output to the effect of: (using codfw as an example)
cluster={labtest,misc,puppet} instance={labtestpuppetmaster2001:9100,netbox2001:9100,netflow2001:9100,puppetdb2002:9100} name={kafkatee.service,netbox_dump_run.service,postgresql@11-main.service,puppet-master.service} site=codfw
  • Per-cluster checks that aggregate units, alert output to the effect of (using labtest as an example):
cluster=labtest instance=labtestpuppetmaster2001:9100 name=puppet-master.service site=codfw

I'd be more inclined to start with something like the first, since that would translate to 5 icinga checks and aligns with the model used for other checks e.g. widespread puppet failures.

The second would be 65 checks which seems unwieldy.
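The per-site approach could be sketched as a Prometheus alerting rule along these lines; the rule name, threshold, and severity label here are assumptions, not the actual production rule:

```yaml
# Sketch only: one alert per site, firing when any unit in that site
# has been failed for a while.
- alert: SystemdFailedUnitsPerSite
  expr: sum by (site) (node_systemd_unit_state{state="failed"}) >= 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "{{ $value }} failed systemd unit(s) in {{ $labels.site }}"
```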

  • Allow a blacklist of units we don't want to alert about, that should be able to change per-server.

This I'd prefer to keep out of scope. To the earlier point of addressing or knowing of each failed unit, it seems we should then avoid situations where we are ignoring units.

With all that said it may make sense to use a layered approach here. Something like:

  • Grafana dashboard with a bird's-eye view
  • Per-host checks - UI only
  • Per-site checks as described above - alerting
  • Fleet-wide check with a higher threshold, to catch a significant increase in failures - alerting
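The fleet-wide layer could look something like the rule below; the threshold of 10 is an illustrative assumption, not a proposed value:

```yaml
# Sketch only: fleet-wide catch-all with a higher threshold, meant to
# catch a significant simultaneous increase in failed units.
- alert: WidespreadSystemdFailedUnits
  expr: sum(node_systemd_unit_state{state="failed"}) > 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "{{ $value }} systemd units failed fleet-wide"
```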

Change 536642 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] prometheus: add per-site systemd failed unit checks

https://gerrit.wikimedia.org/r/536642

Joe added a comment. Sep 25 2019, 10:48 AM
  • Allow a blacklist of units we don't want to alert about, that should be able to change per-server.

This I'd prefer to keep out of scope. To the earlier point of addressing or knowing of each failed unit, it seems we should then avoid situations where we are ignoring units.

Actually, there are alerts coming from units that fail for known reasons and that will self-heal without causing issues for the system. These fail because of bugs we know about but don't have the priority to fix, or that upstream hasn't solved yet, or that will only be solved in the next Debian release. I think this is important to avoid noise creeping up.

With all that said it may make sense to use a layered approach here. Something like:

  • Grafana dashboard with a bird's-eye view
  • Per-host checks - UI only
  • Per-site checks as described above - alerting

I think this approach is wrong. Per-site checks on systemd unit failures tell us nothing useful. I think we should keep per-host checks alerting as they are now, but specify which units are failing, and not alert if the unit is in a blacklist, which can for now be global.

Why do I think this? My approach to choosing whether to alert or not is "what will alert me about something that needs to be taken care of?".

So, say mcrouter crashes on mw12345; do I want to be alerted, or do I want to find out only when I go look at the Icinga UI? I would say the former.

Hence, I don't think in this case cluster-wide alerts make sense either.

  • Fleet-wide check with a higher threshold, to catch a significant increase in failures - alerting

Again, I don't think this would give me any actionable information, and would basically be noise.
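The per-host proposal above could be sketched as a single Prometheus rule: one alert per failed unit, with the unit name in the alert text and a global blacklist expressed as a negative regex match. The excluded unit names here are placeholders:

```yaml
# Sketch only: per-host, per-unit alert with a global unit blacklist.
- alert: SystemdUnitFailed
  expr: node_systemd_unit_state{state="failed",name!~"(apt-daily|example-flaky)\\.service"} == 1
  annotations:
    summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
```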

Joe added a comment. Sep 25 2019, 10:52 AM

Also:

It's not true that "important services are monitored via dedicated service specific checks", quite the contrary on a lot of systems, I would rather improve the systemd alert instead of silencing it, and maybe be finally done with using those hacky checks for the number of running processes.

Maybe 's/are monitored/should be monitored'? IMO systemd is a useful generic check to catch things that slip through the cracks, but doesn't give a full picture of service health, only that a unit isn't failed.

The way we monitor that services which are not public-facing are running is via check_procs, which is a really hacky way to do it.

To know if a service is running, or even if a timer has run correctly, I want to ask systemd; after all, we have an init system that stores the state of the individual units. IMHO it's a much better API for inspecting whether a unit is running as expected than grepping the process table by hand.
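The "ask systemd instead of grepping ps" idea can be sketched as parsing `systemctl show` output; the function names and the health rule here are illustrative, not the actual production checks:

```python
# Sketch only: decide unit health from the output of
#   systemctl show -p ActiveState -p SubState -p Result <unit>
# instead of grepping the process table with check_procs.

def parse_show(output):
    """Parse `systemctl show` KEY=VALUE lines into a dict."""
    return dict(line.split("=", 1) for line in output.strip().splitlines())

def unit_healthy(props):
    """A unit is healthy if it is active, or if it is a oneshot/timer
    job that is now inactive after a successful run."""
    if props.get("ActiveState") == "active":
        return True
    # oneshot units (e.g. timer jobs) end up inactive after success
    return props.get("ActiveState") == "inactive" and props.get("Result") == "success"
```

This handles timers naturally: a timer job that has finished successfully is `inactive` with `Result=success`, which check_procs would misreport as a missing process.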

herron changed the task status from Open to Stalled. Sep 25 2019, 2:19 PM

Considering this stalled for now. Let's revisit after some progress has been made with alertmanager aggregation. Better options for handling these types of alerts should be available to us then.

Joe added a comment. Sep 26 2019, 6:53 AM

Considering this stalled for now. Let's revisit after some progress has been made with alertmanager aggregation. Better options for handling these types of alerts should be available to us then.

I am sorry, but my proposal can be implemented, in the context of this task, without involving any form of aggregation.

As I said, the important parts of what would reduce noise can be implemented quite easily without the need for aggregation.

I think we should keep talking about this before moving to implement something. The more I think about it, failed units are a type of problem with an urgency that sits in between always and never alerting, and our current tooling doesn't handle this well.

Also, since we have a single check today covering any unit on any host, it's hard to assign a severity, so we have to assume the worst and alert for any failed unit. Adding service-specific checks in places where we are aware that the systemd unit check is the only monitoring could help, in that this would allow us to assign a lower severity to systemd alerts.

Regarding a blacklist to silence certain alerts -- this could help, but also is a need not limited to systemd checks, and is one of the features that an aggregation layer will provide in a generalized way. So best to come back to this in the future as well, IMO.

Change 536642 abandoned by Herron:
prometheus: add per-site systemd failed unit checks

Reason:
tabling for now

https://gerrit.wikimedia.org/r/536642

Change 530442 abandoned by Herron:
check_systemd_state: downgrade 'degraded' status to warning

Reason:
holding off on this for now

https://gerrit.wikimedia.org/r/530442

fgiunchedi moved this task from In progress to Inbox on the observability board. Nov 25 2019, 4:11 PM
fgiunchedi moved this task from Inbox to Backlog on the observability board. Dec 10 2019, 2:01 PM