Route systemd unit alerts to the correct team
Closed, Resolved (Public)

Description

Currently, alerts for failed systemd units are routed to specific teams based on the role owner. In many cases this is correct; however, there are some instances where it is not. One example is the various httpbb_* checks that run on the cumin host. These checks belong to the service ops team, but they run on a machine owned by the Infrastructure foundations team.

One solution for these specific alerts would be to move the checks to a machine owned by service ops, which would work if this were the only such alert. However, I suspect there are others, so it would be nice to somehow send a signal from the systemd unit to override the owner.

Event Timeline

Thanks for the task! I think another potential use case is the docker-reporter* units on the build host.

Indeed, thank you @jbond for the task. The "signal" we could send is to associate a team with a unit at the Puppet level, then gather that information at the host level and export it as a per-unit metric, akin to role_owner (e.g. systemd_unit_owner). The logic at the alert level would then be similar to that for role_owner, plus catering for the fact that a missing systemd_unit_owner should fall back to role_owner.
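
To make the fallback concrete, here is a minimal Python sketch of the routing decision described above. It is only an illustration of the logic, not the actual alert rule: the function name, the unit names and the team labels are all hypothetical, and the real implementation would live in the alerting rules rather than in Python.

```python
from typing import Mapping, Optional


def resolve_alert_team(
    unit: str,
    role_owner: str,
    unit_owners: Mapping[str, str],
) -> str:
    """Return the team that should receive an alert for `unit`.

    `unit_owners` stands in for the per-unit owner metric
    (systemd_unit_owner); `role_owner` is the team owning the host role.
    A unit with no explicit owner falls back to the role owner.
    """
    unit_owner: Optional[str] = unit_owners.get(unit)
    # A missing or empty per-unit owner means "no override".
    return unit_owner if unit_owner else role_owner


# Hypothetical data: one unit on the host carries an explicit owner.
unit_owners = {"httpbb_example.service": "serviceops"}

# Overridden unit goes to its own team, not the host's role owner.
overridden = resolve_alert_team(
    "httpbb_example.service", "infrastructure-foundations", unit_owners
)

# Any other unit on the same host still goes to the role owner.
fallback = resolve_alert_team(
    "some_other.service", "infrastructure-foundations", unit_owners
)
```

The point of the fallback is that existing units need no changes: only units whose owner differs from the role owner have to declare one.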

Change 968293 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] systemd::service: Add service owner parameter

https://gerrit.wikimedia.org/r/968293

Change 968293 merged by Jbond:

[operations/puppet@production] systemd::service: Add service owner parameter

https://gerrit.wikimedia.org/r/968293

Change 969121 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] docker::reporter: route k8s alerts to service ops

https://gerrit.wikimedia.org/r/969121

Change 969121 merged by Jbond:

[operations/puppet@production] docker::reporter: route k8s alerts to service ops

https://gerrit.wikimedia.org/r/969121

Change 969312 had a related patch set uploaded (by Jbond; author: jbond):

[operations/alerts@master] team-sre/systemd: update systemd checks to make use of systemd_unit_owner

https://gerrit.wikimedia.org/r/969312

Change 969312 merged by Jbond:

[operations/alerts@master] team-sre/systemd: update systemd checks to make use of systemd_unit_owner

https://gerrit.wikimedia.org/r/969312

jbond claimed this task.

We now have a team parameter on systemd::unit, systemd::service and systemd::timer::job, which should allow us to route alerts correctly. This has already been configured for the units mentioned in this task.

Change 970402 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/alerts@master] team-sre: ignore systemd_unit_.+_owner stale textfile

https://gerrit.wikimedia.org/r/970402

Change 970402 merged by Filippo Giunchedi:

[operations/alerts@master] team-sre: ignore systemd_unit_.+_owner stale textfile

https://gerrit.wikimedia.org/r/970402