
Periodic job alerting
Closed, ResolvedPublic

Description

With the migration to MW-on-K8s and Kubernetes CronJobs, we have to find a way to alert correctly when a job fails.

On the Kubernetes side:
kube-state-metrics exports the kube_job_failed metric; create a Prometheus alert when a job fails.

This metric is at the Job level, not the CronJob level, which means we can't aggregate failure counts per CronJob. It would be nice to add the following Kubernetes labels to the information gathered by kube-state-metrics: script, cronjob, and team, as well as passing the comment annotation down to the Job (it should probably be a label instead).

Then we would need to aggregate failures on the cronjob label.

The team label would ideally be a phabricator project tag, that can then be used in alertmanager to open a task to the right project on CronJob failure.
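A Prometheus rule along these lines could implement the aggregation. This is only a sketch: the alert name, threshold, and severity are illustrative, and the join assumes the extra labels get exported by kube-state-metrics on kube_job_labels.

```yaml
groups:
  - name: periodic-jobs
    rules:
      - alert: PeriodicJobFailed
        # kube_job_failed is per-Job; join it with kube_job_labels to pull in
        # the cronjob/team labels, then aggregate failures per CronJob.
        expr: >
          sum by (namespace, label_cronjob, label_team) (
            kube_job_failed{condition="true"}
            * on (namespace, job_name) group_left (label_cronjob, label_team)
            kube_job_labels
          ) > 0
        for: 5m
        labels:
          severity: task
        annotations:
          summary: 'Periodic job {{ $labels.label_cronjob }} failed'
```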

Details

Related Changes in Gerrit:

Event Timeline

Clement_Goubert triaged this task as Medium priority.

@kamila do you think you could help with that?

Sure, I'll prepare a patch that adds those labels. As for the comment, it looks like I can also export an annotation as a Prometheus annotation; I can include that too.

In T385709#10526044, @kamila wrote:

For the comment, it looks like I can also export an annotation as a Prometheus annotation, I can include that too.

Ah, that would be great, thank you very much; then I'd just need to pass the annotation down to the Job.

Change #1117574 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] kube-state-metrics: export extra jobs labels

https://gerrit.wikimedia.org/r/1117574
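For reference, kube-state-metrics can opt specific object labels and annotations into its metrics via allowlist flags. A fragment like the following (label names taken from the description above; the exact chart wiring is an assumption) would expose them on kube_job_labels / kube_job_annotations:

```yaml
# Hypothetical kube-state-metrics configuration fragment.
# Allowlisted labels show up as label_<name> on kube_job_labels;
# annotations show up as annotation_<name> on kube_job_annotations.
extraArgs:
  - --metric-labels-allowlist=jobs=[script,cronjob,team]
  - --metric-annotations-allowlist=jobs=[comment]
```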

Change #1117895 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Pass down CronJob description to Job

https://gerrit.wikimedia.org/r/1117895
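The mechanism this relies on: anything set under spec.jobTemplate.metadata on a CronJob is copied by Kubernetes onto every Job it creates. An illustrative manifest (all names and values are placeholders, not the actual mediawiki chart):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-maintenance-script   # placeholder name
spec:
  schedule: "*/30 * * * *"
  jobTemplate:
    metadata:
      labels:
        cronjob: example-maintenance-script  # CronJob name, for aggregation
        team: example-team
      annotations:
        description: "What this periodic job does"
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: script
              image: example/image:latest
```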

Change #1117574 merged by jenkins-bot:

[operations/deployment-charts@master] kube-state-metrics: export extra jobs labels

https://gerrit.wikimedia.org/r/1117574

Change #1117895 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Pass down CronJob description to Job

https://gerrit.wikimedia.org/r/1117895

Change #1122563 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: CronJob name as Job label

https://gerrit.wikimedia.org/r/1122563

Hmm, so it's obviously not as simple as I thought it would be. If I understand our Alertmanager configuration correctly, I'd need to set up a route and receivers for every team that doesn't already have one.

Although I am not sure why our configuration uses the numerical PHID for projects; we could maybe implement a more general routing configuration that sends to alerts?project={{$labels.label_team}}.

SRE Observability, any thoughts on whether/how we can implement this?

Change #1122563 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: CronJob name as Job label

https://gerrit.wikimedia.org/r/1122563

IIRC, the PHIDs are there because there is almost never a 1:1 mapping between a team label and the corresponding Phabricator project. Using PHIDs also makes it easier for teams to re-route their tasks to a different place, e.g. during reorgs. Also, if the Conduit query for a project returns multiple PHIDs, only the first one is considered a match.

As for implementation: we certainly can; the easiest path is to onboard the teams that don't already have a receiver in Alertmanager. How many teams would that be ATM?

Ah right. Although in that case it would not necessarily be teams (yes, I chose the label poorly) but most probably projects, since that matches maintenance scripts more closely.

For now I've identified around 15 teams as being responsible for at least one periodic job.

Some of them are probably already onboarded; I also have a bunch of jobs whose ownership I haven't yet identified. You can find the full list of jobs here if you know some of them; the team list is on the second sheet.

Thank you, that explains it. For this use case, then, we can route the alerts themselves in Alertmanager to a single receiver that creates tasks with project=label, as you outlined.

Ok, around 15 seems a manageable number; I'm happy to help with the routing bits. Unfortunately I can't help with tracking down ownership :(
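The per-team onboarding boils down to a route plus a task-creating receiver in Alertmanager; roughly the following, where the team label, receiver name, and webhook URL are placeholders rather than the production configuration:

```yaml
route:
  routes:
    - match:
        severity: task
        team: example-team        # placeholder team label
      receiver: example-team-task
receivers:
  - name: example-team-task
    webhook_configs:
      # Placeholder endpoint for a bridge service that files Phabricator
      # tasks against the team's project PHID.
      - url: 'http://localhost:8292/alerts'
```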

Change #1131025 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Add mediawiki-platform-task

https://gerrit.wikimedia.org/r/1131025

Change #1131356 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] team-sre: Add mw-cron alerting

https://gerrit.wikimedia.org/r/1131356

Change #1131025 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Add mediawiki-platform-task

https://gerrit.wikimedia.org/r/1131025

Change #1131356 merged by jenkins-bot:

[operations/alerts@master] team-sre: Add mw-cron alerting

https://gerrit.wikimedia.org/r/1131356

Change #1131680 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mw-cron: Add warning for serviceops dashboard

https://gerrit.wikimedia.org/r/1131680

Change #1131680 merged by jenkins-bot:

[operations/alerts@master] mw-cron: Add warning for serviceops dashboard

https://gerrit.wikimedia.org/r/1131680

Change #1132673 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Route task-level GrowthExperiments alerts

https://gerrit.wikimedia.org/r/1132673

Change #1133096 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mw-cron: Fix kubectl invocation in alert description

https://gerrit.wikimedia.org/r/1133096

Change #1133096 merged by jenkins-bot:

[operations/alerts@master] mw-cron: Fix kubectl invocation in alert description

https://gerrit.wikimedia.org/r/1133096

Change #1132673 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Route task-level GrowthExperiments alerts

https://gerrit.wikimedia.org/r/1132673

Change #1134696 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] alertmanager: add task receivers for 4 teams

https://gerrit.wikimedia.org/r/1134696

Change #1135005 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] alertmanager: route T&S tasks to their Slack

https://gerrit.wikimedia.org/r/1135005

Updated https://wikitech.wikimedia.org/wiki/Periodic_jobs#Monitoring to be clearer on what alerting is set up, and when the probe fires.

ttlSecondsAfterFinished is set to 1 day by default, which means the alert will auto-resolve 1 day after the last failed run. This will not close a Phabricator task, but it will make the warning in the serviceops dashboard disappear. We should probably tune that at some point (each job can already set its own ttlSecondsAfterFinished in the chart, but the Puppet code to add it to the YAML definition isn't there yet).
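Once the chart plumbing exists, tuning would mean setting the TTL on the Job template, e.g. to keep failed Jobs (and therefore the firing alert) around for a week; a sketch:

```yaml
spec:
  jobTemplate:
    spec:
      # Keep finished Jobs for 7 days instead of 1, so kube_job_failed
      # (and the alert) persists longer after the last failed run.
      ttlSecondsAfterFinished: 604800
```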

Change #1135040 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mwcron: Allow setting ttlsecondsafterfinished

https://gerrit.wikimedia.org/r/1135040

Change #1135413 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] alertmanager: add route for task-severity data-persistence alerts

https://gerrit.wikimedia.org/r/1135413

Change #1135418 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] alertmanager: Route 3 teams' task-severity alerts to Phab

https://gerrit.wikimedia.org/r/1135418

Change #1135040 merged by Clément Goubert:

[operations/puppet@production] mwcron: Allow setting ttlsecondsafterfinished

https://gerrit.wikimedia.org/r/1135040

Change #1135413 abandoned by Clément Goubert:

[operations/puppet@production] alertmanager: add route for task-severity data-persistence alerts

Reason:

Added in Ibbdbb4d46b4684649c4800a81b8835bae0bdcc79

https://gerrit.wikimedia.org/r/1135413

Change #1134696 merged by Clément Goubert:

[operations/puppet@production] alertmanager: add task receivers for 4 teams

https://gerrit.wikimedia.org/r/1134696

Change #1135418 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Route 4 teams' task-severity alerts to Phab

https://gerrit.wikimedia.org/r/1135418

Change #1135753 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Add team/project receivers for Phab

https://gerrit.wikimedia.org/r/1135753

Change #1135754 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Add routing for task alerts

https://gerrit.wikimedia.org/r/1135754

Change #1135753 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Add team/project receivers for Phab

https://gerrit.wikimedia.org/r/1135753

Change #1135754 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Add routing for task alerts

https://gerrit.wikimedia.org/r/1135754

Change #1135779 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Fix bad indentation

https://gerrit.wikimedia.org/r/1135779

Change #1135779 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Fix bad indentation

https://gerrit.wikimedia.org/r/1135779

Clement_Goubert added a subscriber: taavi.

All teams and tags are now added to AlertManager. I'm keeping the task open to attach any further changes/improvements to how we monitor and alert for mw-cron.

Clement_Goubert changed the task status from Open to In Progress. Apr 11 2025, 11:13 AM

Change #1135005 merged by Kamila Součková:

[operations/puppet@production] alertmanager: route T&S tasks to their Slack

https://gerrit.wikimedia.org/r/1135005

Change #1147699 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] am: mw-cron: Add Wikimedia-production-error tag

https://gerrit.wikimedia.org/r/1147699

Change #1147702 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Use alert summary as title for most tasks

https://gerrit.wikimedia.org/r/1147702

Change #1147699 merged by Clément Goubert:

[operations/puppet@production] am: mw-cron: Add Wikimedia-production-error tag

https://gerrit.wikimedia.org/r/1147699

Change #1147702 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Use alert summary as title for most tasks

https://gerrit.wikimedia.org/r/1147702

Alerting is set up for all jobs, so I'm resolving this; we can attach future improvements to a new task that isn't migration-specific.