Improve alerting for flaky mw-cron jobs
Closed, Resolved (Public)

Description

To refine our alerting for mw-cron jobs, allow setting successfulJobsHistoryLimit and failedJobsHistoryLimit in the mediawiki Helm chart.

When combined with a well-scoped ttlSecondsAfterFinished, this would let us override the general alert (which fires as soon as one of the jobs retained under failedJobsHistoryLimit has failed) so that it fires only when a set number of the jobs that finished within the last ttlSecondsAfterFinished seconds have failed.
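
As a rough illustration of the alert condition described above, here is a minimal Python sketch. It is not the actual alerting rule. The `FinishedJob` type, the `should_alert` function, the job names, and the threshold values are all hypothetical and chosen only for this example. It assumes finished jobs stay visible for ttlSecondsAfterFinished seconds, and that failedJobsHistoryLimit is at least as large as the threshold, since Kubernetes keeps no more failed jobs than that limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FinishedJob:
    """Hypothetical record of one finished run of a periodic job."""
    name: str
    failed: bool
    finished_at: datetime


def should_alert(jobs, now, ttl_seconds, failure_threshold):
    """Return True when at least `failure_threshold` of the jobs that
    finished within the last `ttl_seconds` seconds have failed."""
    window_start = now - timedelta(seconds=ttl_seconds)
    recent_failures = [
        job for job in jobs
        if job.failed and job.finished_at >= window_start
    ]
    return len(recent_failures) >= failure_threshold


# Illustrative data only: two failures inside a one-hour window,
# plus one older failure that falls outside it.
now = datetime(2025, 6, 12, 12, 0, tzinfo=timezone.utc)
jobs = [
    FinishedJob("example-job-1", failed=True, finished_at=now - timedelta(minutes=50)),
    FinishedJob("example-job-2", failed=False, finished_at=now - timedelta(minutes=30)),
    FinishedJob("example-job-3", failed=True, finished_at=now - timedelta(minutes=10)),
    FinishedJob("example-job-old", failed=True, finished_at=now - timedelta(hours=3)),
]

# A single failure would trip the general alert; the refined rule
# only fires once the chosen threshold is reached.
print(should_alert(jobs, now, ttl_seconds=3600, failure_threshold=2))
print(should_alert(jobs, now, ttl_seconds=3600, failure_threshold=3))
```

With a threshold of 2 the sketch reports an alert, and with a threshold of 3 it does not, because the failure from three hours earlier is outside the window.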

Event Timeline

Clement_Goubert changed the task status from Open to In Progress.
Clement_Goubert triaged this task as Medium priority.

Change #1156288 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add job history limit control

https://gerrit.wikimedia.org/r/1156288

Change #1156288 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add job history limit control

https://gerrit.wikimedia.org/r/1156288

Mentioned in SAL (#wikimedia-operations) [2025-06-12T10:33:55Z] <cgoubert@deploy1003> Started scap sync-world: 1156288: mediawiki: Add job history limit control - T395885

Mentioned in SAL (#wikimedia-operations) [2025-06-12T10:36:43Z] <cgoubert@deploy1003> Finished scap sync-world: 1156288: mediawiki: Add job history limit control - T395885 (duration: 02m 48s)

Change #1156295 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_job: Add job history limit control

https://gerrit.wikimedia.org/r/1156295

Change #1156295 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_job: Add job history limit control

https://gerrit.wikimedia.org/r/1156295