
Do not alert about a failed cron job when logs are already discarded
Open, Low, Public

Description

Today, I discovered Growth-Team received several MediaWikiCronJobFailed alerts. Following the manual at https://wikitech.wikimedia.org/wiki/Mw-cron_jobs#Troubleshooting, I checked the status:

[urbanecm@deploy2002 ~]$ kubectl get jobs --field-selector status.successful=0 -l team=growth
NAME                                                          STATUS    COMPLETIONS   DURATION   AGE
growthexperiments-fixlinkrecommendationdata-dryrun-29465720   Running   0/1           96m        96m
growthexperiments-listtaskcounts-29465771                     Running   0/1           45m        45m
growthexperiments-refreshlinkrecommendations-s3-29465427      Running   0/1           6h29m      6h29m
growthexperiments-refreshlinkrecommendations-s5-29465787      Running   0/1           14m        14m
growthexperiments-updatementeedata-s1-29460615                Failed    0/1           3d14h      3d14h
[urbanecm@deploy2002 ~]$ kubectl logs job/growthexperiments-updatementeedata-s1-29460615 mediawiki-main-app
unable to retrieve container logs for containerd://ddfbb3217271b9f6fbabd6bbed0a71798c0e1651f9d90949601b7d38ce6519fb[urbanecm@deploy2002 ~]$

This leaves me unable to figure out what actually happened, as I no longer have access to the logs. While I understand the need to purge logs, I don't understand:

  • why we need to do it so quickly (3 days after the job fails),
  • why we continue to send alerts even though the failure logs have already been discarded (and are inaccessible via the documented methods)

Would it be possible to align the job deletion timeline with the log retention timeline? That way, the alert would remain actionable for as long as it exists (and continues to fire).
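If these jobs are standard Kubernetes CronJobs, one way to keep failed Job objects (and the alerts that fire on them) aligned with log retention would be the CronJob history and TTL settings. A minimal sketch, assuming a three-day log retention window; the schedule, image, and command are illustrative placeholders, not the production configuration:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: growthexperiments-updatementeedata-s1   # name taken from the failing job above
spec:
  schedule: "*/15 * * * *"                      # illustrative, not the real schedule
  failedJobsHistoryLimit: 1                     # keep only the most recent failed Job
  jobTemplate:
    spec:
      # Delete the finished Job (and thereby stop the alert firing on it) once
      # the container logs would have been discarded anyway.
      # 259200 s = 3 days, matching the observed log retention window.
      ttlSecondsAfterFinished: 259200
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mediawiki-main-app
              image: example/mediawiki:latest   # placeholder image
              command: ["php", "maintenance/run.php", "updateMenteeData"]  # illustrative
```

With `ttlSecondsAfterFinished` set to the log retention period, a failed Job is garbage-collected at roughly the same time its logs become unavailable, so an alert never outlives the evidence needed to act on it.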

Event Timeline

I noticed Logstash has the error logs. They are very hard to work with (every line of the traceback has its own entry, and they are not in order), see screenshot:

image.png (1×3 px, 603 KB)

Still, this at least makes it easy to conclude what happened: the maintenance script failed to connect to etcd due to a timeout. That makes me realise another issue with mw-cron failure reporting: most of the alerts we receive are unactionable for us, because they are infrastructure-driven rather than code-driven. Would it be possible to either reduce the frequency of such errors, or to somehow hide them from teams focusing on MediaWiki itself?
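If these alerts are routed through Alertmanager, one way to hide infrastructure-driven failures from MediaWiki-focused teams would be an inhibition rule that suppresses MediaWikiCronJobFailed while a matching infrastructure alert is firing. A sketch under that assumption; apart from MediaWikiCronJobFailed, the alert and label names below are hypothetical:

```yaml
# alertmanager.yml fragment (hypothetical alert/label names except MediaWikiCronJobFailed)
inhibit_rules:
  - source_matchers:
      - alertname = EtcdConnectivityDegraded   # hypothetical infrastructure alert
    target_matchers:
      - alertname = MediaWikiCronJobFailed
    # Only inhibit when both alerts concern the same site/cluster
    equal: ['site']
```

This would not reduce the underlying failure rate, but it would keep cron-job alerts out of team inboxes whenever the root cause is already tracked by an infrastructure alert.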

Mentioned in SAL (#wikimedia-operations) [2026-01-09T09:07:18Z] <urbanecm> [urbanecm@deploy2002 ~]$ kubectl delete job/growthexperiments-updatementeedata-s1-29460615 # T414167

Since the error has now been investigated, I manually deleted the job to reset the alerting.

Blake triaged this task as Low priority. Feb 10 2026, 2:14 PM
Blake edited projects, added ServiceOps new; removed serviceops-deprecated.
Blake moved this task from Inbox to Backlog on the ServiceOps new board.