Today, I noticed that Growth-Team had received several MediaWikiCronJobFailed alerts. Following the manual at https://wikitech.wikimedia.org/wiki/Mw-cron_jobs#Troubleshooting, I checked the status:
[urbanecm@deploy2002 ~]$ kubectl get jobs --field-selector status.successful=0 -l team=growth
NAME                                                          STATUS    COMPLETIONS   DURATION   AGE
growthexperiments-fixlinkrecommendationdata-dryrun-29465720   Running   0/1           96m        96m
growthexperiments-listtaskcounts-29465771                     Running   0/1           45m        45m
growthexperiments-refreshlinkrecommendations-s3-29465427      Running   0/1           6h29m      6h29m
growthexperiments-refreshlinkrecommendations-s5-29465787      Running   0/1           14m        14m
growthexperiments-updatementeedata-s1-29460615                Failed    0/1           3d14h      3d14h
[urbanecm@deploy2002 ~]$ kubectl logs job/growthexperiments-updatementeedata-s1-29460615 mediawiki-main-app
unable to retrieve container logs for containerd://ddfbb3217271b9f6fbabd6bbed0a71798c0e1651f9d90949601b7d38ce6519fb
[urbanecm@deploy2002 ~]$
This leaves me unable to figure out what actually happened, as I do not have access to the logs anymore. While I understand the need to purge logs, I don't understand:
- why we need to do it so quickly (3 days after the job fails),
- why we continue to send alerts even though the failure logs have been discarded (and are inaccessible using the documented methods).
Would it be possible to sync the job deletion timeline with the timeline for dropping the job records? That way, the alert would be actionable for as long as it exists (and continues to fire).
