[21:19] < anomie> `toolforge-jobs show send-daily-report` is telling me that the job that's supposed to be run daily hasn't run since the 10th.
[21:19] < anomie> (for the anomiebot tool)
[21:37] < bd808> anomie: `kubectl get cronjobs` and `kubectl describe cronjobs/send-daily-report` agree with that. That's only comforting in that the issue seems to be not in the toolforge-jobs API but instead in something in the kubernetes cluster itself or your namespace.
[21:38] bd808 tries to find events explaining why the job failed to schedule
[21:51] < wm-bot> !log tools.anomiebot <root> Changed k8s cronjob send-daily-report schedule to "3 0 * * *" to see if this makes the job run again.
[21:51] < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anomiebot/SAL
[21:51] < anomie> Theory being it doesn't like exactly midnight?
[21:53] < bd808> anomie: yeah, that's my only random guess at this point. I haven't found any "events" saying that the cronjob object itself is busted. My random guess is that too many things are scheduled for midnight and that's causing this missed job as a side effect.
[21:56] < bd808> anomie: I did the edit via `kubectl edit cronjobs send-daily-report` as your tool. That opens up the yaml describing the cronjob object in an editor and then updates the k8s state when you save and exit the editor.
Description
Event Timeline
I assume this is generally a Kubernetes bug of some sort rather than a bug in the jobs framework API. However, since we are trying to hide Kubernetes a bit behind the jobs framework API, it seemed reasonable to report it here.
Based on some spot checking of `kubectl sudo get cronjobs -A` output, I think this is happening for more than just anomiebot:
NAMESPACE              NAME                       SCHEDULE        SUSPEND  ACTIVE  LAST SCHEDULE  AGE
tool-anticompositebot  anticompositebot.catwatch  0 0 * * *       False    0       3d22h          2y167d
tool-borkedbot         borkedbot.bot-fandom       0 0 * * 4       False    0       11d            590d
tool-botriconferme     botriconferme-purge-log    0 0 1 * *       False    0       41d            214d
tool-citationhunt      citationhunt-update-af     0 0 1-31/4 * *  False    0       <none>         217d
tool-covidbot          covidbot                   00 00 * * *     False    0       5d22h          405d
tool-datbot            wikiproject                0 00 * * *      False    0       4d22h          29d
tool-jarbot            l111                       0 0 * * *       False    0       2d22h          25d
tool-musikbot          rotate-tdyk                0 0 * * *       False    0       2d22h          68d
tool-pickme            pickme                     0 0 * * *       False    1       503d           504d
pickme looks like a separate problem (job that never ends?).
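For anyone repeating this spot check, here is a rough Python sketch of the test applied to the SCHEDULE column. The helper name `runs_at_midnight` is made up for illustration and is not part of toolforge-jobs or kubectl. It only handles plain numeric minute and hour fields, so ranges or lists in those two fields are treated as "not midnight".

```python
def runs_at_midnight(schedule: str) -> bool:
    """Return True if a 5-field cron expression has minute 0 and hour 0.

    Hypothetical helper for illustration only. The day, month and weekday
    fields are ignored, so "0 0 * * 4" (midnight on one weekday) also counts.
    """
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {schedule!r}")
    minute, hour = fields[0], fields[1]
    # "0" and "00" both mean zero, so compare numerically rather than as text.
    return minute.isdigit() and hour.isdigit() and int(minute) == 0 and int(hour) == 0


# Schedules copied from the listing above, plus the edited anomiebot schedule.
schedules = [
    "0 0 * * *",
    "0 0 * * 4",
    "0 0 1-31/4 * *",
    "00 00 * * *",
    "0 00 * * *",
    "3 0 * * *",
]
midnight = [s for s in schedules if runs_at_midnight(s)]
```

With these inputs, every schedule except the edited `3 0 * * *` is flagged.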
This is essentially a duplicate of T308300: toolforge-jobs: Kubernetes scheduler overloaded at 00:00 each day causing missed jobs. The fix is to not run your jobs at midnight if possible.
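One possible way to pick a non-midnight time without every tool choosing the same minute is to derive the minute from the job name. This is only a sketch under that assumption. The function name `off_midnight_schedule` is hypothetical and nothing in the jobs framework does this for you.

```python
import hashlib


def off_midnight_schedule(job_name: str, hour: int = 0) -> str:
    """Build a daily cron schedule whose minute is never 0.

    Hypothetical helper: the minute is derived from a hash of the job name,
    so the same job always gets the same minute and different jobs tend to
    land on different minutes.
    """
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    minute = 1 + digest[0] % 59  # always in 1..59, so never exactly on the hour
    return f"{minute} {hour} * * *"


schedule = off_midnight_schedule("send-daily-report")
```

The resulting string has the same shape as the `3 0 * * *` schedule used for anomiebot above and could be used in the same way.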