
Kubernetes CronJobs scheduled for midnight not firing consistently
Closed, Duplicate | Public | BUG REPORT

Description

[21:19]  <   anomie> `toolforge-jobs show send-daily-report` is telling me that the job that's supposed to be run daily hasn't run since the 10th.
[21:19]  <   anomie> (for the anomiebot tool)
[21:37]  <    bd808> anomie: `kubectl get cronjobs` and `kubectl describe cronjobs/send-daily-report` agree with that. That's only comforting in that the issue seems to be not in the toolforge-jobs API but instead in something in the kubernetes cluster itself or your namespace.
[21:38] bd808 tries to find events explaining why the job failed to schedule
[21:51]  <   wm-bot> !log tools.anomiebot <root> Changed k8s cronjob send-daily-report schedule to "3 0 * * *" to see if this makes the job run again.
[21:51]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.anomiebot/SAL
[21:51]  <   anomie> Theory being it doesn't like exactly midnight?
[21:53]  <    bd808> anomie: yeah, that's my only random guess at this point. I haven't found any "events" saying that the cronjob object itself is busted. My random guess is that too many things are scheduled for midnight and that's causing this missed job as a side effect.
[21:56]  <    bd808> anomie: I did the edit via `kubectl edit cronjobs send-daily-report` as your tool. That opens up the yaml describing the cronjob object in an editor and then updates the k8s state when you save and exit the editor.
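The same schedule change could probably also be made without opening an editor, by sending a merge patch with `kubectl patch`. The snippet below is only a sketch of that idea: the helper name `build_schedule_patch` is made up for illustration, it was not what was run here, and it only builds and prints the command rather than executing it.

```python
import json
import shlex


def build_schedule_patch(cronjob_name, schedule, namespace=None):
    """Build a `kubectl patch` argv that sets spec.schedule on a CronJob.

    Hypothetical helper for illustration; it does not run kubectl itself.
    """
    # A JSON merge patch only needs the field that changes.
    payload = json.dumps({"spec": {"schedule": schedule}})
    argv = ["kubectl", "patch", "cronjob", cronjob_name, "--type=merge", "-p", payload]
    if namespace is not None:
        # Namespace is optional; omit it to use the current context's default.
        argv += ["-n", namespace]
    return argv


# Move the job three minutes past midnight, as in the !log entry above.
argv = build_schedule_patch("send-daily-report", "3 0 * * *")
print(shlex.join(argv))
```

Printing the command first makes it easy to review before running it by hand as the tool account.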

Event Timeline

I assume this is a Kubernetes bug of some sort rather than a bug in the jobs framework API. However, since the jobs framework API is meant to hide the Kubernetes details from tool maintainers, it seemed reasonable to report it here.

Based on some spot checking of `kubectl sudo get cronjobs -A` output, I think this is happening for more than just anomiebot:

NAMESPACE                      NAME                                       SCHEDULE                       SUSPEND   ACTIVE   LAST SCHEDULE   AGE
tool-anticompositebot          anticompositebot.catwatch                  0 0 * * *                      False     0        3d22h           2y167d
tool-borkedbot                 borkedbot.bot-fandom                       0 0 * * 4                      False     0        11d             590d
tool-botriconferme             botriconferme-purge-log                    0 0 1 * *                      False     0        41d             214d
tool-citationhunt              citationhunt-update-af                     0 0 1-31/4 * *                 False     0        <none>          217d
tool-covidbot                  covidbot                                   00 00 * * *                    False     0        5d22h           405d
tool-datbot                    wikiproject                                0 00 * * *                     False     0        4d22h           29d
tool-jarbot                    l111                                       0 0 * * *                      False     0        2d22h           25d
tool-musikbot                  rotate-tdyk                                0 0 * * *                      False     0        2d22h           68d
tool-pickme                    pickme                                     0 0 * * *                      False     1        503d            504d

`pickme` looks like a separate problem (a job that never ends?): it is the only row with ACTIVE at 1, and its last schedule was 503d ago.
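The spot check above could be automated along these lines. This is a rough sketch, not a tested tool: it assumes the whitespace-separated column layout shown in the listing, treats a year as 365 days, and uses a made-up one-hour slack. The function names are hypothetical; the sample rows are copied from the output above.

```python
import re

# Hours per unit in kubectl-style ages such as "3d22h" or "2y167d".
UNIT_HOURS = {"y": 365 * 24, "d": 24, "h": 1, "m": 1 / 60, "s": 1 / 3600}


def age_to_hours(age):
    """Convert an age like '3d22h' to hours; return None for '<none>'."""
    if age == "<none>":
        return None
    return sum(int(n) * UNIT_HOURS[u] for n, u in re.findall(r"(\d+)([ydhms])", age))


def parse_row(line):
    """Split one table row; the schedule is whatever sits between NAME and SUSPEND."""
    tokens = line.split()
    suspend, active, last, _age = tokens[-4:]
    return {
        "namespace": tokens[0],
        "name": tokens[1],
        "schedule": tokens[2:-4],
        "suspend": suspend,
        "active": int(active),
        "last_hours": age_to_hours(last),
    }


def is_daily_midnight(schedule):
    """True for five-field schedules equivalent to '0 0 * * *' (e.g. '00 00 * * *')."""
    if len(schedule) != 5:
        return False
    minute, hour, dom, month, dow = schedule
    if not (minute.isdigit() and hour.isdigit()):
        return False
    return int(minute) == 0 and int(hour) == 0 and (dom, month, dow) == ("*", "*", "*")


def overdue_daily_jobs(table_text, slack_hours=1):
    """Return unsuspended daily-at-midnight jobs that have not run within 24h plus slack."""
    rows = [parse_row(l) for l in table_text.strip().splitlines()[1:] if l.strip()]
    return [
        r for r in rows
        if is_daily_midnight(r["schedule"])
        and r["suspend"] == "False"
        and (r["last_hours"] is None or r["last_hours"] > 24 + slack_hours)
    ]


SAMPLE = """
NAMESPACE              NAME                       SCHEDULE        SUSPEND  ACTIVE  LAST SCHEDULE  AGE
tool-anticompositebot  anticompositebot.catwatch  0 0 * * *       False    0       3d22h          2y167d
tool-borkedbot         borkedbot.bot-fandom       0 0 * * 4       False    0       11d            590d
tool-citationhunt      citationhunt-update-af     0 0 1-31/4 * *  False    0       <none>         217d
tool-covidbot          covidbot                   00 00 * * *     False    0       5d22h          405d
"""

for job in overdue_daily_jobs(SAMPLE):
    print(job["namespace"], job["name"], job["last_hours"])
```

Weekly and monthly schedules are deliberately skipped, since deciding whether those are overdue would need the current date and not just the LAST SCHEDULE age.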