Description
I ran the toolforge-jobs list -l command on tools-sgebastion-10 and it says the last scheduled time was today around 4:30 PM UTC. New scheduled jobs are also stuck.
Event Timeline
@Sakretsu for which tool and what were the expected results? Your tool is one of over 3000 running in Toolforge so the admins there will need more context to know where to even start looking for issues.
The tools I checked are "itwiki" and "omegabot". As for the issue: the scheduled jobs below are no longer firing.
itwiki:
$ kubectl get cronjob
NAME                    SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
itwiki-archivebot       0 22,23 * * *   False     0        22h             36d
itwiki-cleansandbox     23 */2 * * *    False     0        5h11m           36d
itwiki-datiwikipedie    21 2 * * *      False     0        19h             36d
itwiki-deletionbot      */8 * * * *     False     1        5h2m            34d
itwiki-draftbot         0,30 * * * *    False     0        5h4m            35d
itwiki-lavorosporco     0 3 28-31 * *   False     0        11d             36d
itwiki-listdrafts       15,45 * * * *   False     0        5h19m           36d
itwiki-markadmins       5 * * * *       False     0        5h29m           34d
itwiki-orphanizerbot    */3 * * * *     False     0        5h4m            34d
itwiki-wikiquotestats   58 21 * * *     False     0        23h             36d
omegabot:
$ kubectl get cronjob
NAME               SCHEDULE            SUSPEND   ACTIVE   LAST SCHEDULE   AGE
archive-reports    30 2 * * *          False     0        19h             36d
draft-categories   0 22,23 * * *       False     0        22h             36d
move-noticeboard   0 22,23 28-31 * *   False     0        10d             36d
touch-drafts       0 3 * * 3           False     0        18h             36d
welcome            0 * * * *           False     0        5h35m           36d
It looks like things stopped working as expected around 2022-05-11 16:00 UTC for both tools.
One-off jobs seem to be working generally (which was my first fear):
$ become stashbot
$ toolforge-jobs run t308189-test --command date --image tf-php74
$ cat t308189-test.*
Wed 11 May 2022 09:56:42 PM UTC
Scheduled jobs, however, generally seem not to be working:
$ become stashbot
$ toolforge-jobs run t308189-minute --command date --image tf-php74 --schedule "* * * * *"
$ # wander off for a bit
$ kubectl get cronjobs
NAME             SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
t308189-minute   * * * * *   False     0        <none>          5m54s
Per kubectl get cronjobs --as-group=system:masters --as=admin --all-namespaces, no scheduled cronjob has fired since the cluster was upgraded to v1.21.8 (starting around 2022-05-11 16:00 UTC).
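That cluster-wide check is easier to script with `kubectl`'s `custom-columns` output, which puts `lastScheduleTime` in a fixed field instead of the variable-width default table. A minimal sketch; the `never_fired` helper name is mine, and the sample rows are canned so the filter can be shown without a cluster:

```shell
# Hypothetical helper: given "namespace name lastScheduleTime" rows, print
# the CronJobs that have never fired (lastScheduleTime is unset/<none>).
# In a real cluster the input would come from:
#   kubectl get cronjobs -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,LAST:.status.lastScheduleTime
never_fired() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Demo on canned output:
printf '%s\n' \
  'tool-itwiki itwiki-deletionbot 2022-05-11T16:00:00Z' \
  'tool-stashbot t308189-minute <none>' \
  | never_fired
# prints: tool-stashbot/t308189-minute
```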
For what it's worth, and it may or may not be related: the job sambot on my tool sam failed and restarted earlier. It is a continuous job.
+------------+------------------------------------------------------------------------+
| Hints:     | Last run at 2022-05-11T17:43:59Z. Pod in 'Running' phase. Pod has been |
|            | restarted 1 times. State 'running'. Started at '2022-05-11T17:44:40Z'. |
+------------+------------------------------------------------------------------------+
Editing the config for the kube-controller-manager and disabling the CronJobControllerV2 feature gate has gotten the jobs triggering again.
Specifically, editing /etc/kubernetes/manifests/kube-controller-manager.yaml on each of the 3 controllers and replacing:
- --feature-gates=TTLAfterFinished=true
with:
- --feature-gates=CronJobControllerV2=false,TTLAfterFinished=true
That updates the kube-controller-manager pods right away (the kubelet restarts static pods when their manifests change; no need to run anything else).
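For the record, the same one-line manifest edit can be scripted rather than done by hand on each control node. A sketch, assuming the flag line looks exactly as quoted above; the `disable_cronjob_v2` helper name is mine:

```shell
# Hypothetical helper: flip the CronJobControllerV2 feature gate off in a
# kube-controller-manager static-pod manifest. The kubelet watches
# /etc/kubernetes/manifests and restarts the pod when the file changes.
disable_cronjob_v2() {
  # $1: path to kube-controller-manager.yaml
  sed -i 's/--feature-gates=TTLAfterFinished=true/--feature-gates=CronJobControllerV2=false,TTLAfterFinished=true/' "$1"
}

# Run on each of the 3 control nodes:
# disable_cronjob_v2 /etc/kubernetes/manifests/kube-controller-manager.yaml
```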
Now we are seeing some jobs that are skipped because they missed too many triggers:
root@tools-k8s-control-2:/etc/kubernetes# kubectl logs --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager --tail 1000 | grep -i cron
...
E0511 23:33:09.914018       1 cronjob_controller.go:269] Cannot determine if tool-jarbot-ii/wpnsb500r needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
...
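Those error lines can be turned into a worklist of affected jobs by grepping out the namespace/name pairs. A sketch; the `extract_stuck_jobs` helper name is mine, and the pattern matches the message format shown above:

```shell
# Hypothetical helper: extract "namespace/name" from controller-manager
# "Cannot determine if <ns>/<name> needs to be started" error lines.
extract_stuck_jobs() {
  grep -o 'Cannot determine if [^ ]* needs' | awk '{print $4}' | sort -u
}

# Example on one captured log line:
echo 'E0511 23:33:09.914018 1 cronjob_controller.go:269] Cannot determine if tool-jarbot-ii/wpnsb500r needs to be started: too many missed start time (> 100).' \
  | extract_stuck_jobs
# prints: tool-jarbot-ii/wpnsb500r
```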
Looking...
Lowering priority from UBN! to High. We have more investigation to do, but at least some jobs are being scheduled and run now. The once-per-minute test job I set up in T308189#7922708 started firing at 2022-05-11 23:29:11 UTC and has continued since, with a few missed intervals.
For the cronjobs that don't get started due to the "too many missed start time" error, following
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-job-limitations
you have to add .spec.startingDeadlineSeconds: 15 to the CronJob definition (any value above 10 seconds works).
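When many CronJobs are affected, the field can be set with `kubectl patch` instead of editing each definition by hand. A sketch; the `bump_starting_deadline` helper name is mine, `itwiki-deletionbot` is just one of the jobs listed earlier, and the helper only prints the command so it can be reviewed before applying:

```shell
# Hypothetical helper: emit a `kubectl patch` command that sets
# .spec.startingDeadlineSeconds on a CronJob via a JSON merge patch.
bump_starting_deadline() {
  # $1: CronJob name; $2: deadline in seconds (use a value above 10)
  echo "kubectl patch cronjob $1 --type merge -p '{\"spec\":{\"startingDeadlineSeconds\":$2}}'"
}

# Review the command, then pipe to sh to actually apply it:
bump_starting_deadline itwiki-deletionbot 15
```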
I have been using kubectl logs --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager -f | grep -i cron | grep -v 'type="Normal"' to look for tools that are having issues.
Follow up suggestion: WMCS should have an icinga alert or other paging watch on kubernetes cronjobs now that we are advocating for folks to use them over grid engine crons.
Everything seems to be working again properly. I've filed some actionables and marked those as subtasks, so closing this task.
Sorry for the disruption!
There's a new report about this at T308300: toolforge-jobs: Kubernetes scheduler overloaded at 00:00 each day causing missed jobs.