
Toolforge jobs stopped getting scheduled around the same time as the Toolforge k8s cluster upgrade
Closed, Resolved · Public · BUG REPORT

Description

I ran the toolforge-jobs list -l command on tools-sgebastion-10 and it says the last scheduled time was today around 4:30 PM UTC. Newly scheduled jobs are also stuck.

Event Timeline

@Sakretsu for which tool and what were the expected results? Your tool is one of over 3000 running in Toolforge so the admins there will need more context to know where to even start looking for issues.

The tools I checked are "itwiki" and "omegabot". As for the issue, basically the scheduled jobs are no longer being run.

itwiki:

$ kubectl get cronjob
NAME                    SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
itwiki-archivebot       0 22,23 * * *   False     0        22h             36d
itwiki-cleansandbox     23 */2 * * *    False     0        5h11m           36d
itwiki-datiwikipedie    21 2 * * *      False     0        19h             36d
itwiki-deletionbot      */8 * * * *     False     1        5h2m            34d
itwiki-draftbot         0,30 * * * *    False     0        5h4m            35d
itwiki-lavorosporco     0 3 28-31 * *   False     0        11d             36d
itwiki-listdrafts       15,45 * * * *   False     0        5h19m           36d
itwiki-markadmins       5 * * * *       False     0        5h29m           34d
itwiki-orphanizerbot    */3 * * * *     False     0        5h4m            34d
itwiki-wikiquotestats   58 21 * * *     False     0        23h             36d

omegabot:

$ kubectl get cronjob
NAME               SCHEDULE            SUSPEND   ACTIVE   LAST SCHEDULE   AGE
archive-reports    30 2 * * *          False     0        19h             36d
draft-categories   0 22,23 * * *       False     0        22h             36d
move-noticeboard   0 22,23 28-31 * *   False     0        10d             36d
touch-drafts       0 3 * * 3           False     0        18h             36d
welcome            0 * * * *           False     0        5h35m           36d

It looks like things stopped working as expected around 2022-05-11 16:00 UTC for both tools.

bd808 renamed this task from "Toolforge jobs stopped getting scheduled" to "Toolforge jobs stopped getting scheduled around the same time as the Toolforge k8s cluster upgrade". (May 11 2022, 9:50 PM)
bd808 triaged this task as High priority.
bd808 added subscribers: taavi, dcaro.

One-off jobs seem to be working in general (that they had also broken was my first fear):

$ become stashbot
$ toolforge-jobs run t308189-test --command date --image tf-php74
$ cat t308189-test.*
Wed 11 May 2022 09:56:42 PM UTC

Scheduled jobs, on the other hand, generally seem to not be working:

$ become stashbot
$ toolforge-jobs run t308189-minute --command date --image tf-php74 --schedule "* * * * *"
$ # wander off for a bit
$ kubectl get cronjobs
NAME             SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
t308189-minute   * * * * *   False     0        <none>          5m54s 

bd808 raised the priority of this task from High to Unbreak Now! (May 11 2022, 10:21 PM)

Per kubectl get cronjobs --as-group=system:masters --as=admin --all-namespaces, no scheduled cronjob has fired since the cluster was upgraded to v1.21.8 (starting around 2022-05-11 16:00 UTC).
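
For reference, one quick way to eyeball the last schedule time of every cronjob in the cluster (a sketch; the custom-columns header names are arbitrary):

$ kubectl get cronjobs --as-group=system:masters --as=admin --all-namespaces \
    -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,LAST:.status.lastScheduleTime | sort -k 3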

For what it's worth, and it may or may not be related, the job sambot on my tool sam failed and restarted earlier. It is a continuous job.

+------------+------------------------------------------------------------------------+
| Hints:     | Last run at 2022-05-11T17:43:59Z. Pod in 'Running' phase. Pod has been |
|            | restarted 1 times. State 'running'. Started at '2022-05-11T17:44:40Z'. |
+------------+------------------------------------------------------------------------+

Editing the config for the kube-controller-manager and disabling the CronJobControllerV2 feature gate has gotten the jobs triggering again.
Specifically, this meant editing /etc/kubernetes/manifests/kube-controller-manager.yaml on each of the 3 control nodes and replacing:

- --feature-gates=TTLAfterFinished=true

with:

- --feature-gates=CronJobControllerV2=false,TTLAfterFinished=true

That updates the kube-controller-manager pods right away (no need to run anything else).
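
To double-check that the static pods actually picked up the new flags, something like this should show the updated feature gates (a sketch, reusing the same admin impersonation as the other commands in this task):

$ kubectl get pods --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager -o yaml | grep feature-gates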

Now we are seeing some jobs being skipped because they missed too many start times:

root@tools-k8s-control-2:/etc/kubernetes# kubectl logs --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager --tail 1000 | grep -i cron
...
E0511 23:33:09.914018       1 cronjob_controller.go:269] Cannot determine if tool-jarbot-ii/wpnsb500r needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
...

Looking...

bd808 lowered the priority of this task from Unbreak Now! to High. (May 11 2022, 11:47 PM)

Lowering priority from UBN! to High. We have more investigation to do, but at least some jobs are being scheduled and run now. The once-per-minute test job I set up in T308189#7922708 started firing at 2022-05-11 23:29:11 UTC and has continued since, with a few missed intervals.
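
For anyone following along, a quick way to spot-check when that test cronjob last fired (a sketch, run as the owning tool):

$ become stashbot
$ kubectl get cronjob t308189-minute -o jsonpath='{.status.lastScheduleTime}{"\n"}'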

For the cronjobs that don't get started due to the "too many missed start time" error, following https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-job-limitations, you have to add .spec.startingDeadlineSeconds: 15 to the cronjob definition (any value above 10 seconds works).
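
One way to apply that directly with kubectl (a sketch; <jobname> is a placeholder, and the jobs framework may overwrite the object if the job is ever recreated):

$ kubectl patch cronjob <jobname> --type merge -p '{"spec":{"startingDeadlineSeconds":15}}'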

I have been using kubectl logs --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager -f | grep -i cron | grep -v 'type="Normal"' to look for tools that are having issues.

Follow-up suggestion: WMCS should have an Icinga alert or other paging watch on Kubernetes cronjobs, now that we are advocating for folks to use them over grid engine crons.
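
Something as simple as a canary check could work, e.g. (a hypothetical sketch; the canary-minute job and tool-canary namespace are made-up names, and the 300s threshold is arbitrary):

# Alert (Nagios/Icinga style exit codes) if the canary cronjob has not been
# scheduled within the last 5 minutes.
last=$(kubectl get cronjob canary-minute -n tool-canary -o jsonpath='{.status.lastScheduleTime}')
age=$(( $(date +%s) - $(date -d "$last" +%s) ))
if [ "$age" -gt 300 ]; then
    echo "CRITICAL: canary cronjob last scheduled ${age}s ago"
    exit 2
fi
echo "OK: canary cronjob last scheduled ${age}s ago"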

Everything seems to be working properly again. I've filed some actionable follow-ups and marked them as subtasks, so I'm closing this task.

Sorry for the disruption!