Description
I ran the toolforge-jobs list -l command on tools-sgebastion-10 and it says the last scheduled time was today around 4:30 PM UTC. New scheduled jobs are also stuck.
Event Timeline
@Sakretsu for which tool and what were the expected results? Your tool is one of over 3000 running in Toolforge so the admins there will need more context to know where to even start looking for issues.
The tools I checked are "itwiki" and "omegabot". As for the issue: the scheduled jobs below are no longer firing.
itwiki:
$ kubectl get cronjob
NAME                    SCHEDULE        SUSPEND   ACTIVE   LAST SCHEDULE   AGE
itwiki-archivebot       0 22,23 * * *   False     0        22h             36d
itwiki-cleansandbox     23 */2 * * *    False     0        5h11m           36d
itwiki-datiwikipedie    21 2 * * *      False     0        19h             36d
itwiki-deletionbot      */8 * * * *     False     1        5h2m            34d
itwiki-draftbot         0,30 * * * *    False     0        5h4m            35d
itwiki-lavorosporco     0 3 28-31 * *   False     0        11d             36d
itwiki-listdrafts       15,45 * * * *   False     0        5h19m           36d
itwiki-markadmins       5 * * * *       False     0        5h29m           34d
itwiki-orphanizerbot    */3 * * * *     False     0        5h4m            34d
itwiki-wikiquotestats   58 21 * * *     False     0        23h             36d
omegabot:
$ kubectl get cronjob
NAME               SCHEDULE            SUSPEND   ACTIVE   LAST SCHEDULE   AGE
archive-reports    30 2 * * *          False     0        19h             36d
draft-categories   0 22,23 * * *       False     0        22h             36d
move-noticeboard   0 22,23 28-31 * *   False     0        10d             36d
touch-drafts       0 3 * * 3           False     0        18h             36d
welcome            0 * * * *           False     0        5h35m           36d
It looks like things stopped working as expected around 2022-05-11 16:00 UTC for both tools.
One-off jobs seem to be working generally (which was my first fear):
$ become stashbot
$ toolforge-jobs run t308189-test --command date --image tf-php74
$ cat t308189-test.*
Wed 11 May 2022 09:56:42 PM UTC
Scheduled jobs, however, generally seem not to be working:
$ become stashbot
$ toolforge-jobs run t308189-minute --command date --image tf-php74 --schedule "* * * * *"
$ # wander off for a bit
$ kubectl get cronjobs
NAME             SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
t308189-minute   * * * * *   False     0        <none>          5m54s
Per kubectl get cronjobs --as-group=system:masters --as=admin --all-namespaces, no scheduled cronjob has fired since the cluster was upgraded to v1.21.8 (starting around 2022-05-11 16:00 UTC).
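That cluster-wide check is easier to script with `kubectl`'s `custom-columns` output, which puts `lastScheduleTime` in a fixed field instead of the variable-width default table. A minimal sketch; the `never_fired` helper name is mine, and the sample rows are canned so the filter can be shown without a cluster:

```shell
# Hypothetical helper: given "namespace name lastScheduleTime" rows, print
# the CronJobs that have never fired (lastScheduleTime is unset/<none>).
# In a real cluster the input would come from:
#   kubectl get cronjobs -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,LAST:.status.lastScheduleTime
never_fired() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Demo on canned output:
printf '%s\n' \
  'tool-itwiki itwiki-deletionbot 2022-05-11T16:00:00Z' \
  'tool-stashbot t308189-minute <none>' \
  | never_fired
# prints: tool-stashbot/t308189-minute
```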
For what it's worth, and it may or may not be related: the job sambot on my tool sam failed and restarted earlier. It is a continuous job.
+------------+------------------------------------------------------------------------+
| Hints:     | Last run at 2022-05-11T17:43:59Z. Pod in 'Running' phase. Pod has been |
|            | restarted 1 times. State 'running'. Started at '2022-05-11T17:44:40Z'. |
+------------+------------------------------------------------------------------------+
Editing the config for the kube-controller-manager and disabling the CronJobControllerV2 feature gate has gotten the jobs triggering again.
Specifically, editing /etc/kubernetes/manifests/kube-controller-manager.yaml on each of the 3 controllers and replacing:
- --feature-gates=TTLAfterFinished=true
with:
- --feature-gates=CronJobControllerV2=false,TTLAfterFinished=true
That updates the kube-controller-manager pods right away (the kubelet restarts static pods when their manifests change; no need to run anything else).
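For the record, the same one-line manifest edit can be scripted rather than done by hand on each control node. A sketch, assuming the flag line looks exactly as quoted above; the `disable_cronjob_v2` helper name is mine:

```shell
# Hypothetical helper: flip the CronJobControllerV2 feature gate off in a
# kube-controller-manager static-pod manifest. The kubelet watches
# /etc/kubernetes/manifests and restarts the pod when the file changes.
disable_cronjob_v2() {
  # $1: path to kube-controller-manager.yaml
  sed -i 's/--feature-gates=TTLAfterFinished=true/--feature-gates=CronJobControllerV2=false,TTLAfterFinished=true/' "$1"
}

# Run on each of the 3 control nodes:
# disable_cronjob_v2 /etc/kubernetes/manifests/kube-controller-manager.yaml
```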
Now we are seeing some jobs that are skipped because they missed too many triggers:
root@tools-k8s-control-2:/etc/kubernetes# kubectl logs --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager --tail 1000 | grep -i cron
...
E0511 23:33:09.914018       1 cronjob_controller.go:269] Cannot determine if tool-jarbot-ii/wpnsb500r needs to be started: too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew
...
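Those error lines can be turned into a worklist of affected jobs by grepping out the namespace/name pairs. A sketch; the `extract_stuck_jobs` helper name is mine, and the pattern matches the message format shown above:

```shell
# Hypothetical helper: extract "namespace/name" from controller-manager
# "Cannot determine if <ns>/<name> needs to be started" error lines.
extract_stuck_jobs() {
  grep -o 'Cannot determine if [^ ]* needs' | awk '{print $4}' | sort -u
}

# Example on one captured log line:
echo 'E0511 23:33:09.914018 1 cronjob_controller.go:269] Cannot determine if tool-jarbot-ii/wpnsb500r needs to be started: too many missed start time (> 100).' \
  | extract_stuck_jobs
# prints: tool-jarbot-ii/wpnsb500r
```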
Looking...
Lowering priority from UBN! to High. We have more investigation to do, but at least some jobs are being scheduled and run now. The once-per-minute test job I set up in T308189#7922708 started firing at 2022-05-11 23:29:11 UTC and has continued since, with a few missed intervals.
For the cronjobs that don't get started due to the "too many missed start time" error, following
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#cron-job-limitations
you have to add .spec.startingDeadlineSeconds: 15 to the CronJob definition (any value above 10 seconds works).
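When many CronJobs are affected, the field can be set with `kubectl patch` instead of editing each definition by hand. A sketch; the `bump_starting_deadline` helper name is mine, `itwiki-deletionbot` is just one of the jobs listed earlier, and the helper only prints the command so it can be reviewed before applying:

```shell
# Hypothetical helper: emit a `kubectl patch` command that sets
# .spec.startingDeadlineSeconds on a CronJob via a JSON merge patch.
bump_starting_deadline() {
  # $1: CronJob name; $2: deadline in seconds (use a value above 10)
  echo "kubectl patch cronjob $1 --type merge -p '{\"spec\":{\"startingDeadlineSeconds\":$2}}'"
}

# Review the command, then pipe to sh to actually apply it:
bump_starting_deadline itwiki-deletionbot 15
```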
I have been using kubectl logs --as-group=system:masters --as=admin -n kube-system -l component=kube-controller-manager -f | grep -i cron | grep -v 'type="Normal"' to look for tools that are having issues.
Follow up suggestion: WMCS should have an icinga alert or other paging watch on kubernetes cronjobs now that we are advocating for folks to use them over grid engine crons.
Everything seems to be working again properly. I've filed some actionables and marked those as subtasks, so closing this task.
Sorry for the disruption!
There's a new report about this at T308300: toolforge-jobs: Kubernetes scheduler overloaded at 00:00 each day causing missed jobs.