
toolforge-jobs: Kubernetes scheduler overloaded at 00:00 each day causing missed jobs
Closed, Resolved | Public | BUG REPORT

Description

tools.giftbot@tools-sgebastion-10:~$ date # 2022-05-13T00:36:36Z
Fri 13 May 2022 12:36:36 AM UTC
tools.giftbot@tools-sgebastion-10:~$ toolforge-jobs list
Job name:        Job type:                    Status:
---------------  ---------------------------  ----------------------------------------
adt              schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
ausrufer         schedule: 0 22,23 * * 0      Last schedule time: 2022-05-08T23:00:00Z
autoarchiv-n     schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
autoarchiv-q     schedule: 0 1,2,10,11 * * *  Last schedule time: 2022-05-12T01:00:00Z
autoarchiv-s     schedule: 0 1,2,10,11 * * *  Last schedule time: 2022-05-12T01:00:00Z
autoarchiv-wikt  schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
check            schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
checknotdead     schedule: 0 0 1 * *          Last schedule time: 2022-05-01T00:00:00Z
daysection       schedule: 0 22,23 * * *      Last schedule time: 2022-05-11T23:00:00Z
einladung-jwp    schedule: 0 0 * * 6          Last schedule time: 2022-05-07T00:00:00Z
ibchem           schedule: 0 0 1 * *          Last schedule time: 2022-05-01T00:00:00Z
inaktivebots     schedule: 0 0 1 * *          Last schedule time: 2022-05-01T00:00:00Z
kla              schedule: 0 0 28-31 * *      Last schedule time: 2022-04-30T00:00:00Z
kurzeartikel     schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
listunreviewed   schedule: 0 0 * * *          Last schedule time: 2022-05-13T00:00:00Z
mineralbilder    schedule: 0 0 1 */3 *        Last schedule time: 2022-04-01T00:00:00Z
picdwb           schedule: 0 22,23 * * *      Last schedule time: 2022-05-11T23:00:00Z
rue              schedule: 0 22,23 * * *      Last schedule time: 2022-05-12T22:00:00Z
sg               schedule: 0 22,23 * * *      Last schedule time: 2022-05-11T23:00:00Z
siku             schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
unreviewed       schedule: 0 0,12 * * *       Last schedule time: 2022-05-13T00:00:00Z
unreviewedmoves  schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
updateuv         schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
wkdezb-als       schedule: 0 18 * * 1-5       Last schedule time: 2022-05-11T18:00:00Z
wkdezb-de        schedule: 0 18 * * 1-5       Last schedule time: 2022-05-12T18:00:00Z
wkdezb-frr       schedule: 0 18 * * 1-5       Last schedule time: 2022-05-11T18:00:00Z
wpbvk            schedule: 0 0 * * *          Last schedule time: 2022-05-12T00:00:00Z
gva              continuous                   Running
gvm              continuous                   Running
mg               continuous                   Running
sga              continuous                   Running
vm               continuous                   Running
tools.giftbot@tools-sgebastion-10:~$ toolforge-jobs show daysection
+------------+------------------------------------------+
| Job name:  | daysection                               |
+------------+------------------------------------------+
| Command:   | ./check-timezone 0 ./daysection.tcl      |
+------------+------------------------------------------+
| Job type:  | schedule: 0 22,23 * * *                  |
+------------+------------------------------------------+
| Image:     | tf-tcl86                                 |
+------------+------------------------------------------+
| File log:  | yes                                      |
+------------+------------------------------------------+
| Emails:    | onfailure                                |
+------------+------------------------------------------+
| Resources: | default                                  |
+------------+------------------------------------------+
| Status:    | Last schedule time: 2022-05-11T23:00:00Z |
+------------+------------------------------------------+
| Hints:     | No pods were created for this job.       |
+------------+------------------------------------------+
tools.giftbot@tools-sgebastion-10:~$ grep -A1 -- ---- daysection.out | tail -n1
10.05.2022 22:00:24 +0000

The jobs adt, autoarchiv-n, autoarchiv-wikt, check, daysection, kurzeartikel, picdwb, rue, sg, siku, unreviewedmoves, updateuv, wkdezb-als, wkdezb-frr, wpbvk should have been scheduled after their last schedule time but they weren't.
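The check behind this list can be sketched roughly as follows: find the most recent time at which a cron expression should have fired, then compare it with the "Last schedule time" that toolforge-jobs reports. This is only an illustration, not how toolforge-jobs or Kubernetes evaluate schedules. The helper names (`parse_field`, `compile_schedule`, `matches`, `previous_fire_time`, `missed_run`) are hypothetical, times are treated as naive UTC, and only the plain five-field syntax seen above (lists, ranges, `*`, `/step`) is handled.

```python
from datetime import datetime, timedelta


def parse_field(field, lo, hi):
    """Expand one cron field such as '0', '22,23', '1-5' or '*/3' into a set of ints."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            start, end = (int(x) for x in rng.split("-"))
        else:
            start = int(rng)
            end = hi if step else start  # 'N/step' runs from N to the field maximum
        values.update(range(start, end + 1, int(step) if step else 1))
    return values


def compile_schedule(expr):
    """Parse a five-field cron expression once so it can be matched repeatedly."""
    minute, hour, dom, month, dow = expr.split()
    return {
        "minute": parse_field(minute, 0, 59),
        "hour": parse_field(hour, 0, 23),
        "dom": parse_field(dom, 1, 31),
        "month": parse_field(month, 1, 12),
        # cron accepts both 0 and 7 for Sunday; normalise to 0..6
        "dow": {d % 7 for d in parse_field(dow, 0, 7)},
        # classic cron rule: if both day fields are restricted, either may match
        "dom_or_dow": not dom.startswith("*") and not dow.startswith("*"),
    }


def matches(sched, t):
    """Return True if datetime t (minute resolution) satisfies the schedule."""
    if t.minute not in sched["minute"] or t.hour not in sched["hour"]:
        return False
    if t.month not in sched["month"]:
        return False
    dom_ok = t.day in sched["dom"]
    dow_ok = t.isoweekday() % 7 in sched["dow"]  # Sunday -> 0, Monday -> 1, ...
    return (dom_ok or dow_ok) if sched["dom_or_dow"] else (dom_ok and dow_ok)


def previous_fire_time(expr, now, max_days=366):
    """Step back minute by minute from now to the latest matching time, or None."""
    sched = compile_schedule(expr)
    t = now.replace(second=0, microsecond=0)
    limit = t - timedelta(days=max_days)
    while t >= limit:
        if matches(sched, t):
            return t
        t -= timedelta(minutes=1)
    return None


def missed_run(expr, last_schedule, now):
    """True if the schedule should have fired after last_schedule but before now."""
    expected = previous_fire_time(expr, now)
    return expected is not None and expected > last_schedule


now = datetime(2022, 5, 13, 0, 36, 36)  # timestamp from the `date` call above
print(missed_run("0 22,23 * * *", datetime(2022, 5, 11, 23, 0), now))  # daysection -> True
print(missed_run("0 0 * * *", datetime(2022, 5, 13, 0, 0), now))       # listunreviewed -> False
```

The minute-by-minute scan is slow but easy to verify by hand, which is enough for a one-off check against a few dozen jobs.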

Event Timeline

My understanding from T308189 ("Toolforge jobs stopped getting scheduled around the same time as the Toolforge k8s cluster upgrade") was that this issue had been fixed, but it's possible something was missed. I've flagged this to the other Toolforge admins who worked on that.

taavi triaged this task as High priority.
taavi subscribed.

Looks like the Kubernetes cronjob scheduler may be getting overloaded at midnight, given how many tools are running jobs at that point. I've increased the scheduling tolerance of your jobs so that Kubernetes starts them even if that means running slightly later than the scheduled time. However, a better solution, where possible, would be to run hourly/daily jobs at an arbitrary time (say, 18:37 daily) instead of exactly at midnight or at the top of the hour.
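One way to pick such an off-peak time without choosing it by hand is to derive it from the job name, so each job gets a stable but spread-out slot. The following is a minimal sketch under that assumption. The function name `spread_daily_schedule` and its `avoid_minutes` parameter are hypothetical and not part of toolforge-jobs. It is also only suitable for jobs that do not depend on running at a particular hour.

```python
import hashlib


def spread_daily_schedule(job_name, avoid_minutes=(0,)):
    """Return a daily cron expression whose hour and minute depend on the job name.

    The same name always maps to the same slot, and minutes listed in
    avoid_minutes (by default the top of the hour) are never chosen.
    """
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    hour = digest[0] % 24
    candidates = [m for m in range(60) if m not in avoid_minutes]
    minute = candidates[digest[1] % len(candidates)]
    return f"{minute} {hour} * * *"


# Example with job names from the listing above; each gets its own stable slot.
for name in ("adt", "check", "kurzeartikel"):
    print(name, spread_daily_schedule(name))
```

Hashing rather than calling a random number generator keeps the schedule reproducible, so the job definition does not change every time it is regenerated.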

I have flushed the jobs and rescheduled most of them on the 7th minute, like I did for sge. I hope that helps with the load (if it is that).
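Moving a job to the 7th minute only changes the first field of its cron expression. A minimal sketch of that rewrite is shown here. The helper name `shift_minute` is hypothetical, and the new schedule would still have to be applied by recreating the job.

```python
def shift_minute(expr, minute=7):
    """Replace the minute field of a five-field cron expression."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {expr!r}")
    if not 0 <= minute <= 59:
        raise ValueError(f"minute out of range: {minute}")
    fields[0] = str(minute)
    return " ".join(fields)


print(shift_minute("0 22,23 * * *"))  # 7 22,23 * * *
print(shift_minute("0 0 1 */3 *"))    # 7 0 1 */3 *
```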

Hi, I have the same issue: for most of my jobs, the last schedule time is 2022-05-12.

bd808 renamed this task from "toolforge-jobs: scheduled jobs stopped being scheduled" to "toolforge-jobs: Kubernetes scheduler overloaded at 00:00 each day causing missed jobs". Dec 13 2022, 5:03 PM
bd808 added a project: Kubernetes.
bd808 added subscribers: bd808, Anomie.

> Looks like the Kubernetes cronjob scheduler may be getting overloaded at midnight given how many tools are running jobs at that point.

Is the Pod scheduler actually what is being overloaded? I'm not aware of us adding any extra capacity to the Kubernetes cluster after rolling out the jobs service and then prompting all grid engine users to migrate to Kubernetes at their earliest convenience. Are we at a point where we should add some additional Kubernetes worker nodes (and maybe decomm some grid nodes)?

I'm fairly sure this is no longer a problem after we added more resources to the cluster control plane, and a few spot checks seem to confirm that. Please re-open if I'm wrong.