
[jobs-api,jobs-cli,infra] Transient cronjob scheduling failures on Toolforge k8s
Open, Medium, Public, Bug Report

Description

The Grid Engine crontab used to be really stable, with no failures in starting jobs. Since the k8s migration, however, I have been receiving multiple reports from users about bot reports randomly not being generated: twice within the last 10 days.

From 22 May: job-g13-elig has last schedule of 2023-05-21 instead of 2023-05-22 like the other jobs with same cron schedule.

Screenshot_2023-05-22_at_10.10.48_AM.png (246×1 px, 99 KB)

From today (2 June): job-g13-1week is skipped for a day and job-g13-elig was skipped twice!

Screenshot 2023-06-02 at 8.29.40 AM.png (294×1 px, 108 KB)

In all cases, no emails were received (--emails onfailure was configured). And since the jobs didn't start at all, the error logs don't contain anything.

All jobs use only 256 Mi memory and the default 500m CPU allocation (cpu limit is 5000m for this tool) so resource limits shouldn't be getting hit.

jobs.yml file: https://github.com/siddharthvp/SDZeroBot/blob/master/jobs.yml

Event Timeline

Same issue today (3 June). Both job-g13-1week and job-g13-elig were skipped again.

Screenshot 2023-06-03 at 10.51.15 AM.png (268×1 px, 114 KB)

And both reports were skipped again tonight, for the 4th day in a row. We work with a 7-day lag between when the G13 soon report is issued and when action is required, and we're now down to a three-day supply of reports before we are really in trouble.

This seems like a recurrence of T308300: toolforge-jobs: Kubernetes scheduler overloaded at 00:00 each day causing missed jobs. The tl;dr of the problem is that Kubernetes can't start all of the jobs scheduled for midnight at exactly midnight, since the scheduler can't take them all at once. At some point it decides that too much time has passed since the original scheduled time and skips that execution. I've added a couple of subtasks for things we can do on the infrastructure side to prevent this from happening, but the best fix at the moment is to avoid scheduling jobs at exactly midnight or at the top of the hour, and to pick a time when the scheduler is less busy.
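The skip behaviour described above can be sketched as a small model. This is a simplification for illustration, not the Kubernetes controller code: the function name `is_run_skipped` is hypothetical, and the 30-second deadline and the delays are made-up numbers, since this task does not state the values used on Toolforge.

```python
from datetime import datetime, timedelta


def is_run_skipped(scheduled: datetime, picked_up: datetime,
                   starting_deadline_seconds: int) -> bool:
    """Return True if a run would be counted as missed.

    Simplified model: a run may only start between its scheduled time
    and scheduled time + startingDeadlineSeconds. If the job is only
    picked up after that window has closed, the run is skipped.
    """
    deadline = scheduled + timedelta(seconds=starting_deadline_seconds)
    return picked_up > deadline


midnight = datetime(2023, 6, 2, 0, 0, 0)

# Hypothetical numbers: a 30 s deadline, with the job picked up
# 20 s and 45 s after midnight respectively.
print(is_run_skipped(midnight, midnight + timedelta(seconds=20), 30))  # False
print(is_run_skipped(midnight, midnight + timedelta(seconds=45), 30))  # True
```

Under this model, a larger `startingDeadlineSeconds` widens the window and makes a skip less likely, and moving the job away from a busy minute shortens the delay before it is picked up.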

So I've seen what I suspect is the same issue, the potd tool's "send" job was never triggered on 2024-01-06 at 2:00; and the tfaprotbot tool's "tfasemibot" job was never triggered on 2023-12-28 at 23:00. If the problem is running on the hour, how much should I offset by? Is it likely to run into issues if I have it run at, e.g. 22:59? I note that T338134: Use a higher `startingDeadlineSeconds` for less frequent jobs is still open - is that worth pursuing still? Should there be a subtask about non-scheduled jobs not sending any notifications as well?

The tfasemibot job is idempotent so I could just up the frequency, but that seems like it would make the overall problem worse by needlessly adding load to the cluster, no?

So I've seen what I suspect is the same issue, the potd tool's "send" job was never triggered on 2024-01-06 at 2:00; and the tfaprotbot tool's "tfasemibot" job was never triggered on 2023-12-28 at 23:00. If the problem is running on the hour, how much should I offset by? Is it likely to run into issues if I have it run at, e.g. 22:59? I note that T338134: Use a higher `startingDeadlineSeconds` for less frequent jobs is still open - is that worth pursuing still?

Yes, I think that's still a good idea. The cron parsing added for T331684: Provide a means to introduce "skew" for scheduled jobs to avoid thundering herd problems should make it easier to implement.
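One way to picture the "skew" idea is to move top-of-the-hour jobs to a minute derived from the job name, so that jobs spread out but each job keeps a stable time. The following is a minimal sketch under that assumption. It is not the implementation from T331684, and `skewed_schedule` is a hypothetical helper that only handles plain five-field cron expressions.

```python
import hashlib


def skewed_schedule(cron: str, job_name: str) -> str:
    """Move a job scheduled at minute 0 to a stable, name-derived minute.

    Schedules that do not run at minute 0 are returned unchanged.
    """
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError(f"expected a five-field cron expression, got: {cron!r}")
    if fields[0] != "0":
        return cron
    # Hash the job name so the same job always gets the same offset.
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    minute = 1 + int.from_bytes(digest[:4], "big") % 59  # always in 1..59
    return " ".join([str(minute)] + fields[1:])


# Job name taken from the report; the daily midnight schedule is an assumption.
print(skewed_schedule("0 0 * * *", "job-g13-elig"))
```

The offset here is arbitrary within the hour. The thread does not say how large an offset is needed, only that exactly midnight and the top of the hour are the busy times.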

Should there be a subtask about non-scheduled jobs not sending any notifications as well?

Not sure - jobs-emailer is in need of some love but even then sending notifications about things that did not happen is going to be quite difficult.
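Part of the difficulty is that a skipped run produces no event to react to, so a missed run can only be inferred after the fact, for example by comparing a job's last schedule time with the time it should most recently have run (as the description does by hand). A rough sketch of that comparison for a daily job follows. `missed_daily_run` is a hypothetical helper and not part of jobs-emailer, and the midnight schedule in the example is an assumption.

```python
from datetime import datetime, timedelta


def missed_daily_run(last_schedule: datetime, now: datetime,
                     hour: int, minute: int,
                     grace: timedelta = timedelta(minutes=10)) -> bool:
    """Guess whether the most recent run of a daily job was skipped."""
    # Most recent time at which the job should have been scheduled.
    expected = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if expected > now:
        expected -= timedelta(days=1)
    # Too soon after the expected time to tell a late start from a skip.
    if now - expected < grace:
        return False
    return last_schedule < expected


# Dates from the 22 May report; times of day are assumed for illustration.
now = datetime(2023, 5, 22, 10, 10)
print(missed_daily_run(datetime(2023, 5, 21, 0, 0), now, 0, 0))  # True
print(missed_daily_run(datetime(2023, 5, 22, 0, 0), now, 0, 0))  # False
```

A check like this would have to run on its own schedule, which is why it is harder than reacting to a failure event.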

dcaro renamed this task from Transient cronjob scheduling failures on Toolforge k8s to [jobs-api,jobs-cli,infra] Transient cronjob scheduling failures on Toolforge k8s. Mar 11 2024, 2:21 PM
dcaro triaged this task as Medium priority.
dcaro removed a project: Toolforge Jobs framework.
dcaro added a project: Toolforge.
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.