
[jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime
Open, Stalled, Medium · Public · Feature Request

Description

Some bots on Toolforge, such as eranbot (which powers CopyPatrol), run for a long time and can get "stuck". Our solution, albeit admittedly hacky, has been to set a maximum run time on the cron job with jsub … -l h_rt=4:05:00 -once …. This way it automatically dies after 4 hours and 5 minutes, then 5 minutes later it starts up again.
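
Roughly, the current setup looks like this (the schedule and script path below are illustrative rather than the real ones; the two jsub flags are the actual mechanism):

# Illustrative crontab entry on the grid engine: -once makes extra cron firings a no-op
# while a copy of the job is already running, and -l h_rt=4:05:00 hard-kills it after
# 4h05m, so the next cron tick brings the bot back up a few minutes later.
*/5 * * * * jsub -once -l h_rt=4:05:00 $HOME/copypatrol/run_bot.sh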

This feature appears to be missing in the new Toolforge jobs framework. When asking on IRC, I was informed that Kubernetes has an option for this (.spec.activeDeadlineSeconds) which could possibly be exposed in the wrapper API.

Additionally, we need something similar to the -once flag, but apparently the .spec.parallelism option is set to 1 by default, so this may effectively already be the case with the Toolforge jobs framework. However, some may wish to increase this number or have the concurrency policy changed, so that may be worth exposing in the wrapper API as well.
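
For reference, the underlying fields already exist on the CronJob objects the framework generates. A manual patch along these lines (the job name is hypothetical, and the framework would presumably overwrite it on the next jobs.yaml load) shows roughly what the wrapper API would need to set:

# Hypothetical sketch only, not a supported workaround: patch the generated CronJob directly.
# 14700 seconds = 4h05m, mirroring the old h_rt value.
kubectl patch cronjob mybot --type merge -p '{
  "spec": {
    "concurrencyPolicy": "Forbid",
    "jobTemplate": {
      "spec": {
        "parallelism": 1,
        "activeDeadlineSeconds": 14700
      }
    }
  }
}'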

Event Timeline

bd808 changed the subtype of this task from "Task" to "Feature Request".

What if you used a bash wrapper script that invoked your real command using timeout (see https://manpages.debian.org/buster/coreutils/timeout.1.en.html)?

The one thing I didn't verify is whether toolforge-jobs will indefinitely restart a pod that fails/exits, or whether there's a max retry limit.

If it retries, I'm guessing that means it's another N hours before it auto-kills again. I suppose I can just cut the timeout in half to compensate. Alternatively, we might be able to have a separate job that runs toolforge-jobs delete [job name]?

I think it would be simpler to manage to handle retries inside your bash wrapper script. So inside a loop, it invokes the process with timeout, catches the eventual timeout, then loops, invoking the process again, etc. all without k8s/toolforge-jobs ever thinking that the job failed or exited.
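
As a rough sketch (the command name and timings below are placeholders), the loop can check timeout's exit status, which is 124 when the time limit was hit, to tell a timeout apart from a real crash:

# Placeholder wrapper: restart the real command forever, whether it timed out or crashed.
while true; do
    timeout 4h ./run-real-bot.sh
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hit the 4h limit, restarting" >&2
    else
        echo "exited with status $status, restarting anyway" >&2
    fi
    sleep 60
done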


So I attempted this and it definitely helped. The bot ran without issue for about two weeks, auto-restarting every 4 hours. Then over this past weekend it decided to die for good. The strange thing is that it still showed up as "Running" when I ran toolforge-jobs list. I reloaded my jobs.yaml file and all is well now, but I'm wondering if there's something I did wrong or could improve in my wrapper script:

export PATH="/data/project/musikbot/.gem/ruby/2.7.0/bin:$PATH"
cd "$HOME/MusikBot/tasks"
while true
do
timeout 4h ruby abusefilter_irc.rb --no-api
sleep 45
done

Does anything look obviously wrong here? The sleep 45, by the way, is to give the bot time to log out of the IRC server. Before, when it auto-restarted immediately, there would be a brief period when MusikBot was still logged in (and about to log out), so the restarted bot showed up as MusikBot_.

The abusefilter-irc.err file did not have any meaningful output, nor did the .out file. I did, however, receive an email when I reloaded my jobs config, and it reported exit code 137, which I believe indicates the k8s pod ran out of memory. I'm already giving it 1 GB of RAM. It's possible it did run out of memory, but that's what the auto-restarts are supposed to fix... So my guess is the wrapper script (which is still invoked with toolforge-jobs) keeps the process around and doesn't free up the RAM that was used, or something? Also, mind you, I didn't receive the email until I force-killed the job with toolforge-jobs load jobs.yaml.

Any ideas? This is still better than nothing, as the bot definitely ran longer than it did before I added the wrapper script.

It's possible it did run out of memory, but that's what the auto-restarts are supposed to fix...

The bash script is running within the pod, so when the pod runs out of memory, the bash script dies as well. But such an OOM kill may not cause the pod to be marked as failed (see T246523), which is why it still shows up as running.

I suggest implementing health checks and automated restarts as a separate cronjob. (But as of now such a cronjob can't be run on k8s, as the container images don't have toolforge-jobs; I opened T315729 regarding this.)
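
A rough sketch of what that watchdog could look like (the job name and the 5-hour threshold below are made up, and per T315729 it would currently have to run somewhere that has both kubectl and the jobs CLI available):

# Hypothetical watchdog: restart the job if its pod has been running for over 5 hours.
JOB=abusefilter-irc   # placeholder job name
MAX_AGE=$((5 * 3600))
pod=$(kubectl get pods --no-headers 2>/dev/null | awk -v j="$JOB" '$1 ~ ("^" j "-") {print $1; exit}')
if [ -n "$pod" ]; then
    started=$(kubectl get pod "$pod" -o jsonpath='{.status.startTime}')
    age=$(( $(date +%s) - $(date -d "$started" +%s) ))
    if [ "$age" -gt "$MAX_AGE" ]; then
        toolforge jobs restart "$JOB"
    fi
fi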

[01:06]  <  legoktm> bd808: I think gitlab-account-approval is stuck?
[01:06]  <  legoktm> the container has been running for 24h
[01:06]  <    bd808> legoktm: ugh. I'll take a look.
[01:07]  <  legoktm> ty, I'm trying to help Phi22ipus (sitting right next to them), I added them to #trusted-contributors on phab
[01:08]  <    bd808> The last thing it logged was "2024-01-04T00:48:11Z glaab.utils INFO: Checking ayazsheikh", so yeah it lost it's mind.
[01:09]  <    bd808> I just killed to pod. Let me see if I remember how to run the verbose mode...
[01:13]  <   wm-bot> !log tools.gitlab-account-approval <bd808> Killed stuck pod running for >24h after ping by legoktm on irc

Being able to put a time limit on this particular job probably would have fixed things much sooner than 24h.

We are working on adding health checks to jobs (T335592: [jobs-api,jobs-cli] Support job health checks) and on accessing the APIs from the containers themselves (T321919: Figure out and document how to call the Kubernetes API as your tool user from inside a pod). I'll add the health-check task as a dependency, as that's all that's needed to force the restart of the pod.

dcaro changed the task status from Open to Stalled. Mar 5 2024, 1:46 PM
dcaro triaged this task as Medium priority.
dcaro edited projects, added Toolforge; removed Toolforge Jobs framework.
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.
dcaro renamed this task from Allow Toolforge scheduled jobs to have a maximum runtime to [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime. Mar 5 2024, 3:54 PM

We now have a --health-check-script argument that allows you to provide a custom health script that Kubernetes uses to decide when your workload has become unhealthy so it can be restarted. This only works for continuous jobs.

That's not particularly useful for this task about CronJobs.


It can be converted to a ContinuousJob with the health check script, which has the side advantage that it will be killed only when it's actually unhealthy, not just after a fixed timeout.

For example, something like this will restart the pod when there's no myprocess running (e.g. killed by OOM):

toolforge jobs run --continuous --command "myprocess" --health-check-script 'sh -c "pgrep myprocess"' myjob

I don't see how I could use a continuous job to update a wiki page once a week.


I don't have a full picture of what is needed here, and I don't know what the job actually does.
My suggestion of a continuous job was based on the assumption that its schedule is "as soon as possible" rather than weekly (which wasn't mentioned until now).

So let me try to clarify for myself, and correct me/elaborate if I'm wrong:

  • There's a script that updates a wiki page and needs to run once a week (thanks @taavi)
  • When that script runs, sometimes it gets stuck and needs to be stopped and restarted
    • It failed once with OOM, and k8s did not retry because it sees the pod as running
    • Any other ways it gets stuck? (asking to get ideas on how to detect that it's stuck)
  • Every time it restarts, it progresses a little bit (I'm guessing this)
  • Eventually it succeeds and then it should wait until next week
    • From the moment it succeeds or from the moment it finishes?

I did a quick test trying to reproduce the OOM issue, and I was unable to get a 'running' pod after it was killed by OOM :/. I think the script @MusikAnimal uses should work OK; this is what I see:

Running out of memory from within a script:

tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs run --retry 5 --mem 100Mi --command 'echo starting; sleep 10; echo about to blow; python3 -c "import time; bytearray(512000000); time.sleep(10);"; echo finished; sleep 10; echo refinished' --schedule '0 0 * * *' --image python3.11 memtests

tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs restart memtests

tools.wm-lol@tools-sgebastion-10:~$ tail -f memtests.*
==> memtests.err <==

==> memtests.out <==
starting
about to blow
finished

==> memtests.err <==
Killed

==> memtests.out <==
refinished


tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs show memtests
...
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Status:       | Waiting for scheduled time                                                                                                                             |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Hints:        | Last run at 2024-04-05T13:45:53Z. Pod in 'Succeeded' phase. State                                                                                      |
|               | 'terminated'. Reason 'OOMKilled'. Started at '2024-04-05T13:45:55Z'.                                                                                   |
|               | Finished at '2024-04-05T13:46:15Z'. Exit code '0'.                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+

This makes the Python process get killed (OOM), but the wrapper script continues running, and since the script's final exit code is 0 the job will not retry.

Passing --retry 5 (already there) and using set -o errexit, so that the script exits with a return code != 0, makes it retry 5 times before giving up and stops the script execution at the failing command, as expected:

tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs run --retry 5 --mem 100Mi --command 'set -o errexit; echo starting; sleep 10; echo about to blow; python3 -c "import time; bytearray(512000000); time.sleep(10);"; echo finished; sleep 10; echo refinished' --schedule '0 0 * * *' --image python3.11 memtests
...
tools.wm-lol@tools-sgebastion-10:~$ tail -f memtests.*
==> memtests.err <==

==> memtests.out <==
starting
about to blow

==> memtests.err <==
Killed
Killed

==> memtests.out <==
starting
about to blow

==> memtests.err <==
Killed

==> memtests.out <==
starting
about to blow
starting

==> memtests.err <==
Killed

==> memtests.out <==
about to blow

==> memtests.err <==
Killed

==> memtests.out <==
starting
about to blow
starting

==> memtests.err <==
Killed

==> memtests.out <==
about to blow
...

To debug further, it would help a lot to get my hands on one of those stuck pods; if you get one and it does not block you, can you ping me?
The output of toolforge jobs show <thatjob> should help a bit too.

It might have been something else that got the process stuck (e.g. NFS could be a culprit, as it would get the process stuck but not kill it).

@dcaro I have a stuck pod bot-monitor-28525880-5gm4t in my tool account sdzerobot. The job generally takes only a couple of minutes, but today I saw the pod has been stuck for 11 days for no explicable reason (nothing relevant in the out or err logs). Grafana shows just 600 KB of memory usage, so it doesn't seem to be due to OOM.

It's a scheduled job supposed to run hourly, but because of the stuck pod, all hourly runs for the past 11 days have been skipped with JobAlreadyActive: Not starting job because prior execution is running and concurrency policy is Forbid. That is why being able to set k8s' activeDeadlineSeconds via the jobs framework would really help.


Awesome, looking

Note, though, that activeDeadlineSeconds is a bit special: it takes into account the whole execution of the job, and the job will not retry again once that limit is reached. So we would probably not want to use that option but something different. From the docs:

Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds, even if the backoffLimit is not yet reached.

It seems that the issue is related to the NFS server going away and leaving stuck processes in the k8s worker, looking


I had to restart the worker as it was not able to start a shell; it seems the worker had been misbehaving for a while :/

It seems it might have started last month:

root@tools-k8s-worker-nfs-56:~# journalctl --boot -1 | grep tools-nfs
Mar 04 18:29:19 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Mar 07 00:00:28 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK

:/, we should start automating restarts/alerting on that

The use case mentioned in the task (CopyPatrol) will soon no longer be an issue, as we're moving to Cloud VPS. I still have a similar situation for one of the musikbot tasks, though, which is a continuous job, so --health-check-script may be helpful there. I'll give it a try.

The other use cases mentioned here (hypothetical or not) are still valid, such as a job that runs once a week.