
[jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime
Open, Stalled, Medium · Public · Feature Request

Description

Some bots on Toolforge, such as eranbot (which powers CopyPatrol), run for a long time and can get "stuck". Our solution, albeit admittedly hacky, has been to set a maximum run time on the cron job with jsub … -l h_rt=4:05:00 -once …. This way it automatically dies after 4 hours and 5 minutes, then 5 minutes later it starts up again.
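
Roughly, the current setup looks like this (the schedule and script path below are illustrative rather than the real ones; the two jsub flags are the actual mechanism):

# Illustrative crontab entry on the grid engine: -once makes extra cron firings a no-op
# while a copy of the job is already running, and -l h_rt=4:05:00 hard-kills it after
# 4h05m, so the next cron tick brings the bot back up a few minutes later.
*/5 * * * * jsub -once -l h_rt=4:05:00 $HOME/copypatrol/run_bot.sh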

This feature appears to be missing in the new Toolforge jobs framework. When asking on IRC, I was informed that Kubernetes has an option for this (.spec.activeDeadlineSeconds) which could possibly be exposed in the wrapper API.

Additionally, we need something similar to the -once flag, but apparently the .spec.parallelism option is set to 1 by default, so this may effectively already be the case with the Toolforge jobs framework. However, some may wish to increase this number or have the concurrency policy changed, so that may be worth exposing in the wrapper API as well.
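
For reference, the underlying fields already exist on the CronJob objects the framework generates. A manual patch along these lines (the job name is hypothetical, and the framework would presumably overwrite it on the next jobs.yaml load) shows roughly what the wrapper API would need to set:

# Hypothetical sketch only, not a supported workaround: patch the generated CronJob directly.
# 14700 seconds = 4h05m, mirroring the old h_rt value.
kubectl patch cronjob mybot --type merge -p '{
  "spec": {
    "concurrencyPolicy": "Forbid",
    "jobTemplate": {
      "spec": {
        "parallelism": 1,
        "activeDeadlineSeconds": 14700
      }
    }
  }
}'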

Event Timeline

bd808 changed the subtype of this task from "Task" to "Feature Request".

What if you used a bash wrapper script that invoked your real command using timeout (see https://manpages.debian.org/buster/coreutils/timeout.1.en.html)?

The one thing I didn't verify is whether toolforge-jobs will indefinitely restart a pod that fails/exits, or whether there's a max retry limit.

If it retries, I'm guessing that means it's another N hours before it auto-kills again. I suppose I can just cut the timeout in half to compensate. Alternatively, we might be able to have a separate job that runs toolforge-jobs delete [job name]?

I think it would be simpler to manage to handle retries inside your bash wrapper script. So inside a loop, it invokes the process with timeout, catches the eventual timeout, then loops, invoking the process again, etc. all without k8s/toolforge-jobs ever thinking that the job failed or exited.
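
As a rough sketch (the command name and timings below are placeholders), the loop can check timeout's exit status, which is 124 when the time limit was hit, to tell a timeout apart from a real crash:

# Placeholder wrapper: restart the real command forever, whether it timed out or crashed.
while true; do
    timeout 4h ./run-real-bot.sh
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hit the 4h limit, restarting" >&2
    else
        echo "exited with status $status, restarting anyway" >&2
    fi
    sleep 60
done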


So I attempted this and it definitely helped. The bot ran without issue for about two weeks, auto-restarting every 4 hours. Then over this past weekend it decided to die for good. The strange thing is that it still showed up as "Running" when I ran toolforge-jobs list. I reloaded my jobs.yaml file and all is well now, but I'm wondering if there's something I did wrong or could improve in my wrapper script:

export PATH="/data/project/musikbot/.gem/ruby/2.7.0/bin:$PATH"
cd "$HOME/MusikBot/tasks"
while true
do
timeout 4h ruby abusefilter_irc.rb --no-api
sleep 45
done

Does anything look obviously wrong here? The sleep 45, by the way, is to give the bot time to log out of the IRC server. Before, when it auto-restarted immediately, there would be a brief period when MusikBot was still logged in (and about to log out), so the restarted bot showed up as MusikBot_.

The abusefilter-irc.err file did not have any meaningful output, nor did the .out file. I did, however, receive an email when I reloaded my jobs config, and it reported exit code 137, which I believe indicates the k8s pod ran out of memory. I'm already giving it 1 GB of RAM. It's possible it did run out of memory, but that's what the auto-restarts are supposed to fix... So my guess is the wrapper script (which is still invoked with toolforge-jobs) keeps the process around and doesn't free up the RAM that was used, or something? Also, mind you, I didn't receive the email until I force-killed the job with toolforge-jobs load jobs.yaml.

Any ideas? This is still better than nothing, as the bot definitely ran longer than it did before I added the wrapper script.

It's possible it did run out of memory, but that's what the auto-restarts are supposed to fix...

The bash script is running within the pod, so when the pod runs out of memory, the bash script dies as well. But such an OOM kill may not cause the pod to be marked as failed (see T246523), which is why it still shows up as running.

I suggest implementing health checks and automated restarts as a separate cronjob. (But as of now such a cronjob can't be run on k8s, as the container images don't have toolforge-jobs; I opened T315729 regarding this.)
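
A rough sketch of what that watchdog could look like (the job name and the 5-hour threshold below are made up, and per T315729 it would currently have to run somewhere that has both kubectl and the jobs CLI available):

# Hypothetical watchdog: restart the job if its pod has been running for over 5 hours.
JOB=abusefilter-irc   # placeholder job name
MAX_AGE=$((5 * 3600))
pod=$(kubectl get pods --no-headers 2>/dev/null | awk -v j="$JOB" '$1 ~ ("^" j "-") {print $1; exit}')
if [ -n "$pod" ]; then
    started=$(kubectl get pod "$pod" -o jsonpath='{.status.startTime}')
    age=$(( $(date +%s) - $(date -d "$started" +%s) ))
    if [ "$age" -gt "$MAX_AGE" ]; then
        toolforge jobs restart "$JOB"
    fi
fi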

[01:06]  <  legoktm> bd808: I think gitlab-account-approval is stuck?
[01:06]  <  legoktm> the container has been running for 24h
[01:06]  <    bd808> legoktm: ugh. I'll take a look.
[01:07]  <  legoktm> ty, I'm trying to help Phi22ipus (sitting right next to them), I added them to #trusted-contributors on phab
[01:08]  <    bd808> The last thing it logged was "2024-01-04T00:48:11Z glaab.utils INFO: Checking ayazsheikh", so yeah it lost it's mind.
[01:09]  <    bd808> I just killed to pod. Let me see if I remember how to run the verbose mode...
[01:13]  <   wm-bot> !log tools.gitlab-account-approval <bd808> Killed stuck pod running for >24h after ping by legoktm on irc

Being able to put a time limit on this particular job probably would have fixed things much sooner than 24h.

We are working on adding health checks to jobs (T335592: [jobs-api,jobs-cli] Support job health checks) and on accessing the APIs from the containers themselves (T321919: Figure out and document how to call the Kubernetes API as your tool user from inside a pod). I'll add the health-check task as a dependency, as that's all that's needed to force the restart of the pod.

dcaro changed the task status from Open to Stalled. Mar 5 2024, 1:46 PM
dcaro triaged this task as Medium priority.
dcaro edited projects, added Toolforge; removed Toolforge Jobs framework.
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.
dcaro renamed this task from Allow Toolforge scheduled jobs to have a maximum runtime to [jobs-api] Allow Toolforge scheduled jobs to have a maximum runtime. Mar 5 2024, 3:54 PM

We now have a --health-check-script argument that allows you to provide a custom health script that Kubernetes uses to decide when your workload has become unhealthy so it can be restarted. This only works for continuous jobs.

That's not particularly useful for this task about CronJobs.


It can be converted to a ContinuousJob with the health check script, which has the side advantage that it will be killed only when it's actually unhealthy, not just after a fixed timeout.

For example, something like this will restart the pod when there's no myprocess running (e.g. killed by OOM):

toolforge jobs run --continuous --command "myprocess" --health-check-script 'sh -c "pgrep myprocess"' myjob

I don't see how I could use a continuous job to update a wiki page once a week.


I don't have a full picture of what is needed here, and I don't know what the job actually does.
My suggestion of a continuous job was based on the assumption that its schedule is "as soon as possible" rather than weekly (which wasn't mentioned until now).

So let me try to clarify for myself, and correct me/elaborate if I'm wrong:

  • There's a script that updates a wiki page and needs to run once a week (thanks @taavi)
  • When that script runs, sometimes it gets stuck and needs to be stopped and restarted
    • It failed once with OOM, and k8s did not retry because it sees the pod as running
    • Any other ways it gets stuck? (asking to get ideas on how to detect that it's stuck)
  • Every time it restarts, it progresses a little bit (I'm guessing this)
  • Eventually it succeeds and then it should wait until next week
    • From the moment it succeeds or from the moment it finishes?

I did a quick test trying to reproduce the OOM issue, and I was unable to get a 'running' pod after it was killed by OOM :/. I think the script @MusikAnimal uses should work OK; this is what I see:

Running out of memory from within a script:

tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs run --retry 5 --mem 100Mi --command 'echo starting; sleep 10; echo about to blow; python3 -c "import time; bytearray(512000000); time.sleep(10);"; echo finished; sleep 10; echo refinished' --schedule '0 0 * * *' --image python3.11 memtests

tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs restart memtests

tools.wm-lol@tools-sgebastion-10:~$ tail -f memtests.*
==> memtests.err <==

==> memtests.out <==
starting
about to blow
finished

==> memtests.err <==
Killed

==> memtests.out <==
refinished


tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs show memtests
...
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Status:       | Waiting for scheduled time                                                                                                                             |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| Hints:        | Last run at 2024-04-05T13:45:53Z. Pod in 'Succeeded' phase. State                                                                                      |
|               | 'terminated'. Reason 'OOMKilled'. Started at '2024-04-05T13:45:55Z'.                                                                                   |
|               | Finished at '2024-04-05T13:46:15Z'. Exit code '0'.                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+

This makes the Python process get killed (OOM), but the wrapper script continues running, and since the script's final exit code is 0 the job will not retry.

Passing --retry 5 (already there) and using set -o errexit, so that the script exits with a return code != 0, makes it retry 5 times before giving up and stops the script execution at the failing command, as expected:

tools.wm-lol@tools-sgebastion-10:~$ toolforge jobs run --retry 5 --mem 100Mi --command 'set -o errexit; echo starting; sleep 10; echo about to blow; python3 -c "import time; bytearray(512000000); time.sleep(10);"; echo finished; sleep 10; echo refinished' --schedule '0 0 * * *' --image python3.11 memtests
...
tools.wm-lol@tools-sgebastion-10:~$ tail -f memtests.*
==> memtests.err <==

==> memtests.out <==
starting
about to blow

==> memtests.err <==
Killed
Killed

==> memtests.out <==
starting
about to blow

==> memtests.err <==
Killed

==> memtests.out <==
starting
about to blow
starting

==> memtests.err <==
Killed

==> memtests.out <==
about to blow

==> memtests.err <==
Killed

==> memtests.out <==
starting
about to blow
starting

==> memtests.err <==
Killed

==> memtests.out <==
about to blow
...

To debug further, it would help a lot to get my hands on one of those stuck pods; if you get one and it does not block you, can you ping me?
The output of toolforge jobs show <thatjob> should help a bit too.

It might have been something else that got the process stuck (e.g. NFS could be a culprit, as it would get the process stuck but not kill it).

@dcaro I have a stuck pod bot-monitor-28525880-5gm4t in my tool account sdzerobot. The job generally takes only a couple of minutes, but today I saw the pod has been stuck for 11 days for no explicable reason (nothing relevant in the out or err logs). Grafana shows just 600 KB of memory usage, so it doesn't seem to be due to OOM.

It's a scheduled job supposed to run hourly, but because of the stuck pod, all hourly runs for the past 11 days have been skipped with JobAlreadyActive: Not starting job because prior execution is running and concurrency policy is Forbid. That is why being able to set k8s' activeDeadlineSeconds via the jobs framework would really help.


Awesome, looking

Note, though, that activeDeadlineSeconds is a bit special: it takes into account the whole execution of the job, and the job will not retry again once that limit is reached. So we would probably not want to use that option but something different. From the docs:

Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit. Therefore, a Job that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds, even if the backoffLimit is not yet reached.

It seems that the issue is related to the NFS server going away and leaving stuck processes in the k8s worker, looking


I had to restart the worker as it was not able to start a shell; it seems the worker had been misbehaving for a while :/

It seems it might have started last month:

root@tools-k8s-worker-nfs-56:~# journalctl --boot -1 | grep tools-nfs
Mar 04 18:29:19 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud not responding, still trying
Mar 07 00:00:28 tools-k8s-worker-nfs-56 kernel: nfs: server tools-nfs.svc.tools.eqiad1.wikimedia.cloud OK

:/, we should start automating restarts/alerting on that

The use case mentioned in the task (CopyPatrol) will soon no longer be an issue, as we're moving to Cloud VPS. I still have a similar situation for one of the musikbot tasks, though, which is a continuous job, so --health-check-script may be helpful there. I'll give it a try.

The other use cases mentioned here (hypothetical or not) are still valid, such as a job that runs once a week.