
[jobs-api,jobs-cli] Introduce health checks for Toolforge Jobs Framework cronjobs
Open, MediumPublic

Description

I'm a maintainer of the urbanecmbot tool at Toolforge. On occasion, one of my jobs stalls: the pod becomes unresponsive and the job stops doing anything useful. Unfortunately, the framework remains convinced that the job is running, even though it is not, and as a result it is never rescheduled. This continues until I notice (or am notified) that the job is down and restart it manually.

An example from today:

tools.urbanecmbot@tools-bastion-13 ~ 
$ toolforge-jobs show afd-announcer
+---------------+-----------------------------------------------------------------------------------------+
| Job name:     | afd-announcer                                                                           |
+---------------+-----------------------------------------------------------------------------------------+
| Command:      | ~/bin/oznamovatelbot /data/project/urbanecmbot/11bots/cswiki/userbots/announcers/afd.py |
+---------------+-----------------------------------------------------------------------------------------+
| Job type:     | schedule: */5 * * * *                                                                   |
+---------------+-----------------------------------------------------------------------------------------+
| Image:        | python3.9                                                                               |
+---------------+-----------------------------------------------------------------------------------------+
| Port:         | none                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| File log:     | yes                                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Output log:   | /data/project/urbanecmbot/afd-announcer.out                                             |
+---------------+-----------------------------------------------------------------------------------------+
| Error log:    | /data/project/urbanecmbot/afd-announcer.err                                             |
+---------------+-----------------------------------------------------------------------------------------+
| Emails:       | onfailure                                                                               |
+---------------+-----------------------------------------------------------------------------------------+
| Resources:    | default                                                                                 |
+---------------+-----------------------------------------------------------------------------------------+
| Replicas:     | 1                                                                                       |
+---------------+-----------------------------------------------------------------------------------------+
| Mounts:       | all                                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Retry:        | no                                                                                      |
+---------------+-----------------------------------------------------------------------------------------+
| Health check: | none                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| Status:       | Running for 2d6h55m                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Hints:        | Last run at 2024-10-15T02:20:06Z. Pod in 'Running' phase. State                         |
|               | 'running'. Started at '2024-10-15T02:20:07Z'.                                           |
+---------------+-----------------------------------------------------------------------------------------+
tools.urbanecmbot@tools-bastion-13 ~ 
$

I tried execing into the container via kubectl, and the command hung forever.

Can we add a timeout or another form of health check to the framework?
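For reference, Kubernetes already exposes this capability at the pod level via a liveness probe, which the jobs framework could conceivably attach to the pods it creates. A minimal sketch of what such a probe might look like on a generated container spec; the probe command, heartbeat path, and timings below are illustrative assumptions, not anything the framework supports today:

```yaml
# Hypothetical container spec fragment the jobs framework could generate.
# The probe command and thresholds are illustrative assumptions.
containers:
  - name: afd-announcer
    image: python3.9
    livenessProbe:
      exec:
        # A tool-supplied check that exits non-zero when the job is stuck,
        # e.g. because its heartbeat file has not been touched in 10 minutes.
        command: ["/bin/sh", "-c", "find /tmp/heartbeat -mmin -10 | grep -q ."]
      initialDelaySeconds: 60   # give the job time to start up
      periodSeconds: 120        # probe every two minutes
      failureThreshold: 3       # restart after three consecutive failures
```

When the probe fails repeatedly, the kubelet kills and restarts the container in place on the same node.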

Event Timeline

Restricted Application added a subscriber: Aklapper.

Related IRC conversation from -cloud:

10:53 <urbanecm> hey! any idea why a toolforge job would just freeze for hours/days? Is there a way to define a timeout for a job, stopping it when it hits that?
10:55 <urbanecm> related job: https://k8s-status.toolforge.org/namespaces/tool-urbanecmbot/pods/afd-announcer-28815980-p9g27/
11:06 <arturo> urbanecm: mmmm I don't remember if we support healthchecks for cronjobs
11:07 <urbanecm> we probably should, it's not the most convenient thing to have to notice it fails and restart it manually
11:10 <arturo> urbanecm: if you open a phab ticket requesting the feature, I'll make sure it gets attention from the team. I think the change is somewhat simple
11:21 <urbanecm> arturo: sure, sounds good. filled T377420, let me know if you want me to add anything else
11:21 <+stashbot> T377420: Introduce health checks for Toolforge Jobs Framework - https://phabricator.wikimedia.org/T377420
aborrero renamed this task from Introduce health checks for Toolforge Jobs Framework to Introduce health checks for Toolforge Jobs Framework cronjobs.Oct 17 2024, 9:21 AM
aborrero subscribed.
aborrero triaged this task as Medium priority.Oct 17 2024, 9:25 AM
aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

I wonder if adding support for declaring concurrencyPolicy: Replace for a scheduled job would also be helpful? Something like toolforge jobs run --image foo --command bar --schedule '*/5 * * * *' --replace job-that-should-be-killed-if-still-running-when-the-next-schedule-fires could set up a CronJob instance that will be force-killed by Kubernetes if a stale copy of the job is still active when the next scheduled run is due to start. Toolhub uses this Kubernetes behavior as a workaround for a non-terminating sidecar container in a CronJob for its production deployment.

I'm reading https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#concurrency-policy and yes, this seems interesting. We could actually support both things (health checks and the concurrency policy). Maybe we could explore that in a separate ticket?
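For scheduled jobs, concurrencyPolicy sits directly on the CronJob spec. A sketch of the manifest the framework might generate for the hypothetical --replace flag discussed above, using the job from this task as the example; nothing here is implemented yet:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: afd-announcer
spec:
  schedule: "*/5 * * * *"
  # Replace: if the previous run is still active when the next one is due,
  # Kubernetes deletes the old Job and starts a fresh one in its place.
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: python3.9
              command:
                - "/bin/sh"
                - "-c"
                - "~/bin/oznamovatelbot /data/project/urbanecmbot/11bots/cswiki/userbots/announcers/afd.py"
```

The other policy values are Allow (the default, which permits the stuck-job situation described in this task) and Forbid (skip the new run instead of killing the old one).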

Similar case on customizing k8s concurrency: T375366: [jobs-api,jobs-cli] restarting a continuous jobs causes for some seconds two jobs are running side by side

Maybe it's not such a bad idea to add some options to configure concurrency in the different job types. Let's give it a think and see if we can come up with a nice abstraction (I like the --replace for scheduled jobs).

Note that health checks would not force the pod to be reallocated to another worker; they only restart the container in place, so this would not help in the case of NFS getting stuck.
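A plain timeout, by contrast, can be expressed with activeDeadlineSeconds on the Job spec: once the deadline passes, Kubernetes terminates the whole Job, pods included, and the next scheduled run starts on a fresh pod. A sketch of the relevant fragment; the 600-second value is an arbitrary assumption:

```yaml
# Fragment of the Job spec inside a CronJob's jobTemplate.
spec:
  # Terminate the Job (and kill its pods) if it runs longer than 10 minutes.
  # Because the next scheduled run gets a new pod, this can help even when
  # the old pod is wedged, unlike an in-place container restart.
  activeDeadlineSeconds: 600
  backoffLimit: 0          # do not retry within the same Job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: python3.9
```

Note that a Job killed this way is marked Failed, which would interact with the existing "Emails: onfailure" setting.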

dcaro renamed this task from Introduce health checks for Toolforge Jobs Framework cronjobs to [jobs-api,jobs-cli] Introduce health checks for Toolforge Jobs Framework cronjobs.Nov 7 2024, 5:34 PM