
[jobs-api,jobs-cli] Introduce health checks for Toolforge Jobs Framework cronjobs
Open, MediumPublic

Description

I'm a maintainer of the urbanecmbot tool at Toolforge. On occasion, one of my jobs stalls: the pod becomes unresponsive and the job stops doing anything useful. Unfortunately, the framework remains convinced that the job is running, even though it is not, and as a result it is never rescheduled. This continues until I notice (or am notified) that the job is down and restart it manually.

An example from today:

tools.urbanecmbot@tools-bastion-13 ~ 
$ toolforge-jobs show afd-announcer
+---------------+-----------------------------------------------------------------------------------------+
| Job name:     | afd-announcer                                                                           |
+---------------+-----------------------------------------------------------------------------------------+
| Command:      | ~/bin/oznamovatelbot /data/project/urbanecmbot/11bots/cswiki/userbots/announcers/afd.py |
+---------------+-----------------------------------------------------------------------------------------+
| Job type:     | schedule: */5 * * * *                                                                   |
+---------------+-----------------------------------------------------------------------------------------+
| Image:        | python3.9                                                                               |
+---------------+-----------------------------------------------------------------------------------------+
| Port:         | none                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| File log:     | yes                                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Output log:   | /data/project/urbanecmbot/afd-announcer.out                                             |
+---------------+-----------------------------------------------------------------------------------------+
| Error log:    | /data/project/urbanecmbot/afd-announcer.err                                             |
+---------------+-----------------------------------------------------------------------------------------+
| Emails:       | onfailure                                                                               |
+---------------+-----------------------------------------------------------------------------------------+
| Resources:    | default                                                                                 |
+---------------+-----------------------------------------------------------------------------------------+
| Replicas:     | 1                                                                                       |
+---------------+-----------------------------------------------------------------------------------------+
| Mounts:       | all                                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Retry:        | no                                                                                      |
+---------------+-----------------------------------------------------------------------------------------+
| Health check: | none                                                                                    |
+---------------+-----------------------------------------------------------------------------------------+
| Status:       | Running for 2d6h55m                                                                     |
+---------------+-----------------------------------------------------------------------------------------+
| Hints:        | Last run at 2024-10-15T02:20:06Z. Pod in 'Running' phase. State                         |
|               | 'running'. Started at '2024-10-15T02:20:07Z'.                                           |
+---------------+-----------------------------------------------------------------------------------------+
tools.urbanecmbot@tools-bastion-13 ~ 
$

I tried execing into the container via kubectl, and the command hung forever.

Can we add a timeout or another form of health check to the framework?
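For reference, Kubernetes already exposes this capability at the pod level via a liveness probe, which the jobs framework could conceivably attach to the pods it creates. A minimal sketch of what such a probe might look like on a generated container spec; the probe command, heartbeat path, and timings below are illustrative assumptions, not anything the framework supports today:

```yaml
# Hypothetical container spec fragment the jobs framework could generate.
# The probe command and thresholds are illustrative assumptions.
containers:
  - name: afd-announcer
    image: python3.9
    livenessProbe:
      exec:
        # A tool-supplied check that exits non-zero when the job is stuck,
        # e.g. because its heartbeat file has not been touched in 10 minutes.
        command: ["/bin/sh", "-c", "find /tmp/heartbeat -mmin -10 | grep -q ."]
      initialDelaySeconds: 60   # give the job time to start up
      periodSeconds: 120        # probe every two minutes
      failureThreshold: 3       # restart after three consecutive failures
```

When the probe fails repeatedly, the kubelet kills and restarts the container in place on the same node.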

Event Timeline

Restricted Application added a subscriber: Aklapper.

Related IRC conversation from -cloud:

10:53 <urbanecm> hey! any idea why a toolforge job would just freeze for hours/days? Is there a way to define a timeout for a job, stopping it when it hits that?
10:55 <urbanecm> related job: https://k8s-status.toolforge.org/namespaces/tool-urbanecmbot/pods/afd-announcer-28815980-p9g27/
11:06 <arturo> urbanecm: mmmm I don't remember if we support healthchecks for cronjobs
11:07 <urbanecm> we probably should, it's not the most convenient thing to have to notice it fails and restart it manually
11:10 <arturo> urbanecm: if you open a phab ticket requesting the feature, I'll make sure it gets attention from the team. I think the change is somewhat simple
11:21 <urbanecm> arturo: sure, sounds good. filled T377420, let me know if you want me to add anything else
11:21 <+stashbot> T377420: Introduce health checks for Toolforge Jobs Framework - https://phabricator.wikimedia.org/T377420
aborrero renamed this task from Introduce health checks for Toolforge Jobs Framework to Introduce health checks for Toolforge Jobs Framework cronjobs.Oct 17 2024, 9:21 AM
aborrero subscribed.
aborrero triaged this task as Medium priority.Oct 17 2024, 9:25 AM
aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

I wonder if adding support for declaring concurrencyPolicy: Replace for a scheduled job would also be helpful? Something like toolforge jobs run --image foo --command bar --schedule '*/5 * * * *' --replace job-that-should-be-killed-if-still-running-when-the-next-schedule-fires could set up a CronJob instance that will be force-killed by Kubernetes if a stale copy of the job is still active when the next scheduled run is due to start. Toolhub uses this Kubernetes behavior as a workaround for a non-terminating sidecar container in a CronJob for its production deployment.

I'm reading https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#concurrency-policy and yes, this seems interesting. We could actually support both things (health checks and the concurrency policy). Maybe we could explore that in a separate ticket?
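For scheduled jobs, concurrencyPolicy sits directly on the CronJob spec. A sketch of the manifest the framework might generate for the hypothetical --replace flag discussed above, using the job from this task as the example; nothing here is implemented yet:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: afd-announcer
spec:
  schedule: "*/5 * * * *"
  # Replace: if the previous run is still active when the next one is due,
  # Kubernetes deletes the old Job and starts a fresh one in its place.
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: python3.9
              command:
                - "/bin/sh"
                - "-c"
                - "~/bin/oznamovatelbot /data/project/urbanecmbot/11bots/cswiki/userbots/announcers/afd.py"
```

The other policy values are Allow (the default, which permits the stuck-job situation described in this task) and Forbid (skip the new run instead of killing the old one).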

Similar case on customizing k8s concurrency: T375366: [jobs-api,jobs-cli] restarting a continuous jobs causes for some seconds two jobs are running side by side

Maybe it's not such a bad idea to add some options to configure concurrency in the different job types. Let's give it a think and see if we can come up with a nice abstraction (I like the --replace for scheduled jobs).

Note that health checks would not force the pod to be reallocated to another worker; they only restart the container in place, so this would not help in the case of NFS getting stuck.
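A plain timeout, by contrast, can be expressed with activeDeadlineSeconds on the Job spec: once the deadline passes, Kubernetes terminates the whole Job, pods included, and the next scheduled run starts on a fresh pod. A sketch of the relevant fragment; the 600-second value is an arbitrary assumption:

```yaml
# Fragment of the Job spec inside a CronJob's jobTemplate.
spec:
  # Terminate the Job (and kill its pods) if it runs longer than 10 minutes.
  # Because the next scheduled run gets a new pod, this can help even when
  # the old pod is wedged, unlike an in-place container restart.
  activeDeadlineSeconds: 600
  backoffLimit: 0          # do not retry within the same Job
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: python3.9
```

Note that a Job killed this way is marked Failed, which would interact with the existing "Emails: onfailure" setting.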

dcaro renamed this task from Introduce health checks for Toolforge Jobs Framework cronjobs to [jobs-api,jobs-cli] Introduce health checks for Toolforge Jobs Framework cronjobs.Nov 7 2024, 5:34 PM