
[jobs-api,jobs-cli] Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`)
Open, High, Public, Feature

Description

Forked from discussion in T377420: [jobs-api,jobs-cli] Introduce a way to stop stuck cronjobs

I wonder if adding support for declaring `concurrencyPolicy: Replace` for a scheduled job would also be helpful? Something like `toolforge jobs run --image foo --command bar --schedule '*/5 * * * *' --replace job-that-should-be-killed-if-still-running-when-the-next-schedule-fires` could set up a CronJob instance that Kubernetes force-kills if a stale copy of the job is still active when the next scheduled run is due to start. Toolhub uses this Kubernetes behavior in its production deployment as a workaround for a non-terminating sidecar container in a CronJob.

I'm reading https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#concurrency-policy and yes, this seems interesting. We could actually support both things (health check and concurrencyPolicy). Maybe we could explore that in a different ticket?
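For reference, a minimal sketch of what the generated CronJob could look like, written as a plain Python dict. The `build_cronjob` helper and its `replace` parameter are hypothetical and not part of jobs-api; only the manifest fields (`schedule`, `concurrencyPolicy`, `jobTemplate`) come from the Kubernetes CronJob API. Falling back to `Forbid` when `replace` is off is an assumption made for illustration.

```python
def build_cronjob(name, image, command, schedule, replace=False):
    """Return a Kubernetes CronJob manifest as a dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # "Replace": cancel the still-running job, then start the new one.
            # "Forbid": skip the new run while the previous one is still active.
            # Using "Forbid" as the fallback is an assumption of this sketch.
            "concurrencyPolicy": "Replace" if replace else "Forbid",
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    "command": ["/bin/sh", "-c", command],
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


manifest = build_cronjob("my-job", "foo", "bar", "*/5 * * * *", replace=True)
print(manifest["spec"]["concurrencyPolicy"])
```

With `replace=True` the only difference in the manifest is the `concurrencyPolicy` value, so the change on the Kubernetes side would be small; the open question is how to expose it.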

Event Timeline

Restricted Application added a subscriber: Aklapper.

Added some exploration in T379130: Approval job can get stuck and prevent subsequent jobs from firing. It seems that for simple use cases (and NFS issues), this will help more than liveness probes or wrapping your command with the timeout command.

I would set this unconditionally, document the behavior, and let the users deal with their code or schedule not finishing in time for the next run.

We can always make this optional in the future, by introducing --replace or similar to the CLI, if we get reports of the default behavior being undesirable.

Another approach would be to 'guess' the right value with some heuristic, for example:

  • for jobs scheduled more than once a day, use replace by default
  • for all other jobs, don't
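A rough sketch of how such a heuristic could be written, assuming a standard five-field cron expression. The function names are hypothetical, and treating "don't" as `Forbid` is an assumption. The check is deliberately simplistic: a schedule counts as firing more than once a day whenever its minute or hour field is anything other than a single number, so macros such as `@daily` and degenerate ranges are not handled.

```python
def runs_more_than_once_a_day(schedule: str) -> bool:
    """Guess whether a five-field cron expression fires more than once a day."""
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {schedule!r}")
    minute, hour = fields[0], fields[1]
    # A single fixed minute and a single fixed hour means at most one run per day.
    return not (minute.isdigit() and hour.isdigit())


def guess_concurrency_policy(schedule: str) -> str:
    """Map the heuristic onto a Kubernetes concurrencyPolicy value."""
    return "Replace" if runs_more_than_once_a_day(schedule) else "Forbid"


for example in ("*/5 * * * *", "0 */6 * * *", "0 3 * * *", "30 2 * * 1"):
    print(example, "->", guess_concurrency_policy(example))
```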

My goal with this suggestion is to avoid introducing new command line/API options if we can; otherwise we will end up exposing the whole k8s API.

I find this feature very common in cron-like systems (just search for how to avoid cron overlapping).

We might be able to work around it with the --timeout option, which also sounds pretty generic. It does give more control, though: the timeout has to be passed manually, instead of being whenever the cron runs again. It's still adding an option to the CLI, though. Wdyt?

dcaro triaged this task as High priority. Nov 7 2024, 4:45 PM
dcaro moved this task from Backlog to Ready to be worked on on the Toolforge board.

For the concurrency configuration, I'm thinking of something a bit more explicit than replace. There are usually three behaviors you want from a cron-like system when the new schedule triggers:

  • stop the old run, and start the new
  • do nothing to the old, and start the new (overlapping the runs)
  • keep the old run and don't start the new one (letting the old finish)

For that, instead of having a single --replace option, I would have something along the lines of --on-overlap={stop-old,dont-start-new,run-both} (or whatever; k8s uses replace, forbid and allow, though I'm not sure those would be clearer, naming is hard).
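As an illustration, the proposed values line up one-to-one with the Kubernetes `concurrencyPolicy` values. The sketch below uses `argparse` with the option name and values suggested above; the parser itself, the `ON_OVERLAP_TO_K8S` name, and the choice of `dont-start-new` as the default are assumptions, not the actual jobs-cli code.

```python
import argparse

# Proposed CLI value -> Kubernetes CronJob concurrencyPolicy.
ON_OVERLAP_TO_K8S = {
    "stop-old": "Replace",       # stop the old run, start the new one
    "dont-start-new": "Forbid",  # let the old run finish, skip the new one
    "run-both": "Allow",         # leave the old run alone, start the new one too
}


def parse_args(argv):
    """Parse a hypothetical --on-overlap option."""
    parser = argparse.ArgumentParser(prog="jobs-sketch")
    parser.add_argument(
        "--on-overlap",
        choices=sorted(ON_OVERLAP_TO_K8S),
        default="dont-start-new",  # assumed default for this sketch
        help="what to do when a run is still active at the next schedule",
    )
    return parser.parse_args(argv)


args = parse_args(["--on-overlap=stop-old"])
print(ON_OVERLAP_TO_K8S[args.on_overlap])
```

Keeping the mapping in a single table means a value can be left out of `choices` until someone asks for it, without touching the rest of the code.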

Maybe start by implementing only the ones we currently have requests for, that is, stop-old and dont-start-new (the current default), and if/whenever we get requests for the third one, allowing concurrent runs, add it too (or any other concurrency pattern, like N-jobs only).

I would set this unconditionally, document the behavior, and let the users deal with their code or schedule not finishing in time for the next run.

Please don't. I already have long jobs that get restarted by k8s maintenance/failures that should continue to run to completion and not restart at the next scheduled time.

We can always make this optional in the future, by introducing --replace or similar to the CLI, if we get reports of the default behavior being undesirable.

Please do.

Another approach would be to 'guess' the right value with some heuristic, for example:

  • for jobs scheduled more than once a day, use replace by default
  • for all other jobs, don't

Please don't. I have jobs that run multiple times a day that should continue running to completion and not restart at the next schedule.

dcaro renamed this task from Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`) to [jobs-api,jobs-cli] Add support for replacing a running scheduled job when an overlapping schedule fires (`concurrencyPolicy: Replace`). Nov 7 2024, 5:35 PM