
Investigate startingDeadlineSeconds setting for kubernetes CronJobs
Closed, Resolved (Public)

Description

startingDeadlineSeconds is a setting that defines how long after the scheduled time the controller can still schedule a run of a CronJob (see Deadline for delayed Job start - Kubernetes documentation).

This affects key aspects of running CronJobs:

Setting concurrencyPolicy to Forbid

Without startingDeadlineSeconds, the Kubernetes documentation says the controller counts how many schedules were missed between the last scheduled time and now. A schedule counts as missed when a Job could not be created at its scheduled time, for instance because the previous Job is still running. If more than 100 schedules have been missed, the controller stops trying to start that CronJob and logs an error.
If startingDeadlineSeconds is set, the controller only counts the schedules missed during the last startingDeadlineSeconds seconds, rather than since the last scheduled time.
If the controller can't start the Job within startingDeadlineSeconds it will skip that execution, but I am unsure if it creates a Job and marks it as failed, or just logs a failed execution at the CronJob level (my gut says the second, but we need to verify that).
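
The counting window can be sketched in a few lines of Python. This is a simplified model based on the Kubernetes documentation's description, not the controller's actual code: it assumes a fixed interval instead of a cron expression, and count_missed and its parameters are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_MISSED = 100  # threshold quoted in the Kubernetes documentation


def count_missed(last_schedule: datetime, now: datetime, interval: timedelta,
                 starting_deadline: Optional[timedelta] = None) -> int:
    """Count schedule times that fall inside the window the controller inspects."""
    # Without a deadline the window starts at the last scheduled time;
    # with a deadline it only covers the last `starting_deadline` before now.
    window_start = last_schedule
    if starting_deadline is not None:
        window_start = max(last_schedule, now - starting_deadline)
    missed = 0
    t = last_schedule + interval
    while t <= now:
        if t > window_start:
            missed += 1
        t += interval
    return missed


# Hypothetical every-minute CronJob that could not start a Job for almost two hours.
last = datetime(2025, 5, 19, 8, 29, 0)
now = datetime(2025, 5, 19, 10, 21, 30)
every_minute = timedelta(minutes=1)

no_deadline = count_missed(last, now, every_minute)
with_deadline = count_missed(last, now, every_minute, timedelta(seconds=200))
print(no_deadline, no_deadline > MAX_MISSED)      # 112 True
print(with_deadline, with_deadline > MAX_MISSED)  # 3 False
```

In this model the CronJob without a deadline is over the 100 missed schedules limit, while the one with startingDeadlineSeconds: 200 only sees 3 missed schedules and can start again.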

Suspending CronJobs without setting startingDeadlineSeconds

From Schedule suspension - Kubernetes documentation

Executions that are suspended during their scheduled time count as missed Jobs. When .spec.suspend changes from true to false on an existing CronJob without a starting deadline, the missed Jobs are scheduled immediately.

It is unclear from the documentation if this means it will try to schedule all the Jobs that were missed in the suspension interval, or only the last one. If startingDeadlineSeconds is set, it will presumably have skipped the previous executions, and will only start the Job immediately if within the (last scheduled start time + startingDeadlineSeconds) window.

A related consequence: if a CronJob without startingDeadlineSeconds is suspended for longer than 100 times its schedule interval, it will only restart if the CronJob object is deleted and recreated.

Current situation

Currently, we do not set startingDeadlineSeconds, and the default concurrencyPolicy is Replace.

With these settings, a job overrunning its interval will not complete, and will get replaced by a new instance (see T394018: Link Recommendation Task pool data missing for some wikis for instance).
If the Job doesn't do checkpointing, it will presumably never complete.
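
To illustrate why an overrunning job never completes under Replace, here is a toy simulation in Python. It is a sketch under simplifying assumptions: time advances in whole seconds, the schedule fires every interval seconds, startingDeadlineSeconds is ignored, and completed_runs is a hypothetical helper, not anything from the Kubernetes codebase.

```python
def completed_runs(policy: str, interval: int, duration: int, horizon: int) -> int:
    """Count Jobs that run to completion within `horizon` seconds."""
    running_since = None  # start time of the currently running Job, if any
    completed = 0
    for now in range(horizon + 1):
        # A running Job finishes once it has run for `duration` seconds.
        if running_since is not None and now - running_since >= duration:
            completed += 1
            running_since = None
        if now % interval == 0:  # schedule tick
            if running_since is None:
                running_since = now
            elif policy == "Replace":
                # The running Job is killed and a new one starts from scratch.
                running_since = now
            # With "Forbid" the tick is skipped and the running Job continues.
    return completed


# A 90s job scheduled every 60s, observed for 10 minutes.
print(completed_runs("Replace", interval=60, duration=90, horizon=600))  # 0
print(completed_runs("Forbid", interval=60, duration=90, horizon=600))   # 5
```

With Replace every run is killed at the next tick, so nothing ever completes; with Forbid every other tick is skipped and the job finishes.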

T394409: Add a way to suspend CronJobs allows suspending a job regardless of whether startingDeadlineSeconds is set.

Event Timeline

Clement_Goubert updated the task description.

Confirming that:

  1. Suspending a CronJob through helmfile is an edit, and not a delete/create of the CronJob object
  2. On unsuspend, a Job for the suspended CronJob starts immediately if startingDeadlineSeconds isn't set, and only one Job is started.

Change #1146010 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::periodic_job: add concurrency parameter to k8s jobs

https://gerrit.wikimedia.org/r/1146010

Change #1147709 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add startingDeadlineSeconds to CronJobs

https://gerrit.wikimedia.org/r/1147709

Change #1147710 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_job: add startingdeadlineseconds

https://gerrit.wikimedia.org/r/1147710

Clement_Goubert changed the task status from Open to In Progress. May 19 2025, 10:52 AM
Clement_Goubert triaged this task as High priority.

Change #1147709 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add startingDeadlineSeconds to CronJobs

https://gerrit.wikimedia.org/r/1147709

Mentioned in SAL (#wikimedia-operations) [2025-05-19T11:26:09Z] <cgoubert@deploy1003> Started scap sync-world: 1147709: mediawiki: Add startingDeadlineSeconds to CronJobs - T394423

Mentioned in SAL (#wikimedia-operations) [2025-05-19T11:26:42Z] <cgoubert@deploy1003> cgoubert: 1147709: mediawiki: Add startingDeadlineSeconds to CronJobs - T394423 synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-05-19T11:28:12Z] <cgoubert@deploy1003> Finished scap sync-world: 1147709: mediawiki: Add startingDeadlineSeconds to CronJobs - T394423 (duration: 02m 16s)

Change #1146010 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_job: add concurrency parameter to k8s jobs

https://gerrit.wikimedia.org/r/1146010

Change #1147710 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_job: add startingdeadlineseconds

https://gerrit.wikimedia.org/r/1147710

We should check jobs for completion and the longer ones should also get a proper startingDeadlineSeconds and concurrency so they have a chance to complete.

(Removed the paste as it wasn't correctly identifying non-completing jobs)

Change #1147778 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance: Add ttlsecondsafterfinished to long interval jobs

https://gerrit.wikimedia.org/r/1147778

Change #1147778 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance: Add ttlsecondsafterfinished to long interval jobs

https://gerrit.wikimedia.org/r/1147778

Jobs that need to complete, and must not be restarted if they run longer than their interval, should set concurrency_policy: Forbid and a suitable startingDeadlineSeconds (around half the interval, but this can be fine-tuned).

Jobs that can handle or need to restart should stick to the default settings.
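
The rule of thumb above could be expressed as a small helper. This is only a sketch: cronjob_spec_overrides is a hypothetical function, the dictionary keys are the Kubernetes CronJob spec field names (not the Puppet parameter names), and "half the interval" is the starting point suggested above rather than a fixed rule.

```python
from typing import Any, Dict


def cronjob_spec_overrides(interval_seconds: int, must_complete: bool) -> Dict[str, Any]:
    """Return CronJob spec fields to override for a job with the given interval."""
    if not must_complete:
        # Jobs that can handle or need restarts keep the default settings.
        return {}
    return {
        "concurrencyPolicy": "Forbid",
        # Around half the interval; tune per job.
        "startingDeadlineSeconds": interval_seconds // 2,
    }


hourly = cronjob_spec_overrides(3600, must_complete=True)
print(hourly)  # {'concurrencyPolicy': 'Forbid', 'startingDeadlineSeconds': 1800}
```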