
[jobs-api,jobs-cli] restarting a continuous job causes two jobs to run side by side for a few seconds
Open, In Progress, Low, Public, BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • restart a running continuous job with the following command
  • toolforge jobs restart <jobname>

What happens?:
For a short period of time (~15-30 seconds) the "old" job still runs in parallel with the newly restarted job.

What should have happened instead?:
Only one job with a given name should be running at a time.

I used this small script:

#!/bin/bash
outfile=/data/project/persondata/logs/testout.out
echo "Process started" >>$outfile
while true
do
  date >>$outfile
  uname -a >>$outfile
  sleep 2
done

Then on the command line I started a job like this:
toolforge jobs run --command '~/playground/continuous.sh' --image php8.2 -o --continuous test
half a minute later I restarted the job:
toolforge jobs restart test
and another 30 seconds later I deleted it.

The following could be seen in the file /data/project/persondata/logs/testout.out.
Please note: after the second "Process started" line in the middle of the output, you can see two hostnames interleaved: test-bf66c58b8-7f8br and test-bf66c58b8-xsjl9.

Process started
Mon Sep 23 08:26:34 AM UTC 2024
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:26:36 AM UTC 2024
…
Mon Sep 23 08:26:59 AM UTC 2024
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:01 AM UTC 2024
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Process started
Mon Sep 23 08:27:01 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:03 AM UTC 2024
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:03 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:05 AM UTC 2024
Mon Sep 23 08:27:05 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:07 AM UTC 2024
Mon Sep 23 08:27:07 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:09 AM UTC 2024
Mon Sep 23 08:27:09 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:11 AM UTC 2024
Mon Sep 23 08:27:11 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:13 AM UTC 2024
Mon Sep 23 08:27:13 AM UTC 2024
Linux test-bf66c58b8-xsjl9 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:15 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:17 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
Mon Sep 23 08:27:19 AM UTC 2024
Linux test-bf66c58b8-7f8br 6.1.0-25-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
…

Event Timeline

Kubernetes creates the replacement pod as soon as the old pod enters the Terminating state, without waiting for it to actually disappear.
This is the native (and default) behavior of Kubernetes deployments, see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
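
For illustration, here is a minimal sketch (plain Python dicts, mirroring how the patches below build the Deployment object) of the two strategy stanzas involved; the maxSurge/maxUnavailable values shown are the Kubernetes defaults:

# Sketch only: the spec.strategy stanza of a Deployment, written as plain dicts.
# With the default RollingUpdate strategy Kubernetes may start up to maxSurge
# replacement pods while the old ones are still terminating, which is exactly
# the overlap observed in the log above.
rolling_update_strategy = {
    "type": "RollingUpdate",
    "rollingUpdate": {"maxSurge": "25%", "maxUnavailable": "25%"},  # k8s defaults
}

# Recreate terminates all existing pods before creating new ones, so there is
# never more than one generation of pods running at the same time.
recreate_strategy = {
    "type": "Recreate",
}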

Introducing a different behavior could be as simple as this patch to the toolforge jobs-api:

diff --git a/tjf/runtimes/k8s/jobs.py b/tjf/runtimes/k8s/jobs.py
index fd3e85c..272924d 100644
--- a/tjf/runtimes/k8s/jobs.py
+++ b/tjf/runtimes/k8s/jobs.py
@@ -290,6 +290,12 @@ def _get_k8s_deployment_object(job: Job) -> K8S_OBJECT_TYPE:
             "labels": labels,
         },
         "spec": {
+            # strategy: recreate prevents extra pods from running in parallel when restarting the job
+            # see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
+            "strategy" : {
+                "type": "Recreate",
+            },
             "template": _get_k8s_podtemplate(job=job, restart_policy="Always", probes=probes),
             "replicas": replicas,
             "selector": {

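If that patch were applied, one quick way to confirm the change would be to read the strategy back from the cluster. A minimal sketch with the official kubernetes Python client, assuming the test job from the description and Toolforge's usual tool-<toolname> namespace convention (tool-persondata here is an assumption):

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment(name="test", namespace="tool-persondata")
print(dep.spec.strategy.type)  # expected: "Recreate" once the patch is in place
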
However, there are certain use cases in which it is desirable that a pod for a given deployment is always running, especially for jobs sitting behind some kind of service load balancer; tearing down all pods before starting their replacements could introduce downtime for those services.
So, a patch like the one above may be a regression from the point of view of some users.

Other approaches:

  • we could try to infer the right value for spec.strategy at job creation time: if a continuous job has more than one replica, it should be fine for it to keep spec.strategy: RollingUpdate; if the job has only one replica, always use spec.strategy: Recreate
  • we could "leak" this setting into the toolforge-jobs command line, something like toolforge jobs restart <name> --recreate, which would modify the Deployment's spec.strategy on the fly, setting it to Recreate before killing the running pod of the job and reverting the change afterwards (see the sketch after this list)

I think I like the first option more. I would like to avoid adding extra options to the command line; otherwise we will end up exposing the whole k8s API via config switches, and our Toolforge abstraction would lose its purpose.

aborrero changed the task status from Open to In Progress. Sep 23 2024, 9:18 AM
aborrero triaged this task as Low priority.
aborrero added a project: User-aborrero.
aborrero moved this task from Backlog to Radar/observer on the User-aborrero board.

I think this is the patch I'm proposing:

diff --git a/tjf/runtimes/k8s/jobs.py b/tjf/runtimes/k8s/jobs.py
index fd3e85c..22eff5d 100644
--- a/tjf/runtimes/k8s/jobs.py
+++ b/tjf/runtimes/k8s/jobs.py
@@ -281,6 +281,10 @@ def _get_k8s_deployment_object(job: Job) -> K8S_OBJECT_TYPE:
     probes = get_healthcheck_for_k8s(job.health_check) if job.health_check else {}
     replicas = job.replicas if job.replicas else 1
 
+    # see https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
+    # see T375366
+    strategy = "Recreate" if replicas == 1 else "RollingUpdate"
+
     obj = {
         "apiVersion": K8sJobKind.DEPLOYMENT.api_version,
         "kind": K8sJobKind.DEPLOYMENT.value,
@@ -290,6 +294,9 @@ def _get_k8s_deployment_object(job: Job) -> K8S_OBJECT_TYPE:
             "labels": labels,
         },
         "spec": {
+            "strategy" : {
+                "type": strategy,
+            },
             "template": _get_k8s_podtemplate(job=job, restart_policy="Always", probes=probes),
             "replicas": replicas,
             "selector": {
dcaro renamed this task from restarting a continuous job causes two jobs to run side by side for a few seconds to [jobs-api,jobs-cli] restarting a continuous job causes two jobs to run side by side for a few seconds. Nov 7 2024, 5:35 PM