Page MenuHomePhabricator

When jenkins kills a build due to max execution time the docker containers stay running
Closed, ResolvedPublic


Initially described in T176623#3635213 and T176623#3635269

If jenkins kills the job, for example in, the container will actually continue running until the command has finished which leads to problems.

  • Slaves can become saturated as the next job may immediately start running
  • Jobs will be running in the same workspace

Perhaps the docker plugin handles this nicely?

See also

We used docker run --tty but --tty always disable signal proxying and thus Jenkins SIGTERM is not relayed to the container. Mentioned to upstream as

We want to catch Jenkins SIGTERM and docker stop.

Event Timeline

Addshore created this task.Sep 26 2017, 1:42 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 26 2017, 1:42 PM
Addshore added a comment.EditedSep 28 2017, 4:37 PM

docker run has the following:

--sig-proxy                      Proxy received signals to the
                                 process (default true)
--stop-signal string             Signal to stop a container
                                 (default "15")
--stop-timeout int               Timeout (in seconds) to stop a

We could always pass a --stop-timeout in when using docker run.
It would be interesting to see if there is a way for jjb to detect / know the timeout of the job it is building the config for and pass that into docker run as a failsafe.

hashar added a subscriber: hashar.Oct 4 2017, 1:14 PM

TL:DR signal is a mess. Jenkins sends SIGTERM and the chain ends up being something like:

  • SIGTERM to docker run
  • Depending on -i, -t, --sig-proxy, --init and or the ENTRYPOINT being in shell form, the signal may or may not reach the process which may or may not kill the process group

Surely we can figure out the proper combo to make sure the process are killed (sigterm) and ultimately ripped off (sigkill). Though the safest is probably to do something like:

function stop_container() {
     docker stop `cat`
trap stop_container EXIT

docker run --cidfile

Then Jenkins send a sigterm to the shell script, it catch it and trigger a docker stop. That eventually SIGKILL it after a default of 10 seconds.

hashar added a comment.Oct 4 2017, 1:15 PM

(we can also look at how the Jenkins Docker pipeline plugin solved it :D )

(we can also look at how the Jenkins Docker pipeline plugin solved it :D )

FWIW, the docker pipeline plugin seems to just be passing a container id around.

DockerClient run returns a container id:

DockerClient stop accepts a container and runs stop and subsequently rm

hashar added a comment.EditedNov 6 2017, 3:17 PM

Based on docker ps --no-trunc, there are containers still running for the images:

12wmfreleng/npm-test:v2017."/bin/sh -c '/bin/bash /'"ENTRYPOINT /bin/bash /
1wmfreleng/operations-puppet:v2017."/bin/sh -c '/bin/bash /'"ENTRYPOINT /bin/bash /
1wmfreleng/tox:v2017."/bin/bash"ENTRYPOINT /bin/bash /

The process are happily leaked / still running.

The Jenkins jobs run them with docker run --rm --tty, we attach a tty for Jenkins.

/bin/sh is linked to dash, I have no idea how it handles signals.

Jenkins does issue SIGTERM to the docker run command, that is relayed to the Docker daemon and then to the bash command which happily ignores it. If one run the container locally and CTRL+C bash shallows it.

I tried with a dumb containers that uses:

ENTRYPOINT ["/"] invokes a python script that trap signals. SIGTERM is definitely shallowed by bash / not passed to the python script.
Though passing --init to docker run kind of fix passing SIGTERM (but not CTRL+C).

Mentioned in SAL (#wikimedia-releng) [2017-11-06T22:29:29Z] <hashar> killed stuck npm Docker containers on integration-slave-docker-1002 (due to T176747 ). Pooled the instance back, the slowness it experienced is probably not related to labvirt CPU usage ( T179378 )

Change 389937 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: handle signals in tox entrypoint

Mentioned in SAL (#wikimedia-releng) [2017-11-09T07:53:17Z] <hasharAway> Killed stuck containers wmfreleng/npm-test on integration-slave-docker-1001 - T176747

hashar added a comment.Nov 9 2017, 3:12 PM

So I have found an example :)

10:41:29 + docker run --rm --tty --volume /srv/jenkins-workspace/workspace/mwgate-npm-node-6-docker/log:/log --volume /srv/jenkins-workspace/workspace/mwgate-npm-node-6-docker/cache:/cache --volume /srv/jenkins-workspace/workspace/mwgate-npm-node-6-docker/src:/src wmfreleng/npm-test:v2017.
10:41:29 + cd /src
10:41:29 + node --version
10:41:29 v6.11.0
10:41:29 + npm --version
10:41:30 3.8.3
10:41:30 + rm -rf node_modules
10:41:30 + npm install --no-progress
10:41:37 Build was aborted
10:41:37 Aborted by anonymous
10:41:37 Archiving artifacts
10:41:37 Finished: ABORTED

Zuul rescheduled that build, asked Jenkins to abort the job. Jenkins then send a SIGTERM to docker run which SIGTERM the /bin/sh -c '/bin/bash'. SIGTERM is caught by one of the shell, eventually npm is left behind unattached and not ripped nor receiving SIGKILL.

I have no good explanation though. What I think of is to :

  1. replace the ENTRYPOINT to use the exec form, which at least get rid of the /bin/sh -c 'xxx' construct
  2. inject an init system in the container by using docker run --init

The later instructs Docker to inject in the container docker-init from the host. That is It is meant to run as PID 1 and will forward most signals (eg SIGTERM / SIGINT) to the defined ENTRYPOINT. It also has a feature to reap (read: SIGKILL) left over process, as I understand it.

Change 390261 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: npm-test ENTRYPOINT to exec form

Mentioned in SAL (#wikimedia-releng) [2017-11-09T15:20:01Z] <hashar> docker push wmfreleng/npm-test:v2017. | | T176747

Change 390261 merged by jenkins-bot:
[integration/config@master] docker: npm-test ENTRYPOINT to exec form

hashar added a comment.Nov 9 2017, 3:41 PM

I gave --init a try, aborting the job once it started npm:

15:38:30 + npm install --no-progress
15:38:43 Build was aborted
15:38:43 Aborted by hashar
15:38:43 Archiving artifacts
15:38:43 /tmp/ line 10:  4809 Terminated              docker run --init --rm --tty --volume "$(pwd)"/log:/log --volume "$(pwd)"/cache:/cache --volume "$(pwd)"/src:/src wmfreleng/npm-test:v2017.

But docker run does not relay the signal and npm still runs behind the scene:

├─dockerd─┬─docker-containe─┬─docker-containe─┬─init───bash───npm─┬─4*[{V8 WorkerThread}]
│         │                 │                 │                   ├─{node}
│         │                 │                 │                   └─4*[{npm}]
│         │                 │                 └─8*[{docker-containe}]
│         │                 └─10*[{docker-containe}]
│         └─11*[{dockerd}]

To document the discussion from IRC the other day, docker run requires --sig-proxy=true to proxy signals to running containers; however, --sig-proxy=true doesn't seem to work if a tty has been allocated. To get the stdout/stderr to output to console log while still proxying signals, the docker run command should be something like:

docker run --rm -a stdout -a stderr --sig-proxy=true [image]

However, that requires the entrypoint to handle the signal being proxied correctly and traping a signal will still wait on execution to run the trap. Running docker run with the --init flag may be a workaround for signal proxying, but this seems to end up sending SIGKILL to the underlying process if it's blocking when you send SIGTERM to the container. I'm not sure that that matters since for all practical purposes this stops the container. The full command then would be something like:

docker run --rm -a stdout -a stderr --sig-proxy=true --init [image]

Another option may be to use the --cidfile flag to write the container ID to a file that we can then act on directly during job cleanup.

hashar updated the task description. (Show Details)Nov 9 2017, 4:06 PM
hashar added a comment.Nov 9 2017, 4:12 PM

So maybe we can end up with:

exec docker run --rm -a stdout -a stderr --sig-proxy=true  --init
hashar added a comment.Nov 9 2017, 9:24 PM

docker CLI source code is at

Bits from cli/command/container/run.go

flags.BoolVar(&opts.sigProxy, "sig-proxy", true, "Proxy received signals to the process")
func runContainer( ... ) error {
    // Disable sigProxy when in TTY mode
    if config.Tty {
        opts.sigProxy = false

Which really is in need of


hashar added a comment.EditedNov 9 2017, 10:36 PM

I wanted to keep the pseudo TTY allocation, at least to get python/ruby etc to do line based buffering instead of some X KBytes block. That is a little more interactive and get us more accurate time stamps in the Jenkins console.

But.. docker run disables sig proxy entirely :(

history time

I eventually trace it back to

That enables the signal proxy by default and disable it when a pseudo tty is allocated. Lets dig deeper.

Previously passing -t and -sig-proxy errored out with

TTY mode (-t) already imply signal proxying (-sig-proxy)

Was done by

Fix issue killing container while sending signal

Ie it used to be that --tty proxified the signal. And in non-tty it was not unless you forced it.

Seems to me e0b59ab5 broke it.

In conclusion I think we want to just:

exec docker run --rm

In my experience stdout/stderr are attached and the signal proxy is passed. With the drawback that python (and others I guess) is not line buffered due to lack of a pseudo tty.

Edit: replied on upstream issue

Change 390341 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Forward Jenkins SIGTERM down to actual process

hashar claimed this task.Nov 9 2017, 11:02 PM
hashar triaged this task as High priority.

Mentioned in SAL (#wikimedia-operations) [2017-11-10T09:30:06Z] <hashar> Upgrading operations-puppet-tests-docker jenkins job to stop passing docker --tty and thus have signals forwarded from 'docker run' - T176747

Change 390341 merged by jenkins-bot:
[integration/config@master] Forward Jenkins SIGTERM down to actual process

Change 389924 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] docker: enhance tox image for signals

Change 389924 merged by jenkins-bot:
[integration/config@master] docker: enhance tox image for signals

Mentioned in SAL (#wikimedia-releng) [2017-11-10T12:39:28Z] <hashar> Updated Jenkins tox jobs | | Does not quite fix T176747 yet thoughaaaaaaaaaaaa

Mentioned in SAL (#wikimedia-releng) [2017-11-10T21:36:18Z] <hasharAway> docker push wmfreleng/tox:v2017. && docker push wmfreleng/tox:latest | docker: handle signals in tox entrypoint | T176747

Mentioned in SAL (#wikimedia-releng) [2017-11-10T21:39:19Z] <hashar> docker push wmfreleng/tox-pyspark:v2017. && docker push wmfreleng/tox-pyspark:latest | | docker: handle signals in tox entrypoint | T176747

Change 389937 merged by jenkins-bot:
[integration/config@master] docker: handle signals in tox entrypoint

hashar updated the task description. (Show Details)Nov 28 2017, 2:21 PM
hashar lowered the priority of this task from High to Low.Jan 10 2018, 5:30 PM

status update

There is almost no more containers leaking nowadays. There are still a few here and there but it is less than a concern.

The proper fix will be to offload running the container to K8S and let it deal with garbage collecting them

hashar closed this task as Resolved.Jan 31 2018, 8:15 PM

The bulk of it has been done, there are still some containers leaking from time to time but it is less of an issue.

Will be fixed by migrating to Kubernetes.

JENKINS-17116 had a patch . It first SIGTERM then after sometime closes stdin/stdout/stderr. The patch is released with Jenkins 2.141.