
Pipeline jobs freezing during teardown
Closed, Duplicate · Public

Description

Some pipeline jobs are reportedly getting stuck at the teardown step running docker rmi --force ...

One example: https://integration.wikimedia.org/ci/job/wikimedia-toolhub-pipeline-test/118/console

Event Timeline

From ps:

 9354 ?        S      0:00 sh -c ({ while [ -d '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9' -a \! -f '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt' ]; do touch '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt'; sleep 3; done } & jsc=durable-1e6b7686b5c01eb68c85140d0ea74ca0; JENKINS_SERVER_COOKIE=$jsc '/bin/bash' -xe  '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh' > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt' 2>&1; echo $? > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp'; mv '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp' '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt'; wait) >&- 2>&- &
 9355 ?        S      0:00  \_ sh -c ({ while [ -d '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9' -a \! -f '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt' ]; do touch '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt'; sleep 3; done } & jsc=durable-1e6b7686b5c01eb68c85140d0ea74ca0; JENKINS_SERVER_COOKIE=$jsc '/bin/bash' -xe  '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh' > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt' 2>&1; echo $? > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp'; mv '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp' '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt'; wait) >&- 2>&- &
19735 ?        S      0:00  |   \_ sleep 3
 9356 ?        S      0:00  \_ /bin/bash -xe /srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh
 9359 ?        Sl     0:00      \_ docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6

Those SHA1s do not show up in docker ps -a. That might be a bug in docker 19.03.5, which I will eventually get upgraded to the latest 19.03.x release.
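
For cross-checking, something along these lines shows whether those IDs still exist as containers or only as images (a sketch, using the IDs from the stuck job above):

$ sudo docker ps -a | grep -E 'ee88e6bf8898|96f16ff7774f|ceb93158475f|47e2530d59a6'
$ sudo docker images --all | grep -E 'ee88e6bf8898|96f16ff7774f|ceb93158475f|47e2530d59a6'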

Mentioned in SAL (#wikimedia-releng) [2020-11-02T22:18:25Z] <hashar> Killed Pipeline job , stuck running docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6 T267075

Two got deleted, but two are left behind:

$ sudo docker images|grep ceb9315
<none>                                                                    <none>              ceb93158475f        About an hour ago   586MB
$ sudo docker images|grep 47e2530d
<none>                                                                    <none>              47e2530d59a6        About an hour ago   586MB

I strongly suspect this is a duplicate of T265615. Deleting an image/container issues a lot of disk writes to dispose of the files, and those writes are throttled on the infrastructure at 500 per second:

1009_iops.png (IOPS graph for agent 1009, 66 KB)

Those yellow dips can't go above 500 IOPS, which I guess corresponds to the image deletion attempts.
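
To see the throttle from the agent itself while a removal is in flight, something like iostat should show writes plateauing around 500/s (a sketch; the device name depends on the instance):

$ iostat -xd 5 /dev/vda    # watch the w/s column while docker rmi is running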

The quota is in the process of being raised on all instances (T266777).

hashar changed the task status from Duplicate to Resolved. Nov 3 2020, 9:06 PM

The instance got tuned to allow more IO.
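
I am not pasting the exact change from T266777 here, but on a libvirt-backed instance raising the write IOPS cap looks roughly like this (hypothetical domain name and limit):

$ sudo virsh blkdeviotune i-000abcde vda --write-iops-sec 5000 --live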

I have deleted the two other images that were left on agent 1009, at 2020-11-03 08:55:01 and 2020-11-03 08:55:26. And that was definitely faster.
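
The cleanup itself was nothing fancy, essentially a manual removal of the leftover image IDs, roughly:

$ time sudo docker rmi ceb93158475f 47e2530d59a6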