Some pipeline jobs are reportedly getting stuck at the teardown step running docker rmi --force ...
One example: https://integration.wikimedia.org/ci/job/wikimedia-toolhub-pipeline-test/118/console
I manually killed this one when I felt like it was just waiting to timeout: https://integration.wikimedia.org/ci/job/wikimedia-toolhub-pipeline-test/111/console
From ps:
 9354 ?  S   0:00 sh -c ({ while [ -d '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9' -a \! -f '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt' ]; do touch '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt'; sleep 3; done } & jsc=durable-1e6b7686b5c01eb68c85140d0ea74ca0; JENKINS_SERVER_COOKIE=$jsc '/bin/bash' -xe '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh' > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt' 2>&1; echo $? > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp'; mv '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp' '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt'; wait) >&- 2>&- &
 9355 ?  S   0:00  \_ sh -c ({ while [ -d '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9' -a \! -f '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt' ]; do touch '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt'; sleep 3; done } & jsc=durable-1e6b7686b5c01eb68c85140d0ea74ca0; JENKINS_SERVER_COOKIE=$jsc '/bin/bash' -xe '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh' > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt' 2>&1; echo $? > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp'; mv '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp' '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt'; wait) >&- 2>&- &
19735 ?  S   0:00  |   \_ sleep 3
 9356 ?  S   0:00  \_ /bin/bash -xe /srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh
 9359 ?  Sl  0:00      \_ docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6
The SHA-1s do not show up in docker ps -a. That might be a bug in Docker 19.03.5, which I will eventually upgrade to the latest 19.03.x release later on.
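If the hang recurs, one possible mitigation (my own suggestion, not something the pipeline does today) would be to wrap the teardown removal in coreutils timeout so a wedged daemon call fails fast instead of blocking the job until the Jenkins timeout. The remove_images helper name, the RMI_TIMEOUT knob and the 300-second default are all assumptions for illustration:

```shell
#!/bin/sh
# Hedged sketch: bound the time a teardown command may take.
# coreutils `timeout` exits with status 124 when the deadline
# is reached, which lets the job log the hang and move on.
remove_images() {
    timeout "${RMI_TIMEOUT:-300}" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "image removal timed out; leaving images for a later cleanup" >&2
    fi
    return "$status"
}

# Example with the image IDs from the stuck build above:
# remove_images docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6
```

The teardown step would then fail with a clear message rather than sitting idle until the job-level timeout kills it.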
Mentioned in SAL (#wikimedia-releng) [2020-11-02T22:18:25Z] <hashar> Killed Pipeline job, stuck running docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6 T267075
Two got deleted, but two are left behind:
$ sudo docker images|grep ceb9315
<none>  <none>  ceb93158475f  About an hour ago  586MB
$ sudo docker images|grep 47e2530d
<none>  <none>  47e2530d59a6  About an hour ago  586MB
I strongly suspect this is a duplicate of T265615. Deleting an image/container issues a lot of disk writes to dispose of the files, and writes are throttled on the infrastructure at 500 per second:
Those yellow peaks can't go above 500 IOPS, which I guess corresponds to the image deletion attempts.
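To confirm on the agent itself that deletions are pinned at the cap, one can estimate write IOPS from /proc/diskstats (field 8 of each line is the count of completed writes since boot). A rough sketch, assuming Linux; the device name vda and the 5-second window are assumptions to adjust for the actual instance disk:

```shell
#!/bin/sh
# Hedged sketch: sample /proc/diskstats twice and report the
# write-completions-per-second rate for one block device.
dev="${1:-vda}"                     # assumed device name
stats="${DISKSTATS:-/proc/diskstats}"

# Print the completed-writes counter (field 8) for $dev.
writes() { awk -v d="$dev" '$3 == d { print $8 }' "$stats"; }

w1=$(writes)
sleep 5
w2=$(writes)
if [ -n "$w1" ] && [ -n "$w2" ]; then
    echo "write IOPS on $dev: $(( (w2 - w1) / 5 ))"
else
    echo "device $dev not present in $stats" >&2
fi
```

Running it while a docker rmi is in flight and seeing the rate flatline around 500 would match the graph above.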
The quota is in the process of being raised on all instances (T266777).
The instance got tuned to allow more IO.
I have deleted the two remaining images that were on agent 1009 at 2020-11-03 08:55:01 and 2020-11-03 08:55:26. And that was definitely faster.