Pipeline jobs freezing during teardown
Closed, Duplicate · Public

Description

Some pipeline jobs are reportedly getting stuck at the teardown step running docker rmi --force ...

One example: https://integration.wikimedia.org/ci/job/wikimedia-toolhub-pipeline-test/118/console
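A rough way to spot a stuck teardown from the agent itself is to look for long-lived docker rmi client processes (a sketch; the [d] in the pattern just keeps grep from matching its own process):

$ ps -eo pid,etime,args | grep '[d]ocker rmi'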

Event Timeline

jeena created this task. · Mon, Nov 2, 9:33 PM
Restricted Application added a subscriber: Aklapper. · Mon, Nov 2, 9:33 PM
bd808 added a subscriber: bd808. · Mon, Nov 2, 9:35 PM

I manually killed this one when it seemed to be just waiting to time out: https://integration.wikimedia.org/ci/job/wikimedia-toolhub-pipeline-test/111/console

hashar added a subscriber: hashar. · Mon, Nov 2, 10:16 PM

From ps:

 9354 ?        S      0:00 sh -c ({ while [ -d '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9' -a \! -f '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt' ]; do touch '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt'; sleep 3; done } & jsc=durable-1e6b7686b5c01eb68c85140d0ea74ca0; JENKINS_SERVER_COOKIE=$jsc '/bin/bash' -xe  '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh' > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt' 2>&1; echo $? > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp'; mv '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp' '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt'; wait) >&- 2>&- &
 9355 ?        S      0:00  \_ sh -c ({ while [ -d '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9' -a \! -f '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt' ]; do touch '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt'; sleep 3; done } & jsc=durable-1e6b7686b5c01eb68c85140d0ea74ca0; JENKINS_SERVER_COOKIE=$jsc '/bin/bash' -xe  '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh' > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-log.txt' 2>&1; echo $? > '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp'; mv '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt.tmp' '/srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/jenkins-result.txt'; wait) >&- 2>&- &
19735 ?        S      0:00  |   \_ sleep 3
 9356 ?        S      0:00  \_ /bin/bash -xe /srv/jenkins/workspace/workspace/wikimedia-toolhub-pipeline-test@tmp/durable-265be6d9/script.sh
 9359 ?        Sl     0:00      \_ docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6
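For readability, the durable-task wrapper above boils down to roughly this (paths shortened to $TMPDIR; a paraphrase, not the exact generated script):

(
  { # heartbeat: touch the log every 3s until a result file appears
    while [ -d "$TMPDIR" ] && [ ! -f "$TMPDIR/jenkins-result.txt" ]; do
      touch "$TMPDIR/jenkins-log.txt"
      sleep 3
    done
  } &
  # run the actual step script, then record its exit code atomically
  /bin/bash -xe "$TMPDIR/script.sh" > "$TMPDIR/jenkins-log.txt" 2>&1
  echo $? > "$TMPDIR/jenkins-result.txt.tmp"
  mv "$TMPDIR/jenkins-result.txt.tmp" "$TMPDIR/jenkins-result.txt"
  wait
) >&- 2>&- &

So while docker rmi (PID 9359) hangs, script.sh never exits, no result file is written, and the heartbeat keeps the step alive until it is killed or times out.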

The hashes do not show up in docker ps -a. That might be a bug in Docker 19.03.5, which I will eventually get upgraded to the latest 19.03.x release.

Mentioned in SAL (#wikimedia-releng) [2020-11-02T22:18:25Z] <hashar> Killed Pipeline job, stuck running docker rmi --force ee88e6bf8898 96f16ff7774f ceb93158475f 47e2530d59a6 T267075
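The kill presumably targeted the hung docker rmi client from the ps listing above, something like (PID taken from that listing; illustrative):

$ sudo kill 9359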

Two got deleted, but two are left behind:

$ sudo docker images|grep ceb9315
<none>                                                                    <none>              ceb93158475f        About an hour ago   586MB
$ sudo docker images|grep 47e2530d
<none>                                                                    <none>              47e2530d59a6        About an hour ago   586MB
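If leftovers like these are truly dangling (untagged and unreferenced), one way to sweep them up is Docker's built-in prune; note it removes all dangling images on the host, so use with care on a shared agent:

$ sudo docker image prune --force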

I strongly suspect this is a duplicate of T265615. Deleting an image/container issues a lot of disk writes to dispose of the files, and those writes are throttled on the infrastructure at 500 per second:

Those yellow spikes in the graph (not reproduced here) can't go above 500 IOPS, which I guess corresponds to the image deletion attempts.
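The ceiling can also be confirmed from inside the instance while a deletion is running, e.g. with sysstat's iostat (watch the w/s column for the backing disk):

$ iostat -x 1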

The quota is in the process of being raised on all instances (T266777).
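For illustration only: a write-IOPS cap like this can be set or raised on a libvirt guest with blkdeviotune. This is a sketch of the general mechanism, not necessarily how the WMCS quota is implemented (INSTANCE and vda are placeholders):

$ virsh blkdeviotune INSTANCE vda --write-iops-sec 500 --live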

hashar changed the task status from Duplicate to Resolved. · Tue, Nov 3, 9:06 PM

The instance got tuned to allow more IO.

I have deleted the two other images that were on agent 1009, at 2020-11-03 08:55:01 and 2020-11-03 08:55:26. And that was definitely faster.
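For reference, a removal like that can be timed directly with the shell's time builtin (image IDs from the listing above):

$ time sudo docker rmi --force ceb93158475f 47e2530d59a6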