Pipeline lib still leaks containers on contint1001 / contint2001
Closed, Resolved (Public)

Description

T290437 was filed by SRE after the monthly backup showed it had a million more files than usual. Looking at inodes:

contint1001:~$ df -hi
Filesystem           Inodes IUsed IFree IUse% Mounted on
...
/dev/mapper/vg0-srv     89M   19M   71M   21% /srv
                             ^^^^^
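
For readers who want to watch the same figure from a script, here is a minimal sketch of how the inode usage that df -hi reports can be computed. The function names are hypothetical and not part of any tooling mentioned in this task; it assumes a Unix host where os.statvfs is available, and its rounding may differ from df's.

```python
import os


def inode_usage_percent(total_inodes: int, free_inodes: int) -> float:
    """Return the share of inodes in use, as a percentage."""
    if total_inodes <= 0:
        # Some filesystems report 0 total inodes; avoid dividing by zero.
        return 0.0
    used = total_inodes - free_inodes
    return 100.0 * used / total_inodes


def inode_usage_for(path: str) -> float:
    """Read inode totals for the filesystem that holds `path`."""
    st = os.statvfs(path)  # f_files = total inodes, f_ffree = free inodes
    return inode_usage_percent(st.f_files, st.f_ffree)
```

Calling inode_usage_for("/srv") on the host should give a number comparable to the IUse% column above.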

There are 533 leftover containers there that should have been removed.

We had the same issue a year or so ago (T235680), and more recently with images due to a fault in pipelinelib (T284125). This time it is with containers.

On contint1001 docker ps -a shows more than 500 dangling containers all named plib-run-XXXX.
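
A rough sketch of how those leftovers could be picked out by name. It assumes the input text comes from something like docker ps -a --format '{{.Names}}' (one name per line); the helper names are hypothetical, and nothing here runs docker, it only builds the argument list for a docker rm call.

```python
from typing import List

PREFIX = "plib-run-"  # name prefix seen in the `docker ps -a` output


def leftover_names(names_output: str, prefix: str = PREFIX) -> List[str]:
    """Pick container names starting with `prefix` from one-name-per-line text."""
    names = (line.strip() for line in names_output.splitlines())
    return [name for name in names if name.startswith(prefix)]


def removal_command(names: List[str]) -> List[str]:
    """Build an argv for `docker rm`; return [] when there is nothing to remove."""
    return ["docker", "rm", *names] if names else []
```

Whether every plib-run-* container is safe to delete depends on whether a build is still using it, so a real cleanup would probably also want to look at the container status first.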

Event Timeline

Change 702778 had a related patch set uploaded (by Thcipriani; author: Dduvall):

[integration/pipelinelib@master] Enforce pipefail on all run step commands

https://gerrit.wikimedia.org/r/702778

contint1001 became unresponsive today. A series of changes was sent that triggered pipeline builds, and together they ended up consuming far too much IO / CPU / memory. That was resolved by power-cycling the machine (T299542).

Once the machine was back, docker ps would not respond.

dockerd was busy crawling files under /srv/docker/image/overlay2/imagedb/content/sha256/ which has 32476 directories.

One of those files refers to a container created in August 2021. There are also a lot of intermediate images (<none> <none>).

Eventually, once the crawl was done, docker ps started working.

@dancy issued a docker system prune

So it looks like we really need pipelinelib to remove intermediate images and ensure that containers are deleted once they have completed :]

@dancy issued a docker system prune

Total reclaimed space: 496.2GB

Change 755484 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] DNM: Test a failing pipeline

https://gerrit.wikimedia.org/r/755484

Change 702778 merged by jenkins-bot:

[integration/pipelinelib@master] Enforce pipefail on all run step commands

https://gerrit.wikimedia.org/r/702778

Discussed with @dduvall today. The problem is that if a pipeline step fails, the usual teardown, which deletes containers, doesn't run.
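
To illustrate the failure mode in general terms: cleanup that only runs after a successful step is skipped when the step raises. A minimal sketch of the usual remedy follows, in Python rather than pipelinelib's Groovy. The run_step name and both callables are hypothetical; this is not the actual patch.

```python
from typing import Callable


def run_step(step: Callable[[], None], remove_container: Callable[[], None]) -> None:
    """Run `step`, then always call `remove_container`, even if `step` raises."""
    try:
        step()
    finally:
        # Runs on success and on failure; any exception still propagates.
        remove_container()
```

The caller still sees the step failure, so the build is reported as failed, but the container no longer outlives it.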

Change 755496 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/pipelinelib@master] PipelineRunner.run(): Forcibly remove container on exception

https://gerrit.wikimedia.org/r/755496

Change 755496 merged by jenkins-bot:

[integration/pipelinelib@master] PipelineRunner.run(): Forcibly remove container on exception

https://gerrit.wikimedia.org/r/755496

Change 755484 abandoned by Ahmon Dancy:

[mediawiki/tools/scap@master] DNM: Test a failing pipeline

Reason:

done testing.

https://gerrit.wikimedia.org/r/755484

dancy claimed this task.

There are more improvements to be made to the cleanup process, but https://gerrit.wikimedia.org/r/755496 takes care of most of the problem, so I am calling this resolved for the time being. Note that I only cleaned up contint1001; other nodes surely have collected some cruft as well.

Reopened because containers are still accumulating on contint1001:

dancy@contint1001:~$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                   PORTS               NAMES
f8205c7ea557        8499c40c2a12        "npm test"          3 hours ago         Exited (0) 3 hours ago                       plib-run-8b9w8fnr
ea0800f5ee06        e235e6093bfa        "npm test"          3 hours ago         Exited (0) 3 hours ago                       plib-run-83psyzyi
37bfc0cf5940        3d6427bdb690        "npm test"          3 hours ago         Exited (0) 3 hours ago                       plib-run-2mk5d7by
6713606c5577        74aa879958dc        "npm test"          4 hours ago         Exited (0) 4 hours ago                       plib-run-3nez5wha
672cba24efdd        02351a2e63ae        "npm test"          4 hours ago         Exited (0) 4 hours ago                       plib-run-822567it
6eea01b379c8        20ba68e216b8        "npm test"          4 hours ago         Exited (0) 4 hours ago                       plib-run-tlljoh5x
9763e0fa3f5e        b6052ae25054        "npm test"          4 hours ago         Exited (0) 4 hours ago                       plib-run-hp3n3mvn
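
Every container in that listing shows Exited (0), which suggests they come from runs that succeeded rather than from the exception path. That is a reading of the output, not something stated in the task. A small sketch for tallying exit codes, assuming the input lines come from something like docker ps -a --format '{{.Status}}'; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the STATUS text of a stopped container, e.g. "Exited (0) 3 hours ago".
EXITED = re.compile(r"^Exited \((\d+)\)")


def exit_code_counts(status_lines):
    """Count exited containers by exit code; other statuses are skipped."""
    counts = Counter()
    for line in status_lines:
        match = EXITED.match(line.strip())
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

A tally dominated by exit code 0 would point at the success path as the remaining source of leftovers.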

Change 755830 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/pipelinelib@master] PipelineRunner.run(): Add removeContainer parameter

https://gerrit.wikimedia.org/r/755830

Change 756030 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] jjb/service-pipeline.groovy: Always remove test container

https://gerrit.wikimedia.org/r/756030

Change 755830 merged by jenkins-bot:

[integration/pipelinelib@master] PipelineRunner.run(): Add removeContainer parameter

https://gerrit.wikimedia.org/r/755830

Change 756030 merged by jenkins-bot:

[integration/config@master] jjb/service-pipeline.groovy: Always remove test container

https://gerrit.wikimedia.org/r/756030