
Pipeline lib still leaks containers on contint1001 / contint2001
Closed, Resolved, Public

Description

T290437 was filed by SRE after the monthly backup showed a million more files than usual. Looking at inodes:

contint1001:~$ df -hi
Filesystem           Inodes IUsed IFree IUse% Mounted on
...
/dev/mapper/vg0-srv     89M   19M   71M   21% /srv
                             ^^^^^
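The same inode figures can be read programmatically, which could help flag inode growth before the monthly backup does. A minimal Python sketch using `os.statvfs`; the function name `inode_usage` and the `/srv` default are illustrative assumptions, not existing tooling on the contint hosts:

```python
import os


def inode_usage(path="/srv"):
    """Return (total, used, free, used_percent) inode counts for the filesystem holding `path`."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem
    free = st.f_ffree    # free inodes
    used = total - free
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = (100.0 * used / total) if total else 0.0
    return total, used, free, percent


if __name__ == "__main__":
    total, used, free, percent = inode_usage("/")
    print(f"inodes: total={total} used={used} free={free} use={percent:.0f}%")
```

The numbers correspond to the Inodes, IUsed, IFree and IUse% columns of `df -i`, although rounding may differ slightly.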

There are 533 leftover containers there that should have been removed.

We had the issue a year or so ago (T235680), and more recently with images due to a fault in pipelinelib (T284125). But this time it is with containers.

On contint1001 docker ps -a shows more than 500 dangling containers all named plib-run-XXXX.
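Counting those leftovers can be scripted. The sketch below is a hypothetical helper, not part of pipelinelib: it parses `docker ps -a` output and keeps containers whose name starts with `plib-run-` and whose status does not start with `Up`. Treating every non-running `plib-run-` container as a leftover is an assumption; running ones are skipped because they may belong to a build still in progress.

```python
import subprocess


def parse_leftover(ps_output, prefix="plib-run-"):
    """Return names of non-running containers whose name starts with `prefix`.

    `ps_output` is expected to hold one "name<TAB>status" pair per line.
    """
    leftovers = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        # Running containers report a status such as "Up 2 minutes".
        if name.startswith(prefix) and not status.startswith("Up"):
            leftovers.append(name)
    return leftovers


def list_leftover(prefix="plib-run-"):
    """Ask the local Docker daemon for all containers and filter them."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_leftover(out, prefix)
```

Calling `len(list_leftover())` on a CI host would give the number of candidates to review before removing anything.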

Event Timeline

Change 702778 had a related patch set uploaded (by Thcipriani; author: Dduvall):

[integration/pipelinelib@master] Enforce pipefail on all run step commands

https://gerrit.wikimedia.org/r/702778
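For context on the patch title: `pipefail` is a bash option that makes a pipeline fail when any command in it fails, instead of reporting only the exit status of the last command. The Python sketch below illustrates the effect; it is not pipelinelib's (Groovy) code, and `with_pipefail` and `run_step` are hypothetical names.

```python
import subprocess


def with_pipefail(command):
    """Prefix a shell command so that a failure anywhere in a pipeline fails the whole command."""
    return "set -o pipefail; " + command


def run_step(command, pipefail=True):
    """Run `command` under bash and return its exit status."""
    script = with_pipefail(command) if pipefail else command
    return subprocess.run(["bash", "-c", script]).returncode
```

With this sketch, `run_step("false | true", pipefail=False)` returns 0 because only the last command's status counts, while `run_step("false | true")` returns 1, so the failure becomes visible to the caller.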

contint1001 went unresponsive today. A series of changes were sent that triggered pipeline builds, which eventually consumed far too much I/O, CPU, and memory. That was solved by power cycling the machine (T299542).

Once it was back, docker ps barely worked.

dockerd was busy crawling files under /srv/docker/image/overlay2/imagedb/content/sha256/ which has 32476 directories.

One of those files refers to a container created in August 2021. There are also a lot of intermediate images (<none> <none>).

Eventually once crawling was done, docker ps started working.

@dancy issued a docker system prune

So it looks like we really need pipelinelib to remove intermediate images and to ensure containers are deleted once they have completed :]

@dancy issued a docker system prune

Total reclaimed space: 496.2GB

Change 755484 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] DNM: Test a failing pipeline

https://gerrit.wikimedia.org/r/755484

Change 702778 merged by jenkins-bot:

[integration/pipelinelib@master] Enforce pipefail on all run step commands

https://gerrit.wikimedia.org/r/702778

Discussed with @dduvall today. The problem is that if a pipeline step fails, the usual teardown which deletes containers doesn't run.
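The general remedy for this failure mode is to move teardown into a block that runs whether or not the step succeeds. The following is a Python sketch of that pattern, not the actual pipelinelib change (which is written in Groovy); `run_in_container` and its `runner` parameter are hypothetical and exist only to make the example testable.

```python
import subprocess


def run_in_container(image, command, name, runner=subprocess.run):
    """Run `command` in a named container and always remove the container afterwards."""
    try:
        # check=True raises CalledProcessError on a non-zero exit status.
        runner(["docker", "run", "--name", name, image] + list(command), check=True)
    finally:
        # Teardown lives in `finally`, so it runs on success and on failure alike.
        # `docker rm -f` also removes a container that is still running.
        runner(["docker", "rm", "-f", name], check=False)
```

The exception from the failed step still propagates to the caller, so the build is reported as failed, but the container no longer outlives it. When nothing needs to be copied out of the container afterwards, `docker run --rm` is a simpler alternative.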

Change 755496 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/pipelinelib@master] PipelineRunner.run(): Forcibly remove container on exception

https://gerrit.wikimedia.org/r/755496

Change 755496 merged by jenkins-bot:

[integration/pipelinelib@master] PipelineRunner.run(): Forcibly remove container on exception

https://gerrit.wikimedia.org/r/755496

Change 755484 abandoned by Ahmon Dancy:

[mediawiki/tools/scap@master] DNM: Test a failing pipeline

Reason:

done testing.

https://gerrit.wikimedia.org/r/755484

dancy claimed this task.

There are more improvements to be made to the cleanup process, but https://gerrit.wikimedia.org/r/755496 takes care of most of the problem, so I am calling this resolved for the time being. Note that I only cleaned up contint1001. Other nodes surely have collected some cruft.