Disk-space related issues still occurring for Docker based CI jobs
Closed, Resolved (Public)

Description

Jenkins node integration-slave-docker-1041 was taken offline by @Krinkle over the weekend. The offline message cited job failure due to lack of disk space.

It seems non-Quibble Docker based jobs still leave a full src directory behind after each run. That is neither necessary nor desirable, since it eventually exhausts the disk space available to the workspace directory. All remaining Docker jobs should have a post-build step added that removes the src directory following each build, and perhaps other directories that are not expected to store artifacts (e.g. tmp).
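
As a rough illustration of the post-build step proposed above, here is a minimal Python sketch. It is not the actual change made to integration/config. The function name `remove_build_dirs` and the default directory list are assumptions for illustration; only the src and tmp names come from this task.

```python
import shutil
from pathlib import Path


def remove_build_dirs(workspace, names=("src", "tmp")):
    """Delete the named subdirectories of a workspace.

    Hypothetical helper: `names` lists directories assumed NOT to hold
    artifacts. Returns the names that were actually removed.
    """
    removed = []
    for name in names:
        target = Path(workspace) / name
        # Only remove real directories; leave symlinks and files alone.
        if target.is_dir() and not target.is_symlink():
            shutil.rmtree(target)
            removed.append(name)
    return removed
```

A job would call this on its own workspace path once the build has finished, so that anything outside `names` (for example a log directory) is left for artifact collection.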

This round of disk-space related failures is related to, but does not share a root cause with, T202457: mediawiki-quibble docker jobs fails due to disk full. However, it may actually occur more frequently, both because the LVM volume group was split between the /srv and /var/lib/docker logical volumes and because executors were consolidated onto larger shared instances (see T202160). More executors on a single instance means more potential for multiple dirty directories per job (i.e. {job-name}@{executor} directories).
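
To see how many per-executor directories a job has accumulated on an instance, something like the following sketch could help. It assumes that all workspaces sit directly under a single root directory and that everything after the first `@` in a directory name is the executor suffix. The function name `group_workspaces_by_job` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_workspaces_by_job(root):
    """Map each job name to the sorted workspace directory names found for it.

    Assumption: directories are named {job-name} or {job-name}@{executor}.
    """
    groups = defaultdict(list)
    for entry in Path(root).iterdir():
        if entry.is_dir():
            # Strip the assumed "@{executor}" suffix to recover the job name.
            job_name = entry.name.split("@", 1)[0]
            groups[job_name].append(entry.name)
    return {job: sorted(names) for job, names in groups.items()}
```

Jobs that map to several directories are the ones most likely to leave multiple dirty copies behind on a shared instance.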

In any case, it would seem desirable to have builds clean up all directories that aren't used to store artifacts at the end of the build process, not the beginning.

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2018-10-01T17:13:24Z] <marxarelli> bringing integration-slave-docker-1041 back online following source directory clean up (T205902)

dduvall triaged this task as High priority.

Change 463848 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[integration/config@master] Wipe entire workspace after builds of Docker based jobs

https://gerrit.wikimedia.org/r/463848

Change 463848 merged by jenkins-bot:
[integration/config@master] Wipe entire workspace after builds of Docker based jobs

https://gerrit.wikimedia.org/r/463848