Maniphest T205902

Disk-space related issues still occurring for Docker based CI jobs
Closed, Resolved (Public)

Assigned To

	dduvall

Authored By

	dduvall
	Oct 1 2018, 5:12 PM

Description

Jenkins node integration-slave-docker-1041 was taken offline by @Krinkle over the weekend. The offline message cited job failures caused by a lack of disk space.

It seems non-Quibble Docker-based jobs still leave a full src directory behind after each run. That should be neither necessary nor desirable, since it eventually exhausts the workspace directory. All remaining Docker jobs should get a post-build step that removes the src directory after each build, and perhaps other directories that are not expected to store artifacts (e.g. tmp).

This round of disk-space related failures is related to, but does not share a root cause with, T202457 (mediawiki-quibble docker jobs fails due to disk full). However, it may actually occur more frequently, both because the LVM volume group was split between the /srv and /var/lib/docker logical volumes and because executors were consolidated onto larger shared instances (see T202160). More executors on a single instance means more potential for multiple dirty directories per job (i.e. {job-name}@{executor} directories).
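To illustrate how dirty directories multiply with executors, here is a minimal Python sketch that groups workspace directories by job name. It assumes the {job-name}@{executor} naming mentioned above, and additionally assumes that a directory with no @ suffix belongs to the job of the same name. The function name workspaces_by_job and the workspace root argument are hypothetical and not part of any existing tooling.

```python
from collections import defaultdict
from pathlib import Path


def workspaces_by_job(workspace_root):
    """Group workspace directories under `workspace_root` by job name.

    Assumption: directories are named "{job-name}" or
    "{job-name}@{executor}"; everything before the first "@" is
    treated as the job name. Plain files are ignored.
    """
    groups = defaultdict(list)
    for entry in Path(workspace_root).iterdir():
        if entry.is_dir():
            job_name = entry.name.split("@", 1)[0]
            groups[job_name].append(entry.name)
    # Sort directory names so the result is stable across runs.
    return {job: sorted(names) for job, names in groups.items()}
```

A job with several entries in the result has left a separate directory per executor, each of which may hold its own leftover src directory.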

In any case, it would seem desirable to have builds clean up all directories that aren't used to store artifacts at the end of the build process, not the beginning.
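As a rough illustration of end-of-build cleanup, the following Python sketch removes everything in a workspace except directories that hold artifacts. It is not the merged change, which is a job configuration change in integration/config. The function name clean_workspace and the artifact directory name "log" are assumptions made for the example.

```python
import shutil
from pathlib import Path

# Hypothetical: names of top-level entries assumed to hold artifacts.
KEEP = frozenset({"log"})


def clean_workspace(workspace, keep=KEEP):
    """Delete every top-level entry in `workspace` not named in `keep`.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in Path(workspace).iterdir():
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            # Real directories (e.g. src, tmp) are removed recursively.
            shutil.rmtree(entry)
        else:
            # Files and symlinks are unlinked without following links.
            entry.unlink()
        removed.append(entry.name)
    return sorted(removed)
```

Running this as the last step of a build, rather than the first, means a finished job does not hold disk space until its next run.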

Details

	Subject	Repo	Branch	Lines +/-
	Wipe entire workspace after builds of Docker based jobs	integration/config	master	+80 -20

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		dduvall	T205902 Disk-space related issues still occurring for Docker based CI jobs
		Resolved		dduvall	T206134 Nodes taken offline after /var/lib/docker partition fills due to container logging

Event Timeline

dduvall created this task. Oct 1 2018, 5:12 PM

Restricted Application added a subscriber: Aklapper. Oct 1 2018, 5:12 PM

Mentioned in SAL (#wikimedia-releng) [2018-10-01T17:13:24Z] <marxarelli> bringing integration-slave-docker-1041 back online following source directory clean up (T205902)

dduvall claimed this task. Oct 1 2018, 5:15 PM

dduvall triaged this task as High priority.

dduvall moved this task from Untriaged to In-progress on the Continuous-Integration-Infrastructure board.

dduvall moved this task from Backlog to In-progress on the Release-Engineering-Team (Kanban) board.

Change 463848 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[integration/config@master] Wipe entire workspace after builds of Docker based jobs

https://gerrit.wikimedia.org/r/463848

gerritbot added a project: Patch-For-Review. Oct 1 2018, 9:30 PM

Change 463848 merged by jenkins-bot:
[integration/config@master] Wipe entire workspace after builds of Docker based jobs

https://gerrit.wikimedia.org/r/463848

dduvall closed this task as Resolved. Oct 5 2018, 10:15 PM

dduvall closed subtask T206134: Nodes taken offline after /var/lib/docker partition fills due to container logging as Resolved.