
CI failing with "No space left on device" (debian-glue)
Closed, ResolvedPublic

Description

Two CI failures on a debian-glue build:

From one of them:

00:26:15.816 {standard input}: Fatal error: can't close unit-tests/test_NextHopStrategyFactory-test_NextHopStrategyFactory.o: No space left on device

I see T285942 which might be related, but I didn't want to rm anything without running it by you. Thank you!

From IRC #wikimedia-releng, Jenkins (as wmf-insecte) complained about one of the hosts:

16:05:36 <wmf-insecte> maintenance-disconnect-full-disks build 500282 integration-agent-pkgbuilder-1001 (/: 33%, /srv: 98%): OFFLINE due to disk space
16:10:37 <wmf-insecte> maintenance-disconnect-full-disks build 500283 integration-agent-pkgbuilder-1001 (/: 33%, /srv: 73%): RECOVERY disk space OK

Event Timeline

ssingh triaged this task as Medium priority. (Jun 14 2023, 8:47 PM)
Legoktm renamed this task from CI failing with "No space left on device" (debian-gule) to CI failing with "No space left on device" (debian-glue). (Jun 15 2023, 1:42 AM)
hashar subscribed.

That is a recurring issue because the Jenkins jobs run on static hosts which are not always entirely cleaned up after a build has completed. Previously we used one-off virtual machines to ensure a clean state, but that caused other troubles (it is a long story).

The debian-glue* jobs run on Jenkins agents with the DebianGlue label. They use cowbuilder and images generated by Puppet package_builder. There are a few things that can overflow the disk space:

  • New cow images being added for Bookworm?
  • A Debian package having a large disk space requirement (e.g. Texlive, but I don't think we have forked it)
  • The apt cache overflowing, I don't think it is garbage collected


/srv is 21G on the instances and:

Disk size in MB  Directory
11109            /srv/pbuilder/aptcache
10876            /srv/pbuilder/aptcache
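
For anyone wanting to reproduce that kind of measurement, here is a minimal Python sketch that sums file sizes under a directory and prints them in the same "size, directory" layout. It is only an illustration: the function name `dir_size_mb` is made up for this example, and it reports apparent file size in MiB, so the numbers can differ somewhat from a block-based tool such as `du`.

```python
import os
import stat
import sys


def dir_size_mb(root):
    """Return the apparent size of all regular files under root, in MiB.

    Hypothetical helper for illustration; symlinks and special files are
    skipped so that nothing outside the tree is counted.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                # The file may vanish while a build is running; ignore it.
                continue
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total / (1024 * 1024)


if __name__ == "__main__":
    # Example: python3 dirsize.py /srv/pbuilder/aptcache
    for root in sys.argv[1:]:
        print(f"{dir_size_mb(root):.0f}\t{root}")
```

Run against `/srv/pbuilder/aptcache` on each agent, this should show at a glance whether the cache is again taking a large share of `/srv`.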

Mentioned in SAL (#wikimedia-releng) [2023-06-15T15:06:25Z] <hashar> integration-agent-pkgbuilder-1001 and integration-agent-pkgbuilder-1002: clearing pbuilder apt cache: rm /srv/pbuilder/aptcache/*/*.deb # T339171

hashar added a project: ci-test-error.

I have manually deleted the apt caches, which were taking half of the disk space and are never purged. I have filed T339251 to have the purge happen automatically.
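
As a rough idea of what an automatic purge could look like, the sketch below removes `.deb` files matching the same `*/*.deb` layout as the manual cleanup, but only those older than a given age. This is an assumption-laden example, not what T339251 implements: the function name `purge_old_debs`, the 30-day default, and the age-based policy are all hypothetical, and it defaults to a dry run so nothing is deleted by accident.

```python
import time
from pathlib import Path


def purge_old_debs(cache_root, max_age_days=30, dry_run=True, now=None):
    """Remove <cache_root>/*/*.deb files not modified for max_age_days.

    Hypothetical helper: the age threshold is an assumed policy. Returns the
    list of paths that were removed (or would be removed when dry_run=True).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for deb in sorted(Path(cache_root).glob("*/*.deb")):
        try:
            if not deb.is_file() or deb.stat().st_mtime >= cutoff:
                continue  # keep recent packages and skip non-files
            if not dry_run:
                deb.unlink()
        except OSError:
            # A concurrent build may have touched or removed the file.
            continue
        removed.append(deb)
    return removed


if __name__ == "__main__":
    # Dry run: list what would be deleted from the pbuilder apt cache.
    for path in purge_old_debs("/srv/pbuilder/aptcache"):
        print(path)
```

Keeping recently used packages means builds still benefit from the cache, while stale packages stop accumulating until the disk fills up.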