
compiler1002.puppet-diffs.eqiad.wmflabs disk is full
Closed, ResolvedPublic

Description

Building remotely on compiler1002.puppet-diffs.eqiad.wmflabs (puppet-compiler-node) in workspace /srv/jenkins-workspace/workspace/operations-puppet-catalog-compiler

[ 2019-04-29T13:57:27 ] INFO: Creating directories under /srv/jenkins-workspace/puppet-compiler
error: copy-fd: write returned: No space left on device
fatal: failed to copy file to '/srv/jenkins-workspace/puppet-compiler/16157/production/src/.git/objects/pack/pack-98408806b04fe7d669426d6ce40e0d2c3be9d1c1.pack': No space left on device

I can't SSH into the machine. Not sure if it's related; maybe I just don't have access:

$ ssh compiler1002.puppet-diffs.eqiad.wmflabs
Connection closed by UNKNOWN port 65535

Event Timeline

Gilles updated the task description.

I've cleaned up outputs older than 31 days, which gave us almost 5G:

root@compiler1002:/srv/jenkins-workspace/puppet-compiler/output# find ./ -type d -ctime +31 -maxdepth 1 -exec rm -rf {} +
root@compiler1002:/srv/jenkins-workspace/puppet-compiler/output# df -h
Filesystem                          Size  Used Avail Use% Mounted on
udev                                3.9G     0  3.9G   0% /dev
tmpfs                               799M   83M  717M  11% /run
/dev/vda3                            19G  3.2G   15G  19% /
tmpfs                               4.0G  4.0K  4.0G   1% /dev/shm
tmpfs                               5.0M     0  5.0M   0% /run/lock
tmpfs                               4.0G     0  4.0G   0% /sys/fs/cgroup
/dev/mapper/vd-second--local--disk   60G   52G  4.8G  92% /srv
Gilles claimed this task.
hashar subscribed.

That is just a workaround: /srv has already gone from 4.8G available down to 1.7G.

Of the 60G partition, 40G is used by the build reports stored in /srv/jenkins-workspace/puppet-compiler/output, with some taking 1GB each. They are served publicly via https://puppet-compiler.wmflabs.org/compiler1002/

This is apparently due to the .pson files being rather large. They are pretty-printed JSON of the host catalog, for current production and for the change, so the output grows fairly fast whenever a compilation is made for several hosts.

I guess we would want:

  • have the CI job or puppet-compiler gzip those .pson files
  • add a puppet tidy class to purge old artifacts
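
As an illustration of the purge item, here is a minimal Python sketch of an age-based cleanup of the top-level build directories. It is not the puppet tidy resource itself; the function name `purge_old_outputs` and its parameters are hypothetical, and it keys on modification time (the manual `find` run above used `-ctime`), so treat it as a sketch under those assumptions.

```python
import shutil
import time
from pathlib import Path


def purge_old_outputs(output_dir, max_age_days=31, now=None, dry_run=True):
    """Remove top-level build directories last modified more than max_age_days ago.

    Hypothetical helper for illustration. Returns the list of directories
    that were removed (or would be removed when dry_run is True).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in sorted(Path(output_dir).iterdir()):
        # Only direct children are considered, never output_dir itself.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if entry.stat().st_mtime < cutoff:
            if not dry_run:
                shutil.rmtree(entry)
            removed.append(entry)
    return removed
```

Calling it with the default `dry_run=True` only lists the candidates, which is a cheap way to check the selection before deleting anything.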
Dzahn triaged this task as Medium priority. May 1 2019, 5:44 PM

Change 507623 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] puppet_compiler: add cron to delete old output files

https://gerrit.wikimedia.org/r/507623

I agree and think this is a good step to address this.

In addition, I think we'll want to implement compression as @hashar suggests, and/or deploy larger /srv filesystems on the worker hosts, since compiling for a large number of hosts can create output directories up to ~1G in size, and enough of these could be created within a 30 day window to fill the filesystem.
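
To put a rough number on that, a back-of-the-envelope calculation (a sketch only; the function name is hypothetical, and it assumes every run is a worst-case ~1G and that nothing else lives on /srv, so it gives an upper bound):

```python
def max_large_runs_per_day(fs_size_gb, run_size_gb, retention_days, other_usage_gb=0.0):
    """Largest steady rate of runs per day that still fits within the retention window.

    Hypothetical helper: at steady state, retention_days * rate runs are kept
    on disk at once, and together they must fit in the space left for reports.
    """
    budget_gb = fs_size_gb - other_usage_gb
    return budget_gb / (run_size_gb * retention_days)


# 60G filesystem, ~1G per large run, 30-day retention
print(max_large_runs_per_day(60, 1, 30))  # 2.0
```

So under these assumptions, an average of just two large compilations per day would be enough to fill the filesystem before the 30-day cleanup removes anything.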

On paper, this use case would also lend itself to a filesystem with transparent compression, maybe btrfs with compression. The data stored on disk is non-critical, and there are multiple worker nodes should issues arise with one filesystem.

Change 507623 merged by Dzahn:
[operations/puppet@production] puppet_compiler: add cron to delete old output files

https://gerrit.wikimedia.org/r/507623

Cron deployed on compiler1002 and tested by running it manually. It works: it removed about 2G (the output dir went from 44G to 42G).

Is there more to the ticket now?

  • have the CI job or puppet-compiler gzip those .pson files
  • add a puppet tidy class to purge old artifacts

So either have the Jenkins jobs compress the artifacts or have puppet-compiler do it.
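
Either way, the compression step could look roughly like the following Python sketch, run as a post-build step over one output directory. The function name `gzip_pson_files` is hypothetical and this is not existing puppet-compiler or Jenkins job code. It also assumes that whatever reads the reports afterwards can cope with a `.pson.gz` suffix.

```python
import gzip
import shutil
from pathlib import Path


def gzip_pson_files(output_dir):
    """Compress every *.pson file under output_dir to *.pson.gz, removing the original.

    Hypothetical helper for illustration. Returns (bytes_before, bytes_after).
    """
    before = after = 0
    # Materialize the list first so the tree is not modified while being walked.
    for src in sorted(Path(output_dir).rglob("*.pson")):
        if not src.is_file():
            continue
        dst = src.with_name(src.name + ".gz")
        with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        before += src.stat().st_size
        after += dst.stat().st_size
        src.unlink()
    return before, after
```

Pretty-printed JSON is highly repetitive, so it should compress well, but the actual ratio on real catalogs would need to be measured.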