Prevent puppet catalog compiler workers from running out of disk space
Closed, Resolved | Public | 0 Estimated Story Points

Description

Today one of the compiler workers ran out of disk space, and the only indicator was failed job output.

To address this I can think of a few actions:

  • Schedule a cleanup job to remove old directories from the jenkins workspace
  • Add disk capacity to the compiler workers (which afaik in openstack means adding cpu/ram resources as well)
  • Alert on PCC worker node filesystem near full
  • Have the CI job or puppet compiler compress the huge .pson catalogs
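
As a rough illustration of the first and third ideas, here is a minimal Python sketch that checks how full a filesystem is and removes workspace directories that have not been modified recently. The function names, the 90% threshold and the 7-day retention are assumptions made for this example; they are not taken from the actual PCC or Jenkins setup.

```python
import os
import shutil
import time
from pathlib import Path


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def remove_old_dirs(workspace, max_age_days, now=None):
    """Delete direct subdirectories of workspace not modified within max_age_days.

    Returns the sorted names of the directories that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in Path(workspace).iterdir():
        # Skip plain files and symlinks; only real directories are candidates.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)


def cleanup_if_nearly_full(workspace, threshold=0.9, max_age_days=7):
    """Run the cleanup only when the filesystem is at or above threshold (assumed 90%)."""
    if usage_fraction(workspace) >= threshold:
        return remove_old_dirs(workspace, max_age_days)
    return []
```

Something like `cleanup_if_nearly_full` could be run from a scheduled job, and the same `usage_fraction` value could feed an alert, though the alerting side would depend on whatever monitoring is already in place.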

Event Timeline

herron triaged this task as Medium priority. Apr 29 2019, 3:09 PM
herron created this task.

This happened again the other day, which prompted me to mail the SRE list. I then added docs on how to clean up here:

https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs/Documentation#Out_of_disk_space

Change 830957 had a related patch set uploaded (by RLazarus; author: RLazarus):

[operations/puppet@production] pcc: Warn before compiling all nodes by default

https://gerrit.wikimedia.org/r/830957

Change 830957 merged by RLazarus:

[operations/puppet@production] pcc: Warn before compiling all nodes by default

https://gerrit.wikimedia.org/r/830957

Change 852280 had a related patch set uploaded (by Jbond; author: John Bond):

[operations/software/puppet-compiler@master] worker: store catalogs as gziped file

https://gerrit.wikimedia.org/r/852280

Change 852280 merged by jenkins-bot:

[operations/software/puppet-compiler@master] worker: store catalogs as gziped file

https://gerrit.wikimedia.org/r/852280

jbond claimed this task.

I have now performed the following actions:

  • Move all reports to a shared NFS mount on puppet-diffs-nfs-1.puppet-diffs.eqiad1.wikimedia.cloud, backed by an OpenStack volume, which should make adding capacity easier and also means we only need to fix things in one place
  • Store all pson catalogues as gzipped files
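
For anyone curious what storing catalogs gzipped can look like, here is a minimal Python sketch. It is not the code from change 852280; the function names and the `.gz` naming convention are assumptions made for illustration only.

```python
import gzip
import shutil
from pathlib import Path


def gzip_catalog(path):
    """Compress path to path + '.gz' and remove the uncompressed original."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    src.unlink()
    return dst


def read_catalog(path):
    """Return the catalog text, whether or not the file is gzipped."""
    path = Path(path)
    if path.suffix == ".gz":
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return f.read()
    return path.read_text(encoding="utf-8")
```

Since pson catalogs are JSON-like text with a lot of repetition, they usually compress well, and a reader that handles both plain and `.gz` files keeps older uncompressed reports usable.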

Along with this, we have for some time had cron jobs that delete old reports. As such, I'm going to close this task; please reopen it with an update if there are still issues or further improvements to make.