
compiler1003.puppet-diffs.eqiad1.wikimedia.cloud out of disk space
Closed, Resolved · Public

Description

===== NODE GROUP =====
(1) compiler1003.puppet-diffs.eqiad1.wikimedia.cloud
----- OUTPUT of 'df -h /srv' -----
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk   60G   57G     0 100% /srv
===== NODE GROUP =====
(1) compiler1002.puppet-diffs.eqiad1.wikimedia.cloud
----- OUTPUT of 'df -h /srv' -----
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk   60G   54G  3.5G  94% /srv
===== NODE GROUP =====
(1) compiler1001.puppet-diffs.eqiad1.wikimedia.cloud
----- OUTPUT of 'df -h /srv' -----
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vd-second--local--disk   60G   43G   15G  76% /srv
================
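A check like the one above can be scripted so that nearly full hosts stand out. The following is only a minimal sketch: the function names `parse_df` and `nearly_full`, the short host labels, and the 90% threshold are assumptions for illustration, and it assumes mount points contain no spaces.

```python
def parse_df(output):
    """Return (filesystem, use_percent, mount_point) for each data row of `df -h` output."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly six columns; the header has seven ("Mounted on").
        if len(fields) != 6:
            continue
        try:
            use = int(fields[4].rstrip("%"))
        except ValueError:
            continue  # not a Use% value, so not a data row
        rows.append((fields[0], use, fields[5]))
    return rows


def nearly_full(reports, threshold=90):
    """Return (host, mount_point, use_percent) for rows at or above threshold, fullest first."""
    flagged = []
    for host, output in reports.items():
        for _filesystem, use, mount in parse_df(output):
            if use >= threshold:
                flagged.append((host, mount, use))
    return sorted(flagged, key=lambda item: -item[2])


HEADER = "Filesystem                          Size  Used Avail Use% Mounted on\n"
# The rows below are copied from the task description; host labels are shortened.
reports = {
    "compiler1001": HEADER + "/dev/mapper/vd-second--local--disk   60G   43G   15G  76% /srv",
    "compiler1002": HEADER + "/dev/mapper/vd-second--local--disk   60G   54G  3.5G  94% /srv",
    "compiler1003": HEADER + "/dev/mapper/vd-second--local--disk   60G   57G     0 100% /srv",
}

for host, mount, use in nearly_full(reports):
    print(f"{host}: {mount} at {use}%")
```

With the figures from the description, this flags compiler1003 and compiler1002 and leaves compiler1001 out.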

This is causing patches to fail: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/1060/console

Event Timeline

taavi triaged this task as High priority. Nov 7 2021, 2:24 PM
taavi created this task.
Restricted Application added a subscriber: Aklapper.

This is a recurring problem; see for example T273599, T222072, and T295253. T222075 has ideas on how to tackle the issue. I tried to access the instance to free up some space, but I don't seem to have the permissions to do so.

I cleaned up old jobs from /srv/jenkins-workspace/puppet-compiler. The cleanup jobs delete-old-output-files.service and delete-old-output-large-reports.service don't seem to be running for some reason:

# systemctl list-timers
NEXT                         LEFT          LAST                         PASSED               UNIT                                            ACTIVATES
n/a                          n/a           Fri 2021-02-26 20:52:32 UTC  8 months 11 days ago delete-old-output-files.timer                   delete-old-output-files.service
n/a                          n/a           Fri 2021-02-26 20:52:32 UTC  8 months 11 days ago delete-old-output-large-reports.timer           delete-old-output-large-reports.service
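The symptom in the output above (a timer with no NEXT run scheduled) can be detected by scanning the `systemctl list-timers` text. This is a rough sketch only: `stalled_timers` is a hypothetical helper, and treating `-` as equivalent to `n/a` is an assumption about how other systemd versions may print the column.

```python
def stalled_timers(list_timers_output):
    """Return the names of timer units whose NEXT column shows no scheduled run."""
    stalled = []
    for line in list_timers_output.splitlines():
        tokens = line.split()
        # Assumption: an unscheduled timer starts with "n/a" (as in the task) or "-".
        if not tokens or tokens[0] not in ("n/a", "-"):
            continue
        stalled.extend(token for token in tokens if token.endswith(".timer"))
    return stalled


# Sample rows copied from the outputs quoted in this task.
broken = """\
NEXT                         LEFT          LAST                         PASSED               UNIT                                            ACTIVATES
n/a                          n/a           Fri 2021-02-26 20:52:32 UTC  8 months 11 days ago delete-old-output-files.timer                   delete-old-output-files.service
n/a                          n/a           Fri 2021-02-26 20:52:32 UTC  8 months 11 days ago delete-old-output-large-reports.timer           delete-old-output-large-reports.service
"""
fixed = """\
Tue 2021-11-09 11:09:05 UTC  21h left      Fri 2021-02-26 20:52:31 UTC  8 months 11 days ago delete-old-output-large-reports.timer           delete-old-output-large-reports.service
Tue 2021-11-09 13:14:00 UTC  23h left      Fri 2021-02-26 20:52:31 UTC  8 months 11 days ago delete-old-output-files.timer                   delete-old-output-files.service
"""

print(stalled_timers(broken))
print(stalled_timers(fixed))
```

Parsing the human-readable table is fragile, so a check like this is best treated as a quick diagnostic and not as monitoring.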

After joe's patch the timers are working again. I have also created T295284 to fix other occurrences of this issue and to try to get some testing for this in the define itself.

$ systemctl list-timers |
Tue 2021-11-09 11:09:05 UTC  21h left      Fri 2021-02-26 20:52:31 UTC  8 months 11 days ago delete-old-output-large-reports.timer           delete-old-output-large-reports.service
Tue 2021-11-09 13:14:00 UTC  23h left      Fri 2021-02-26 20:52:31 UTC  8 months 11 days ago delete-old-output-files.timer                   delete-old-output-files.service

@ema I have added you to the project now, so you should be able to access these.

taavi claimed this task.