Consider phasing out maintenance-sample-workspace-sizes
Closed, Resolved (Public)

Description

The maintenance-sample-workspace-sizes Jenkins job was introduced ( T258626 ) to identify an issue with CI agent partitions filling up ( T258972 ).

Every minute, on all agents, it runs du against the workspaces and sends the result to statsd / graphite. The exact command:

find /srv/jenkins/workspace/workspace -mindepth 1 -maxdepth 1 -type d -exec du -s {} \;
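
To illustrate the sampling step, here is a minimal sketch of how the per-workspace du output could be turned into statsd gauge lines. It is not the job's actual script: the function name `workspace_gauges`, the metric prefix, and the rule for sanitising directory names are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch: convert per-workspace du output into statsd gauge lines.
# The metric prefix and the name sanitising are assumptions, not the real job config.

workspace_gauges() {
    root=$1     # directory holding one subdirectory per job workspace
    prefix=$2   # hypothetical statsd metric prefix

    # Same traversal as the command above, with -k to force sizes in KiB.
    find "$root" -mindepth 1 -maxdepth 1 -type d -exec du -sk {} \; |
    while read -r size_kb path; do
        # Replace anything statsd might choke on (dots, spaces, ...) with "_".
        name=$(basename "$path" | tr -c 'A-Za-z0-9_\n-' '_')
        # statsd gauge format: <metric>:<value>|g
        printf '%s.%s:%s|g\n' "$prefix" "$name" "$size_kb"
    done
}
```

Calling `workspace_gauges /srv/jenkins/workspace/workspace some.prefix` prints one gauge line per workspace. Shipping those lines to a statsd daemon (for example by piping them to a UDP client such as `nc -u`) is left out, since the host and port are site-specific.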

That let us identify a few jobs that produced huge workspaces, notably due to the large cache being restored from Castor. The job definitely helped.

@hashar thinks that running du on all agents every minute might stress I/O (there are no metrics to back up that claim though; it is an assumption).

This task is to discuss whether we want to keep running that job every minute. Alternatives:

  • run it less often
  • collect on job completion before files get deleted
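
The second alternative could look roughly like the sketch below: a step run at the end of each build that measures only that build's workspace, instead of walking every workspace every minute. This is an illustration under assumptions, not a proposed patch: the function name `report_workspace_size` and the metric prefix are hypothetical, and the workspace path and job name are assumed to come from Jenkins' `WORKSPACE` and `JOB_NAME` environment variables.

```shell
#!/bin/sh
# Hypothetical post-build step: measure only the finished job's workspace
# before it gets deleted, and print a single statsd gauge line.

report_workspace_size() {
    workspace=$1   # e.g. "$WORKSPACE" as set by Jenkins
    job=$2         # e.g. "$JOB_NAME" as set by Jenkins
    prefix=$3      # hypothetical statsd metric prefix

    # Nothing to report if the workspace is already gone.
    [ -d "$workspace" ] || return 0

    # du -sk prints "<KiB><TAB><path>"; keep only the size.
    size_kb=$(du -sk "$workspace" | cut -f1)
    # Sanitise the job name so it is a single statsd metric segment.
    name=$(printf '%s' "$job" | tr -c 'A-Za-z0-9_-' '_')
    printf '%s.%s:%s|g\n' "$prefix" "$name" "$size_kb"
}
```

With this approach each workspace is scanned once per build rather than once per minute, at the cost of losing samples taken while a build is still running.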

Event Timeline

This task follows up on a quick chat I had with @dancy. I really like that feature, but I am afraid it might not be sustainable I/O-wise, although I can't see anything indicating that it causes stress, probably because all the files are already in the in-memory disk cache.

Change 646683 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] jjb: remove maintenance-sample-workspace-sizes

https://gerrit.wikimedia.org/r/646683

Change 646683 merged by jenkins-bot:
[integration/config@master] jjb: disable maintenance-sample-workspace-sizes

https://gerrit.wikimedia.org/r/646683

thcipriani assigned this task to hashar.