Setup monitoring and reporting for disk space usage of each project on NFS
Closed, Declined · Public


When projects get too big, they should be moved to their own volumes and sharded appropriately. We should set up reporting and alerting for when projects in 'others' exceed a certain size (or tools / maps are about to fill up) so that we can take care of them appropriately. This is the alternative to quotas, which we don't want to impose.

yuvipanda updated the task description. (Show Details)
yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda added a project: Cloud-Services.
yuvipanda added subscribers: coren, mark, Ricordisamoa and 2 others.
coren added a comment.Jul 27 2015, 2:43 PM

We may still want to turn quota accounting on to do the monitoring itself (just without enforcing any limits): a du over millions of files is very I/O-intensive and takes a very long time.

coren moved this task from To Do to Doing on the Labs-Sprint-107 board.Jul 29 2015, 2:35 PM
coren claimed this task.Aug 3 2015, 5:35 PM
coren added a project: Labs-Sprint-108.
coren moved this task from To Do to Doing on the Labs-Sprint-108 board.Aug 10 2015, 4:31 PM
coren added a comment.Aug 10 2015, 4:35 PM

I have a working script right now in my dev VM which is able to surface outliers in disk usage by UID at very little I/O cost by using quota tracking (based on repquota). What remains is to make an NRPE check out of it.

Thankfully, the filesystems were created with the quota inode (in case we ever needed to turn quotas on) so that can be used out of the box.
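The approach described above can be sketched roughly as a small filter over repquota output that surfaces accounts above a size threshold. This is an illustrative sketch, not the actual script from this task; the column layout assumed is the usual one for Linux quota-tools (name, grace flags, blocks used, soft, hard, ...), which may vary by version, and the threshold is made up.

```shell
# Hypothetical sketch of a repquota-based outlier check. Reads
# `repquota -u <fs>` output on stdin and prints accounts whose block
# usage exceeds a threshold, largest first.
quota_outliers() {
    thr="$1"   # threshold in 1K blocks
    # Data rows look like: "<name>  --  <used> <soft> <hard> ...";
    # the second field is the two-character block/inode grace flag.
    awk -v thr="$thr" '$2 ~ /^[-+][-+]$/ && $3 + 0 > thr { print $1 "\t" $3 }' |
        sort -k2,2nr
}

# Example usage (size values illustrative):
#   repquota -u /srv/others | quota_outliers 10485760   # flag users over 10 GiB
```

Because the kernel keeps the per-UID usage counters up to date as part of quota accounting, this costs essentially no extra I/O, unlike a du over the whole tree.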

coren moved this task from To Do to Doing on the Labs-Sprint-109 board.Aug 10 2015, 5:57 PM
coren moved this task from To do to Doing on the labs-sprint-113 board.Sep 10 2015, 7:12 PM
coren added a comment.Sep 11 2015, 1:07 PM

It turns out that the scheme I had thought of is considerably less useful than I initially believed: while it does an excellent job of assessing the disk usage of service groups, as the tools project is set up[1], the inconsistent group ownership of files in the storage of other projects makes it unhelpful for evaluating their usage.

It might be possible to revive that technique if we ever want soft usage quotas for the tools project, but as a method to evaluate when other projects should be spun off the 'others' storage, it does not work.

I have just finished a count of disk usage for the 'others' filesystem, and the du alone takes several hours of heavy disk I/O, making it unworkable as an icinga check (at least directly).

Current working theory: run an asynchronous du at very low ionice priority, project directory by project directory. Store the results in a scoreboard, and have an icinga check read it at regular intervals. Downside: a full "round" of du checks may take very many hours, possibly days, so it could be a long time before we notice outliers. Probably not a serious issue in practice.

[1] tools uses service groups for file ownership, with a sgid directory in the root for each tool; gid quotas give us a near-perfect usage count for every tool. The only way this could be extended to general projects is by making each project filesystem's root sgid to the project-* group - but that is disruptive at best (it changes required permissions) or broken at worst (as current usage patterns sometimes require group ownership by system groups)

scfc added a subscriber: scfc.Sep 11 2015, 2:49 PM

What about not monitoring disk usage after the fact, but instead (always) creating a volume at project creation? The project owners would have to specify an estimated disk usage, and it would be up to them to monitor their project volume (with shinken, manually, etc.). If they run out of space, the project volume can be (manually) enlarged on request.

coren moved this task from Doing to Done on the labs-sprint-113 board.Sep 14 2015, 4:36 PM
coren closed this task as Declined.

After consideration, the occurrence of a project using a large fraction of the 'others' filesystem should be rare enough that we can simply investigate if and when the filesystem itself is getting low on space (especially as new projects do not use NFS by default).