When projects get too big, they should be moved to their own volumes and sharded appropriately. We should set up reporting and alerting for when projects in 'others' exceed a certain size (or when tools / maps are about to fill up), and make sure we deal with them appropriately. This is the alternative to quotas, which we don't want to impose.
|Resolved||yuvipanda||T105720 Labs team reliability goal for Q1 2015/16|
|Open||None||T107066 Measure capacity and utilization of labs services (Tracking)|
|Declined||coren||T106476 Setup monitoring and reporting for disk space usage of each project on NFS|
I have a working script right now in my dev VM which can surface outliers in disk usage by UID at very little I/O cost, by using quota tracking (based on repquota). What remains is to turn that into an NRPE check.
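The cheap part of this approach is that repquota reads the quota accounting directly instead of walking the filesystem. A minimal sketch of the outlier-surfacing step, assuming a "users whose block count exceeds a threshold" definition of outlier (the threshold, filesystem path, and function name are illustrative, not taken from the actual script):

```shell
# filter_outliers reads repquota-style lines on stdin and prints any
# user whose block usage (third column, in kilobytes) exceeds the
# limit passed as $1. Requiring a numeric third column skips the
# repquota header lines.
filter_outliers() {
  awk -v limit="$1" '$3 ~ /^[0-9]+$/ && $3 + 0 > limit { print $1, $3 }'
}

# On the NFS server this would be driven by the quota accounting, e.g.
# (path and 10 GiB threshold are assumptions):
#   repquota -u /srv/others | filter_outliers 10485760
```

An NRPE wrapper would then map non-empty output from the filter to a WARNING or CRITICAL state.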
Thankfully, the filesystems were created with the quota inode (in case we ever needed to turn quotas on) so that can be used out of the box.
It turns out that the scheme I had thought of is considerably less useful than I initially thought: while it does an excellent job of assessing disk usage of service groups, the way the tools project is set up, the lack of consistent group ownership for files in the storage of other projects makes it unhelpful for evaluating their usage.
It might be possible to revive that technique if we ever want to implement soft usage quotas for the tools project, but as a method to evaluate when other projects should be spun off the 'others' storage, it doesn't work.
I've just finished a count of disk usage for the others filesystem, and the du alone takes several hours of heavy disk I/O, making it unworkable as an Icinga check (at least directly).
Current working theory: do an asynchronous du at very low ionice priority, project directory by project directory. Store the results in a scoreboard, and have an Icinga check read that at regular intervals. Downside: a full "round" of du checks may take a long time, meaning it could be many hours - possibly days - before we notice outliers. Possibly not a serious issue in practice.
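A sketch of how the two halves of that theory could fit together; the paths, the scoreboard line format ("epoch project kilobytes"), the function names, and the threshold are all assumptions for illustration:

```shell
#!/bin/sh
# Slow half: walk each project directory at idle I/O priority and
# rebuild the scoreboard. Meant to run from cron, not from Icinga.
collect_scoreboard() {
  base=$1 scoreboard=$2
  : > "$scoreboard.tmp"
  for dir in "$base"/*/; do
    # ionice class 3 (idle) only gets disk time when nothing else wants it
    kb=$(ionice -c3 du -sk "$dir" | cut -f1)
    printf '%s %s %s\n' "$(date +%s)" "$(basename "$dir")" "$kb" >> "$scoreboard.tmp"
  done
  mv "$scoreboard.tmp" "$scoreboard"
}

# Cheap half: the Icinga/NRPE check just reads the scoreboard on stdin
# and prints any project over the limit (in KB) passed as $1; non-empty
# output would map to a CRITICAL state.
check_scoreboard() {
  awk -v limit="$1" '$3 + 0 > limit { print $2, $3 }'
}
```

Keeping the timestamp in each scoreboard line also lets the check flag stale data, which matters given how long a round of du can take.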
tools uses service groups for ownership of files, with a setgid directory in the root for each tool - gid quota gives us a near-perfect usage count for every tool. The only way this could be extended to general projects is by making the root of each project filesystem setgid to the project-* group - but that is disruptive at best (it changes required permissions) and broken at worst (as current usage patterns sometimes require group ownership by system groups).
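The gid-quota accounting described above can be read straight out of repquota's group mode, since each tool's files carry its service group thanks to the setgid directory. A minimal sketch, assuming the tools.&lt;name&gt; group naming convention (the path and function name are made up):

```shell
# per_tool_usage reads `repquota -g` output on stdin and prints
# "tool kilobytes" for each tools service group, stripping the
# "tools." prefix from the group name. Requiring a numeric third
# column skips the repquota header lines.
per_tool_usage() {
  awk '$1 ~ /^tools\./ && $3 ~ /^[0-9]+$/ { sub(/^tools\./, "", $1); print $1, $3 }'
}

# On the tools NFS server, something like:
#   repquota -g /srv/tools | per_tool_usage | sort -k2 -rn
```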
What about not monitoring disk usage after the fact, but instead (always) creating a volume at project creation? The project owners would have to specify an estimated disk usage, and it would be up to them to monitor their project volume (with Shinken, manually, etc.). If they run out of space, the project volume can be (manually) enlarged on request.
After consideration, a project using a large fraction of the others filesystem should be rare enough that we can simply investigate if and when the filesystem itself gets low on space (especially as new projects do not use NFS by default).