A single user heavily using NFS basically takes down all of Labs. It should be far more robust than that.
At the time this happened I looked at https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring: inbound traffic on labstore1001 spiked to nearly 100MBps a few times, while IO utilization was pinned at 100% over the same period.
The following are my thoughts on making this type of setup harder for a single user to accidentally take down. I have not re-verified my assumptions, and my knowledge of the state of this in Linux is a few years old.
As the bottleneck is disk IO, the best solution is to share it fairly between users. This is made difficult by the Linux NFS client and server implementations: ionice hints are discarded by the client, and the server does not give the IO scheduler the information needed to distinguish between different clients/users.
If the server were to use a separate thread for each client/user, one could try to put each thread into a different IO-scheduler group.
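For illustration only: if per-client server threads did exist (which, as noted, is speculative), the grouping could look something like the blkio-cgroup sketch below. All paths, PIDs, and weights are made up, and proportional blkio weights only take effect under the CFQ IO scheduler; moving kernel nfsd threads into cgroups may not even be permitted.

```shell
# Hypothetical sketch: one blkio cgroup per NFS client, equal IO weight.
mkdir /sys/fs/cgroup/blkio/nfs-client-a
mkdir /sys/fs/cgroup/blkio/nfs-client-b

# Equal proportional weights -> fair share of disk time under CFQ.
echo 500 > /sys/fs/cgroup/blkio/nfs-client-a/blkio.weight
echo 500 > /sys/fs/cgroup/blkio/nfs-client-b/blkio.weight

# Assign the (hypothetical) per-client server threads; PIDs are made up.
echo 1234 > /sys/fs/cgroup/blkio/nfs-client-a/tasks
echo 1235 > /sys/fs/cgroup/blkio/nfs-client-b/tasks
```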
Another idea is to make network IO the bottleneck, so that fairness can be introduced at that point. Traffic shaping the maximum incoming network bandwidth on each client connection might help. Alternatively, one could limit the overall incoming network bandwidth and then share it fairly (with e.g. HFSC) across the NFS connections.
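A rough sketch of the HFSC idea, with one caveat: tc only shapes egress, so incoming NFS traffic on the server would first have to be redirected to an ifb device and shaped there. The interface, rates, and client IPs below are all illustrative placeholders.

```shell
# Redirect ingress on eth0 to ifb0 so it can be shaped.
modprobe ifb numifbs=1
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0

# HFSC on ifb0: overall cap, equal guaranteed shares per client class.
tc qdisc add dev ifb0 root handle 1: hfsc default 10
tc class add dev ifb0 parent 1:  classid 1:1  hfsc sc rate 80mbit ul rate 80mbit
tc class add dev ifb0 parent 1:1 classid 1:10 hfsc sc rate 40mbit ul rate 80mbit
tc class add dev ifb0 parent 1:1 classid 1:20 hfsc sc rate 40mbit ul rate 80mbit

# Classify by client source IP (placeholder addresses).
tc filter add dev ifb0 parent 1: protocol ip u32 match ip src 10.68.0.11 flowid 1:10
tc filter add dev ifb0 parent 1: protocol ip u32 match ip src 10.68.0.12 flowid 1:20
```

Each class is guaranteed half the cap but may borrow up to the full rate when the other is idle, which is the behaviour you would want here.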
I wonder how Ceph would compare in this case...
NFS indeed does not allow us to know which enduser is responsible for any specific traffic, as an unavoidable consequence of the levels of abstraction that are in play.* Regardless, it may well be possible to shape traffic at the /instance/ resolution which, while not granular enough for a per-user fair distribution of resources, would - generally - still allow us to prevent a single errant operation from affecting all of labs.
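Per-instance fairness could also be approximated without maintaining an explicit filter per instance, by hashing on the source address inside a capped class. A sketch, assuming traffic has already been redirected to an ifb device, with illustrative rates:

```shell
# Cap the aggregate, then let an SFQ leaf hash flows by source address
# so each instance gets a roughly equal share of the capped rate.
tc qdisc add dev ifb0 root handle 1: hfsc default 1
tc class add dev ifb0 parent 1: classid 1:1 hfsc sc rate 80mbit ul rate 80mbit
tc qdisc add dev ifb0 parent 1:1 handle 10: sfq perturb 10

# Override SFQ's default 5-tuple hash to key on source IP only, so all
# traffic from one instance lands in one bucket.
tc filter add dev ifb0 parent 10: protocol ip handle 1 flow hash keys src divisor 1024
```

This trades precision for simplicity: no per-instance state to manage, at the cost of occasional hash collisions between instances.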
This wouldn't be perfect, as the relationship between NFS network traffic and actual IO bandwidth on the server is not 1:1, but it would have prevented the specific issue that caused the brief outage that prompted this ticket. In addition, to be useful the rate limit needs to be somewhat conservative, so there would be a noticeable impact on general NFS speed even when IO bandwidth is not at its limit.
All of that said, some of those downsides may well be blessings in disguise. Inter alia, discouraging very expensive NFS I/O in favor of instance-local storage, or of forms of storage other than the filesystem (databases, object storage), is likely to be beneficial in the longer term because those all scale better.
* This is not /strictly/ true, as file operations are associated with a uid at the VFS level, but there are no hooks in place that would allow us to rate-limit at that level.
I believe that with the rollout of https://gerrit.wikimedia.org/r/#/c/272900/ this has improved greatly. It is still possible to overwhelm the modest NFS setup, but it is certainly less easy and requires concerted load across several clients. I am going to close this for now, though more work is ongoing to improve the situation.