One instance hammering on NFS should not make it unavailable to everyone else
Closed, Resolved · Public

Description

A single user heavily using NFS basically takes down all of Labs. It should be far more robust than that.

Event Timeline

yuvipanda raised the priority of this task to Needs Triage.
yuvipanda updated the task description.
yuvipanda added subscribers: yuvipanda, coren.
Restricted Application added a subscriber: Aklapper. Apr 10 2015, 9:25 PM
Ricordisamoa added a subscriber: Ricordisamoa.

At the time this happened I looked at https://grafana.wikimedia.org/#/dashboard/db/labs-monitoring: inbound traffic on labstore1001 spiked to nearly 100 MB/s several times, while disk IO utilization was pegged at 100% over the same period.

The following are my thoughts on making this kind of setup harder for a single user to take down accidentally. Caveat: I have not re-verified my assumptions, and my knowledge of the state of this in Linux is a few years old.

As the bottleneck is disk IO, the best solution is to share it fairly between users. This is made difficult by the Linux NFS client and server implementations: ionice hints are discarded by the client, and the server does not give the IO scheduler the information needed to distinguish between different clients/users.

If the server happens to use a separate thread for each client/user, one could try to put each of those threads into a different IO-scheduler group (see the sketch below).
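A minimal, unverified sketch of that idea, assuming each client really is served by a dedicated nfsd thread (which I have not confirmed) and a cgroup-v1 blkio controller with the CFQ scheduler. The group name and thread id below are hypothetical.

```
# Unverified sketch: put the nfsd thread serving one client into its own
# cgroup-v1 blkio group so CFQ shares disk time between the groups.
import os

CGROUP_ROOT = "/sys/fs/cgroup/blkio"  # assumes cgroup v1 with blkio mounted here

def isolate_nfsd_thread(group: str, tid: int, weight: int = 500) -> None:
    """Create a blkio group with a relative weight and move one thread into it."""
    path = os.path.join(CGROUP_ROOT, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "blkio.weight"), "w") as f:
        f.write(str(weight))  # relative share, 100-1000 under CFQ
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(tid))     # tid of the nfsd thread (hypothetical)

# e.g. isolate_nfsd_thread("nfs-client-tools-exec-01", 1234)
```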

Another idea is to make network IO the bottleneck, so that fairness can be introduced at that point: either traffic-shape the maximum incoming network bandwidth of each client connection, or cap the overall incoming bandwidth and then share it fairly (with e.g. HFSC) across the NFS connections.
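A minimal sketch of the second variant, assuming a Linux NFS server with one NIC and known client addresses; the device name, rates, and IPs below are made up. Note that tc natively shapes egress only, so policing true incoming traffic would additionally need an ifb redirect (omitted here), but the HFSC class layout is the same either way.

```
# Hypothetical sketch: cap total bandwidth with HFSC, then give each NFS
# client a link-sharing class so they split the cap fairly under contention.
import subprocess

DEV = "eth0"        # assumed interface name
TOTAL = "800mbit"   # assumed overall cap

def tc(*args: str) -> None:
    subprocess.run(["tc", *args], check=True)

# Root HFSC qdisc; unmatched traffic falls into class 1:99.
tc("qdisc", "add", "dev", DEV, "root", "handle", "1:", "hfsc", "default", "99")
tc("class", "add", "dev", DEV, "parent", "1:", "classid", "1:1",
   "hfsc", "sc", "rate", TOTAL, "ul", "rate", TOTAL)
tc("class", "add", "dev", DEV, "parent", "1:1", "classid", "1:99",
   "hfsc", "ls", "rate", "50mbit", "ul", "rate", TOTAL)

# One link-sharing class per client: equal shares when contended,
# free to borrow up to the parent's ceiling when the link is idle.
for i, client_ip in enumerate(["10.68.16.10", "10.68.16.11"], start=10):  # made-up IPs
    tc("class", "add", "dev", DEV, "parent", "1:1", "classid", f"1:{i}",
       "hfsc", "ls", "rate", "50mbit", "ul", "rate", TOTAL)
    tc("filter", "add", "dev", DEV, "protocol", "ip", "parent", "1:",
       "prio", "1", "u32", "match", "ip", "dst", f"{client_ip}/32",
       "flowid", f"1:{i}")
```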

I wonder how Ceph would compare in this case...

coren added a comment. Apr 15 2015, 3:17 PM

NFS indeed does not allow us to know which end user is responsible for any specific traffic, as an unavoidable consequence of the levels of abstraction in play*. Regardless, it may well be possible to shape traffic at the /instance/ resolution which, while not granular enough for a per-user fair distribution of resources, would generally still allow us to prevent a single errant operation from affecting all of Labs.

This wouldn't be perfect, as the relationship between NFS network traffic and actual IO bandwidth on the server is not 1:1, but it would have prevented the specific issue that caused the brief outage prompting this ticket. In addition, to be useful the rate limit needs to be somewhat conservative, so there would be a noticeable impact on general NFS speed even when IO bandwidth has not reached its limit.

All of that said, some of those downsides may well be blessings in disguise. Inter alia, discouraging very expensive NFS I/O in favor of instance-local storage or forms of storage other than filesystems (databases, object storage) is likely to be beneficial in the longer term, because those all scale better.

  * This is not /strictly/ true, as file operations are associated with a uid at the VFS level, but there are no hooks in place that would allow us to rate-limit at that level.
chasemp triaged this task as High priority. Nov 30 2015, 5:23 PM
chasemp added a subscriber: chasemp.

On VM usage patterns for writes:

https://phabricator.wikimedia.org/T126083#2004459

I am looking at shaping traffic with tc to some extent (though pretty liberally) to keep outliers within limits our current NFS setup can sustain.
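For illustration only (this is not necessarily what the actual change does): a liberal per-instance cap could be as simple as a token-bucket qdisc on each client VM, set well above normal use so that only outliers ever hit it. The device name and rate are assumptions.

```
# Illustrative only: one liberal token-bucket cap on an instance's egress,
# so only outlier write loads get throttled. Note this caps ALL egress,
# not just NFS; a u32 filter matching the NFS server's address would be
# needed to narrow it to NFS traffic.
import subprocess

DEV = "eth0"      # assumed instance interface
CAP = "300mbit"   # assumed "pretty liberal" per-instance ceiling

subprocess.run(
    ["tc", "qdisc", "add", "dev", DEV, "root", "tbf",
     "rate", CAP, "burst", "2mbit", "latency", "50ms"],
    check=True,
)
```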

chasemp closed this task as Resolved. Mar 2 2016, 11:37 PM

I believe that with the rollout of https://gerrit.wikimedia.org/r/#/c/272900/ this has improved greatly. It is still possible to overwhelm the modest NFS setup, but it is now considerably harder and requires concerted load from several clients. I am going to close this for now, though more work to improve the situation is ongoing.