
Test different NFS mount, export options and methods for failing over and coping with loss of a server
Open, Stalled, Medium, Public

Description

There have been several issues with NFS servers that are undergoing maintenance or failover. In one case, locks stayed open and the NFS server had to be rebooted during failover (like in the parent task). In another, a non-active server that is used via a symlink caused anything that had once connected to it to blow up with load, even clients that weren't using any files and had since unmounted it. That is what happened in T196651: whenever we shut one down, regardless of use, load skyrocketed throughout toolforge.

We need to, in a relatively safe environment, go over our options for how we can reduce downtime when an NFS server is

  1. shut down
  2. failed over

It may just be that we need a new way to fail them over, which might require unmounting everybody or using autofs.

Details

Related Gerrit Patches:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

Event Timeline

Bstorm triaged this task as High priority. Sep 4 2018, 3:12 PM
Bstorm created this task.
Restricted Application added a subscriber: Aklapper. Sep 4 2018, 3:12 PM
Bstorm added a comment. Sep 4 2018, 3:12 PM

I'm imagining this being done in a Cloud VPS, so it should touch a lot of things.

Bstorm lowered the priority of this task from High to Medium. Sep 4 2018, 3:14 PM
Bstorm assigned this task to GTirloni. Oct 1 2018, 12:44 PM
Bstorm added a subscriber: GTirloni.

Note: @GTirloni has already created the VPS with its own puppet master and helped me test some things. So this is coming right along :)

For context, we have two ways of failing over here. One for the tools and project NFS (https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/role/templates/labs/nfs/nfs-manage.sh.erb) and one for the dumps NFS (which is a series of symlinks and notes in puppet). Neither of them works in any kind of "high availability" sense. They work insofar as you eventually have a stable environment on the other server, but getting there is likely to set off every alarm we have, call in multiple team members and take longer than 10 minutes to stabilize, even with plenty of experience to draw from.

I'll provide more context here around the dumps failover since it is somewhat arcane. On dumps, we've ended up in a situation where the inactive server was down for a while, with nothing using it except having it mounted, and after we removed the entry from /etc/fstab across the board, set up a dummy interface on client servers and had long since unmounted it everywhere--clients were still crashing down (especially k8s workers--where you can imagine docker interactions were part of it, but everything else as well).
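As an aside, the "unmounted it everywhere" step can be checked per client with something like the sketch below. It is only an illustration: the function name is made up, the sample text and its paths are hypothetical, and it assumes the usual `/proc/mounts` layout (device, mount point, filesystem type, ...) with the server given as a plain hostname.

```python
def nfs_mounts_from(server, mounts_text):
    """Return the mount points of NFS filesystems served by `server`.

    `mounts_text` is the content of /proc/mounts (or a sample of it).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type = fields[:3]
        # NFS devices look like "host:/exported/path"
        if fs_type in ("nfs", "nfs4") and device.split(":", 1)[0] == server:
            found.append(mount_point)
    return found


# Hypothetical sample; on a real client, read /proc/mounts instead.
SAMPLE = """\
/dev/vda1 / ext4 rw,relatime 0 0
nfs-server-01:/srv/example /mnt/example nfs4 rw,soft,timeo=300 0 0
"""

print(nfs_mounts_from("nfs-server-01", SAMPLE))
```

An empty list only tells you the kernel no longer lists the mount. As described above, clients misbehaved even after that point, so this is a sanity check and not a guarantee.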

The painful experiences of acting on these HA mechanisms are why we are working in a VPS. There we should be safe enough to experiment and try things to see what will actually work. Two more drastic things jump to mind immediately for the tools-projects NFS: ganesha-nfs and using a single filesystem for nfs4 export without the bind mounts that refuse to unmount during failover. Maybe 1/3 of our problems appear to be related to backporting kernels with NFS onboard. Ganesha-nfs, while not being the most mature thing on the planet, solves that issue.

Bstorm added a comment. Oct 1 2018, 1:09 PM

The dumps failover has a web element that can be ignored for purposes of this ticket (requiring a manual hiera change to set which server gets the https cert). That said, these values are what are used to determine who has NFS: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common.yaml#332
That includes dumps_dist_active_web, which is actually which NFS server is the web server, and tells a class of servers to use it for NFS as well. All of these become the same value during a failover. If you search for these hiera values, you'll see all the symlinks and whatnot around them. Again, it doesn't work great, but it may actually be more elegant than the tools-project one, even if it caused a longer outage recently. Dumps doesn't have DRBD, thus it doesn't use the script to failover. Most of the data there is just a copy from another place (except things found in the /other folder, which are duplicated manually to both of them). The dumps servers are not backed up. The DRBD mechanism is there partly so we can snapshot and sync backups of the inactive volumes to codfw, not just so that we have an active-passive pair for HA.

Bstorm added a comment. Oct 1 2018, 1:12 PM

Oh yeah, other things we don't do here right now: pacemaker and heartbeat solutions for automated failover (there was some old bad experience with them, but I'm not ruling that out), or at least possibly smoother failover if a more firm STONITH could be set up, and autofs (no idea why we don't do autofs).

GTirloni added a comment. Edited Oct 5 2018, 8:20 PM

On the topic of high load average and iowait whenever there are issues with the NFS server, one possible solution is to make things more "dynamic" in two ways. The first is to switch from hard to soft mounts, so that I/O operations error out once timeo and retrans are reached. The second is to change rsize/wsize to smaller values so each RPC operation won't carry a lot of data (today they are set to 1 MB each). The change to soft mounts would mean that applications get errors and are on their own to recover from them, while today with hard mounts we keep retrying forever and increasing the load average (since I/O wait is part of the load average equation on Linux).
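To make the soft-mount trade-off concrete, here is a minimal Python sketch that builds a mount option string and roughly estimates how long a soft mount keeps retrying before the application sees an error. It assumes the linear backoff described in nfs(5) (timeo is in tenths of a second, each retry waits one timeo longer, capped at 600 seconds). The function names and sample values are illustrative, not our production settings, and the kernel's exact behaviour may differ.

```python
MAX_TIMEOUT_DS = 6000  # nfs(5): backoff is capped at 600 seconds (in deciseconds)


def soft_mount_options(timeo_ds, retrans, rsize, wsize):
    """Build an NFS mount option string for a soft mount."""
    return f"soft,timeo={timeo_ds},retrans={retrans},rsize={rsize},wsize={wsize}"


def approx_seconds_until_error(timeo_ds, retrans):
    """Rough upper bound on the time before a soft mount returns an error.

    Assumes the initial wait is `timeo_ds` and every retransmission waits
    one `timeo_ds` longer than the previous attempt, up to the cap.
    """
    waits = [min(timeo_ds * attempt, MAX_TIMEOUT_DS)
             for attempt in range(1, retrans + 2)]
    return sum(waits) / 10


# Illustrative values only: 30 s timeo, 2 retransmissions, 64 KiB transfers.
print(soft_mount_options(timeo_ds=300, retrans=2, rsize=65536, wsize=65536))
print(approx_seconds_until_error(timeo_ds=300, retrans=2))
```

The point of the estimate is that timeo and retrans together decide how long a process sits in iowait before it gets an error back, so they have to be tuned as a pair.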

A totally non-scientific test with various rsize/wsize values between nfs-client-01 and nfs-server-01 while reading a 2GB file over NFS (iperf measured 28MB/s on the network):

2k - throughput: 27MB/s, iowait: 27
4k - throughput: 28MB/s, iowait: 32
8k - throughput: 28MB/s, iowait: 37
16k - throughput: 28MB/s, iowait: 39
32k - throughput: 28MB/s, iowait: 40
64k - throughput: 28MB/s, iowait: 40
128k - throughput: 28MB/s, iowait: 40
256k - throughput: 28MB/s, iowait: 41
512k - throughput: 28MB/s, iowait: 42
1024k - throughput: 28MB/s, iowait: 41

I also included some statistics collected from mountstats on all Tools servers. It seems a 64KB (or even 32KB) rsize/wsize might be doable.
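For a sense of what smaller rsize values mean in terms of RPC count, the sketch below takes the numbers from the test above and computes roughly how many READ operations a 2GB file needs at each size. It assumes the file is exactly 2 GiB and that every READ carries a full rsize of data, which is a simplification.

```python
FILE_BYTES = 2 * 1024**3  # assumption: the "2GB" test file is 2 GiB

# rsize/wsize in KiB -> (throughput in MB/s, iowait), as reported above
RESULTS = {
    2: (27, 27), 4: (28, 32), 8: (28, 37), 16: (28, 39), 32: (28, 40),
    64: (28, 40), 128: (28, 40), 256: (28, 41), 512: (28, 42), 1024: (28, 41),
}


def read_rpcs(file_bytes, rsize_kib):
    """Number of READ RPCs needed if each one carries `rsize_kib` KiB."""
    rsize_bytes = rsize_kib * 1024
    return -(-file_bytes // rsize_bytes)  # integer ceiling division


for size_kib, (mb_s, iowait) in RESULTS.items():
    rpcs = read_rpcs(FILE_BYTES, size_kib)
    print(f"{size_kib:>5}k  {mb_s} MB/s  iowait {iowait}  ~{rpcs} READ RPCs")
```

Going from 1024k to 64k multiplies the number of RPCs by 16, while the measured throughput stayed at 28MB/s in this test.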

We use soft mounting in many places, but not all. It's still given us trouble, but you'll find we have fairly aggressive settings for timeo and retrans as well. Throughput measurements are awesome! Thanks. Bonnie++ is also installable on our clients in the test environment if you want to use a full benchmark suite for anything. I already put it on some of the clients I'd made.

GTirloni added a subscriber: bd808. Dec 28 2018, 6:24 PM

@Bstorm @bd808 should I continue looking into this? If so, any pointers where I should look next for improving the situation?

Bstorm added a comment. Jan 4 2019, 4:48 AM

The general notion of this ticket (although we've been distracted with the load problems on the tools/project NFS cluster) is that our current failover scheme is fundamentally broken. If you fail over, you have to reboot things just to get it to let go of the bind mounts. On the dumps cluster, failing over happens via symlinks, but if something goes down the mounts end up so badly broken anyway that it's really hard to recover toolforge.

So, two thoughts:

  1. If the mount setup can be done in a fashion that fails over better on a clone of the tools/project cluster, that might be nice. As is, the bind mounts on labstore1004/5 refuse to let go, so you have to reboot the node and do a pretty ugly failover (usually after lots of alarms have gone off). If all the exports were in the same filesystem, then we don't need the bind mounts, for instance. However, there might be some issues with that (since we have no quotas, for example).
  2. This whole ticket might be obsoleted by CephFS...and the problems with dumps failover might be resolved somewhat by k8s updates (which are, indeed, in our next quarter goals). Our NFS failover story is just pretty unfortunate (as much as the problems with IO and load on the main cluster).

Bstorm added a comment. Jan 4 2019, 4:58 AM

To put it another way, if you find something cool to fix this using some of the test environment we set up, it'd be great (by simplifying the mounts, finding some hidden options or trying a different DRBD module or some such nonsense), but if CephFS does it better in the course of that testing, let's put the effort there instead. That would be a subtask of this no matter what, just in case it fixed this along with other issues.

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:11 PM
GTirloni removed GTirloni as the assignee of this task. Mar 23 2019, 6:43 PM
GTirloni added a subscriber: GTirloni.
GTirloni removed a subscriber: GTirloni.
bd808 moved this task from Backlog to Shared Storage on the Data-Services board. May 30 2019, 7:04 PM
Bstorm changed the task status from Open to Stalled. Aug 6 2019, 1:55 PM

We are not really working on things like this in favor of ceph for now.

Change 571821 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821

Change 571821 merged by Bstorm:
[operations/puppet@production] cloudstore: remove dependency on bind mounts

https://gerrit.wikimedia.org/r/571821