
Plan to make clouddumps more resilient and easier to operate
Open, Medium, Public

Description

The following is a plan to improve clouddumps reliability, with priority TBD and the following goals:

  1. Reboots of clouddumps hosts should be a non-event
  2. Failover should be simple, quick, reliable and straightforward to do

Currently the clouddumps hosts serve two purposes: NFS server towards Cloud VPS and production (wdqs, stat hosts), and web serving of dumps.w.o. The active host for each purpose is controlled by Puppet via dumps_dist_active_vps and dumps_dist_active_web respectively; failover is performed according to https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Dumps#Failover.

I (Filippo) propose the following: the clouddumps service refers to a new single floating IP address (e.g. announced via BGP from the host) that lives on the active server. In this scenario failover happens on the server, as opposed to on the clients, and is still controlled by Puppet. When failover happens, NFS clients will reconnect automatically once the NFS server moves over; similarly, clients will follow the IP once it moves (i.e. no CNAME flipping). NFS clients will mount a single NFS share pointing to said IP, and ditto for web clients of dumps.w.o.

In practice failover would look like the following:

  1. We flip the puppet variable to point to the new active host
  2. Puppet runs on the now-active host: the IP moves and the NFS server is started. Puppet enables the required systemd units so they are active now and at boot
  3. Puppet runs on the previously-active host: it stops the NFS server and stops announcing the IP address. It also disables the systemd units so they don't come up at boot
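
As a minimal sketch of steps 2 and 3, the state Puppet converges each host to can be derived from a single "active host" variable. The function name, dictionary keys and hostnames below are hypothetical and only illustrate the idea; they are not taken from the actual Puppet code.

```python
def desired_state(hostname: str, active_host: str) -> dict:
    """Return the state Puppet would converge `hostname` to.

    `active_host` stands in for the Puppet variable flipped in step 1.
    All keys are illustrative names, not real Puppet resources.
    """
    is_active = hostname == active_host
    return {
        "announce_ip": is_active,            # floating IP lives on the active host only
        "nfs_server_running": is_active,     # NFS server runs on the active host only
        "units_enabled_at_boot": is_active,  # passive host must not come up serving
    }


# Hypothetical hostnames: after flipping the variable to "clouddumps-b",
# that host is told to serve and "clouddumps-a" is told to stand down.
for name in ("clouddumps-a", "clouddumps-b"):
    print(name, desired_state(name, active_host="clouddumps-b"))
```

The point of the design is that the two hosts never need to talk to each other: each one only compares its own name to the variable.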

There is an obvious race between Puppet runs, which we break by making IP announcement + NFS server start depend on being able to acquire an etcd lock. We effectively serialize Puppet runs: the now-active host waits for the lock to be released on the previously-active host before moving over the IP address and starting the NFS server. Serializing this way also takes care of the situation where the active host is hard-down for a long period of time (e.g. hardware failure) and we fail over while it is down:

  1. Active host is hard-down
  2. We fail over in Puppet; the lock can be acquired and thus IP + NFS move over to the now-active host
  3. At some point the previously-active host comes back before Puppet has a chance to run, tries to acquire the lock and fails
  4. Puppet now has a chance to run on the host that just came back and disables the IP + NFS server units for good
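
The hard-down sequence above can be sketched with an in-memory stand-in for the lock. This is an illustration only: it assumes the lock is bound to a lease with a TTL, so a holder that is hard-down loses the lock once the lease expires, and it uses a logical clock instead of a real etcd client. `LeaseLock`, `Host`, `start_units`, `puppet_run` and the hostnames are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """In-memory stand-in for an etcd lock bound to a lease (assumption)."""
    ttl: int
    holder: Optional[str] = None
    expires_at: int = 0

    def acquire(self, owner: str, now: int) -> bool:
        # Assumption: a holder that stops refreshing loses the lock on expiry.
        if self.holder is not None and now >= self.expires_at:
            self.holder = None
        if self.holder is None or self.holder == owner:
            self.holder = owner
            self.expires_at = now + self.ttl  # re-acquiring refreshes the lease
            return True
        return False

    def release(self, owner: str) -> None:
        if self.holder == owner:
            self.holder = None


@dataclass
class Host:
    name: str
    serving: bool = False  # True while announcing the IP and running NFS


def start_units(host: Host, lock: LeaseLock, now: int) -> bool:
    """IP announcement + NFS start, gated on acquiring the lock."""
    host.serving = lock.acquire(host.name, now)
    return host.serving


def puppet_run(host: Host, active_host: str, lock: LeaseLock, now: int) -> bool:
    """One Puppet run: serve only if designated active and the lock is ours."""
    if host.name == active_host:
        return start_units(host, lock, now)
    host.serving = False
    lock.release(host.name)
    return False


def hard_down_scenario():
    lock = LeaseLock(ttl=30)
    a, b = Host("clouddumps-a"), Host("clouddumps-b")
    puppet_run(a, "clouddumps-a", lock, now=0)   # a is active and holds the lock
    a.serving = False                            # 1. a goes hard-down
    puppet_run(b, "clouddumps-b", lock, now=40)  # 2. lease expired, b takes over
    start_units(a, lock, now=50)                 # 3. a boots, units fail to get the lock
    puppet_run(a, "clouddumps-b", lock, now=60)  # 4. Puppet disables a for good
    return a.serving, b.serving, lock.holder
```

Calling `hard_down_scenario()` returns `(False, True, "clouddumps-b")`: the host that came back never serves, because its units cannot acquire the lock before Puppet disables them.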

We do depend on existing etcd for failover to happen, although I think that is acceptable given etcd's track record in production. We will also devise mechanisms to easily override the lock in case of manual failover.

As far as reboots are concerned, rebooting the passive host does not affect clients by definition. Rebooting the active host should result in a brief downtime for clients, which will resolve itself once the host is back. This assumes we want to "eat" the downtime and not perform a failover when we need to do rolling reboots.

For additional context, I took inspiration from the problems outlined in T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge

Event Timeline

taavi triaged this task as Medium priority. Dec 4 2025, 11:45 AM

Update from network sync meeting: hosting read-only NFS behind LVS, as we're going to do for http/rsync (T306550), should be explored as a solution; it would be easier and simpler to maintain than the shared IP. I need to investigate and test more, though at least conceptually read-only NFS should be fine from the client's POV.