Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5
Open, Medium, Public

Description

On reviewing the setup obsessively, I realized that a locking issue will prevent a smooth failover for maps if the NFS server fails over. As mentioned in T203469, NFS failover does not work well without a shared cluster filesystem, so this isn't the only place where that is true. For now, I will document the failover procedure as recommending that the maps servers be rebooted afterward to clean up.

Because failover of the primary cluster now works, this task is now to move this cluster to that model. The steps are:

  1. Evaluate and fix the issue of having a DRBD network interface/IP (significant work here)
  2. Get all the monitoring and scripts in place for DRBD
  3. Set up the standby server as a DRBD primary system to act as an rsync target
  4. Move scratch to use the cluster IP if not already there
  5. Fail over to the DRBD primary node
  6. Stop/remove the rsync process
  7. Set up the DRBD secondary (destructive step) and connect the cluster
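
As an illustration of the kind of check step 2 calls for, here is a minimal sketch of a DRBD health check. It is not the monitoring actually deployed for this cluster. It assumes the DRBD 8.x `/proc/drbd` status format, and the function names `parse_proc_drbd` and `check_device`, as well as the sample text, are hypothetical.

```python
import re
from typing import Dict, List

# Matches a DRBD 8.x status line (assumed format), for example:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)"
    r"\s+ro:(?P<local_role>\w+)/(?P<peer_role>\w+)"
    r"\s+ds:(?P<local_disk>\w+)/(?P<peer_disk>\w+)"
)


def parse_proc_drbd(text: str) -> List[Dict[str, str]]:
    """Return one dict per device line; lines without cs/ro/ds fields are skipped."""
    devices = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            devices.append(match.groupdict())
    return devices


def check_device(dev: Dict[str, str], expected_role: str) -> List[str]:
    """Return a list of problems for one device; an empty list means healthy."""
    problems = []
    if dev["cs"] != "Connected":
        problems.append(f"minor {dev['minor']}: connection state is {dev['cs']}")
    if dev["local_role"] != expected_role:
        problems.append(
            f"minor {dev['minor']}: role is {dev['local_role']}, expected {expected_role}"
        )
    if dev["local_disk"] != "UpToDate":
        problems.append(f"minor {dev['minor']}: local disk is {dev['local_disk']}")
    return problems


# Hypothetical sample: device 0 is healthy, device 1 has lost its peer.
SAMPLE = """version: 8.4.10 (api:1/proto:86-101)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
"""

if __name__ == "__main__":
    for device in parse_proc_drbd(SAMPLE):
        for problem in check_device(device, expected_role="Primary"):
            print(problem)
```

On a real host the text would come from reading `/proc/drbd` rather than from `SAMPLE`, and the expected role would differ between the active and standby servers.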

Event Timeline

Bstorm triaged this task as Medium priority. May 31 2019, 6:46 PM
Bstorm created this task.
Bstorm added a comment. Jun 5 2019, 6:38 PM

Active/active DRBD requires detached storage. Overall, failover with NFS attached is never going to look very good. A clever solution may be possible once Ceph is deployed.

Bstorm changed the task status from Open to Stalled. Aug 6 2019, 1:57 PM

Waiting on this in the hope that one of the two volumes could be replaced with something else, such as CephFS.

Bstorm removed Bstorm as the assignee of this task. Sep 25 2019, 3:54 PM

Removing myself because cookie-licking is bad when I'm not working on it.

Bstorm changed the task status from Stalled to Open. Jun 11 2020, 11:17 PM

I think we know two things about this right now:

  1. This isn't going on CephFS directly.
  2. We should convert this cluster to using DRBD like the primary NFS system. It will work so much better.

Bstorm claimed this task. Jun 22 2020, 10:59 PM
Bstorm moved this task from Soon! to Doing on the cloud-services-team (Kanban) board.

Change 607142 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] cloud nfs: clean up some of the secondary cluster materials

https://gerrit.wikimedia.org/r/607142

Bstorm renamed this task from Improve the failover mechanism for maps on cloudstore1008/9 to Move maps and scratch on cloudstore1008/9 to a DRBD failover similar to labstore1004/5. Jun 22 2020, 11:08 PM
Bstorm updated the task description.

Change 607142 merged by Bstorm:
[operations/puppet@production] cloud nfs: clean up some of the secondary cluster materials

https://gerrit.wikimedia.org/r/607142