On reviewing the setup obsessively, I realized that a locking issue will prevent smooth failover for maps in the event of an NFS server failover. As mentioned in T203469, NFS failover is not very good without shared cluster filesystems, so this isn't the only place where that is true. For now, I will document the failover procedure with the suggestion that the maps servers be rebooted afterward to clean up.
Since failover of the primary cluster now works, this task is about moving the maps setup to that model. The steps are:
- Evaluate and fix the issue of having a DRBD network interface/IP (significant work here)
- Get all the monitoring and scripts in place for DRBD (a minimal health check is sketched below)
- Set the standby server up as a DRBD primary system to act as the rsync target
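As a placeholder for the monitoring item above, here is a minimal sketch of the kind of health check the scripts might run. This is only an illustration: the real checks and alerting belong in puppet, drbd-utils is assumed to be installed, and scratch is just our running example resource.

```
#!/bin/bash
# Minimal DRBD health check sketch (Nagios-style exit codes).
# "scratch" is only an example resource name.
res="${1:-scratch}"
dstate=$(drbdadm dstate "$res" 2>/dev/null)   # e.g. "UpToDate/UpToDate"
cstate=$(drbdadm cstate "$res" 2>/dev/null)   # e.g. "Connected" or "StandAlone"
role=$(drbdadm role "$res" 2>/dev/null)       # e.g. "Primary/Secondary"

if [[ "$dstate" == UpToDate/* ]]; then
    echo "OK: $res role=$role cstate=$cstate dstate=$dstate"
    exit 0
fi
echo "CRITICAL: $res role=$role cstate=$cstate dstate=$dstate"
exit 2
```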
Once the puppet and interface are all set up (the full command sequence is also sketched after this list):
- Stop NFS and unmount the volume (scratch is our example)
- Destroy the volume just enough to let DRBD create its metadata: dd if=/dev/zero of=/dev/srv/scratch bs=1M count=128
- drbdadm create-md scratch
- Since this node is standalone for now, drbdadm up scratch and then drbdadm disconnect scratch just to be safe
- drbdadm primary scratch --force is needed before you can use/mount the volume
- You should now be able to create a filesystem (remember, you destroyed it): mkfs.ext4 /dev/drbd2
- Remove the old scratch entry from /etc/fstab and mount by hand: mount -o noatime /dev/drbd2 /srv/scratch
- Run the initial rsync from the current NFS server (see the example after this list)
- Stop the constant rsync after success
- Move scratch to use the cluster IP
- Fail over to the DRBD primary node
- Stop/remove the rsync process
- Set up the DRBD secondary on the old server (destructive step) and connect the cluster (see the last sketch below)
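For reference, the standalone bring-up on the standby collected into one block. It only restates the steps above; the NFS unit name is an assumption and the device paths follow the scratch example, so check both against the actual host.

```
# Standalone DRBD bring-up on the standby (scratch example).
systemctl stop nfs-kernel-server        # assumption: Debian NFS server unit name
umount /srv/scratch

# Destroy just enough of the old volume for create-md to succeed.
dd if=/dev/zero of=/dev/srv/scratch bs=1M count=128
drbdadm create-md scratch

# Bring the resource up, but stay standalone for now.
drbdadm up scratch
drbdadm disconnect scratch

# Force primary (there is no peer yet), then recreate and mount the filesystem.
drbdadm primary --force scratch
mkfs.ext4 /dev/drbd2
mount -o noatime /dev/drbd2 /srv/scratch   # and drop the old /etc/fstab entry
```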
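A sketch of the initial rsync, run from the current NFS server toward the standby. The host name and destination path are placeholders; the real transfer depends on how the rsync target on the standby is actually exposed.

```
# Initial data copy from the current server into the standby's DRBD-backed
# volume. "standby-host" and the destination path are placeholders.
rsync -aAHX --delete --numeric-ids /srv/scratch/ standby-host:/srv/scratch/
```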
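And a sketch of the final destructive step on the old server once it no longer serves scratch: wipe its stale copy, bring it up as the DRBD secondary, and let it resync from the new primary. Paths mirror the scratch example above.

```
# On the old server, after failover: destroy the stale copy and join as secondary.
dd if=/dev/zero of=/dev/srv/scratch bs=1M count=128
drbdadm create-md scratch
drbdadm up scratch          # comes up as Secondary by default

# Connect the resource (run wherever it is still StandAlone).
drbdadm connect scratch

# Watch the resync until the disk state reaches UpToDate/UpToDate.
drbdadm dstate scratch
```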