While reviewing the setup closely, I realized that a locking issue will prevent smooth failover for maps when the NFS server fails over. As mentioned in T203469, NFS failover does not work well without shared cluster filesystems, so this is not the only place where that is true. For now, I will document the failover procedure with a recommendation that the maps servers be rebooted afterward to clean up.
Because failover of the primary cluster now works, this task is now to move to that model. The steps are:
- Evaluate and fix the issue of having a DRBD network interface/IP (significant work here)
- Get all the monitoring and scripts in place for DRBD
- Set the standby server to be a DRBD primary system as an rsync target
- Move scratch to use the cluster IP if not already there
- Fail over to the DRBD primary node
- Stop/remove the rsync process
- Set up the DRBD secondary (destructive step) and connect the cluster
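
The ordering matters here, and only the last step is destructive. As a rough illustration, the plan above could be tracked as an ordered checklist that refuses to hand out the destructive step without explicit confirmation. This is only a sketch: the `Step` class, `PLAN` list, and `next_step` helper are hypothetical names made up for this example, and nothing in it touches DRBD, rsync, or NFS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    destructive: bool = False  # assumption: only the final step is flagged


# Steps copied from the list above, in the order given.
PLAN = [
    Step("Evaluate and fix the DRBD network interface/IP issue"),
    Step("Get all the monitoring and scripts in place for DRBD"),
    Step("Set the standby server to be a DRBD primary system as an rsync target"),
    Step("Move scratch to use the cluster IP if not already there"),
    Step("Fail over to the DRBD primary node"),
    Step("Stop/remove the rsync process"),
    Step("Set up the DRBD secondary and connect the cluster", destructive=True),
]


def next_step(completed: int, allow_destructive: bool = False) -> Step:
    """Return the next step, given how many steps are already done.

    Raises PermissionError for a destructive step unless explicitly allowed.
    """
    if not 0 <= completed < len(PLAN):
        raise IndexError("no remaining steps")
    step = PLAN[completed]
    if step.destructive and not allow_destructive:
        raise PermissionError(
            f"destructive step needs confirmation: {step.description}"
        )
    return step
```

Calling `next_step(6)` raises `PermissionError`, while `next_step(6, allow_destructive=True)` returns the final step, so the secondary cannot be set up by accident before the earlier steps are confirmed done.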