For the most part, doing maintenance on the hypervisor servers requires failing over the databases, and the DNS that points to them, to reduce the impact on Toolforge and CloudVPS users. However, there is communication to be done and there are specific steps to take; those are documented here.
**ToolsDB -- clouddb1001/2**
ToolsDB has unreplicated tables, which require some special effort and additional downtime for the affected tools. Two of the unreplicated databases are only temporarily unreplicated (see T257274 and T257275) and will be cleared up by this change. It is worth reviewing the others to make sure they are still active and still need to remain unreplicated, in case we can clear more of these barriers in the future.
[] Make announcements.
[] During downtime, stop tools connected to the unreplicated tables, including all cron jobs.
[] List the affected tools here.
[] Dump all unreplicated tables so they can be restored on the other side (see the dump sketch after this list).
[] Ensure replication looks healthy (see the status check after this list).
[] Stop puppet on both sides (see the shutdown sketch after this list).
[] Fail over following https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb#Failing_over_Toolsdb
[] Stop mariadb on the host that is going down.
[] Stop the VM on the host that is going down.
[] Presuming that failing back would be the more difficult operation, settle in for clouddb1002 staying on as the new primary if possible.
[] Re-establish replication with clouddb1001 as the replica (see the replication sketch after this list).
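A minimal sketch of the dump step, assuming client credentials are already configured on the host (e.g. in /root/.my.cnf); the database names and dump path are placeholders to be filled in from the tool list above:

```bash
# Run on the current primary (clouddb1001) before the failover.
# The names below are placeholders; substitute the real unreplicated databases.
UNREPLICATED_DBS="s00000__example_p s00001__other_p"

for db in $UNREPLICATED_DBS; do
    # --single-transaction takes a consistent snapshot for InnoDB tables
    # without a global lock; --routines/--triggers keep stored code too.
    sudo mysqldump --single-transaction --routines --triggers "$db" \
        > "/srv/dumps/${db}.sql"
done
```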
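For the replication health check, the standard MariaDB status output covers it; a sketch:

```bash
# On the replica: both threads should say Yes, lag should be ~0,
# and the error fields should be empty.
sudo mysql -e 'SHOW SLAVE STATUS\G' | \
    grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master|Last_.*Err'
```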
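The puppet/MariaDB/VM shutdown steps, sketched; the OpenStack server name is an assumption (use the real instance name or UUID):

```bash
# On both clouddb1001 and clouddb1002: keep puppet from restarting
# services or reverting config mid-failover.
sudo puppet agent --disable 'clouddb failover - <your name>'

# On the host going down: stop MariaDB cleanly before stopping the VM.
sudo systemctl stop mariadb

# From a host with OpenStack credentials for the project:
openstack server stop clouddb1001
```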
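Re-establishing replication back to clouddb1001 would look roughly like the following; the host, user, password, and binlog coordinates are all placeholders, to be read from SHOW MASTER STATUS on clouddb1002 at a consistent point (with GTID replication, MASTER_USE_GTID=slave_pos replaces the explicit coordinates):

```bash
# On clouddb1001, once it is back up and restored.
sudo mysql <<'SQL'
STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'clouddb1002',          -- assumed hostname
    MASTER_USER = 'repl',                 -- assumed replication user
    MASTER_PASSWORD = '********',
    MASTER_LOG_FILE = 'log-bin.000001',   -- placeholder
    MASTER_LOG_POS = 4;                   -- placeholder
START SLAVE;
SQL

# Confirm both threads come up cleanly.
sudo mysql -e 'SHOW SLAVE STATUS\G'
```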
**OSMDB -- clouddb1003/4**
This should be a straight PostgreSQL failover.
[] Document how you do this.
[] Make announcements.
[] Validate from the logs that replication is up to date before doing the failover (see the lag check after this list).
[] Stop puppet on both servers.
[] Prepare the DNS patch.
[] Fail over, merging the DNS patch as close to concurrently as you can; most users are read-only and won't notice the difference (see the promotion sketch after this list).
[] Establish the failed-over host as the new primary in puppet and elsewhere to get things replicating again. Failing back will be necessary here, since the ToolsDB service should not be forced to fail back.
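A sketch of the replication check; the exact column names in pg_stat_replication vary by PostgreSQL version (*_location before 10, *_lsn from 10 on), so adjust to the installed version:

```bash
# On the primary (clouddb1003): each replica should be 'streaming'
# with sent/replay positions equal or nearly so.
sudo -u postgres psql -c \
    'SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;'

# On the replica (clouddb1004): confirm it is in recovery and see
# how recently it replayed WAL.
sudo -u postgres psql -c \
    'SELECT pg_is_in_recovery(), pg_last_xact_replay_timestamp();'
```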
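And the promotion itself, sketched under the assumption that these are Debian-packaged clusters managed through pg_ctlcluster; the cluster version and name below are placeholders (check pg_lsclusters first), and older setups may instead drive failover via a trigger file configured in recovery.conf:

```bash
# On the replica being promoted (clouddb1004).
pg_lsclusters                        # find the real version/cluster name
sudo pg_ctlcluster 9.6 main promote  # version and name are placeholders

# Verify: this should now return 'f' (false).
sudo -u postgres psql -c 'SELECT pg_is_in_recovery();'
```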