For the most part, doing maintenance on the hypervisor servers requires failing over the databases on these VMs, along with the DNS that points to them, to reduce the impact on Toolforge and CloudVPS users. However, there is communication to be done and there are specific steps to take. Those are documented here.
ToolsDB -- clouddb1001/2
This has unreplicated tables, which require some special effort and additional downtime for affected tools. Two of the unreplicated databases are only temporarily so (see T257274 and T257275), which will be cleared up by this change. It is worth reviewing the others to confirm that they are still active and still need to remain unreplicated, in case we can clear more of these barriers for the future.
When upgrading clouddb1002's hypervisor (cloudvirt1020), we will need to stop replication on clouddb1002 and stop mariadb before the downtime, but users won't be impacted.
- Make announcements.
- Put PAWS into sqlite mode so it keeps chugging.
- During the maintenance window, set ToolsDB to read-only.
- Dump all unreplicated tables to be restored on the other side. (Refer to https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb#Re-importing_data_after_replication_failures)
- Ensure replication looks healthy.
- Stop puppet on both sides.
- Merge patch that removes temporarily unreplicated tables.
- Import the unreplicated tables into clouddb1002.
- Save the needed replication settings.
- Fail over as described in https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Toolsdb#Failing_over_Toolsdb
- Set PAWS to use ToolsDB again.
- Stop mariadb on the host that is going down.
- Stop the VMs on the host that is going down (in this case that should also include clouddb1004, which is the OSMDB replica).
- At this point, re-image of cloudvirt1019 can proceed.
- Plan for clouddb1002 to stay on as the new primary, if possible.
- Establish replication back to clouddb1001 as the replica.
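The "set ToolsDB to read-only" and "ensure replication looks healthy" steps above could be checked with something like the following before failing over. This is only a minimal sketch, not the procedure from the linked wikitech pages. It assumes the replica's `SHOW SLAVE STATUS` row has already been fetched into a dict by whatever client is in use. The function names, the `max_lag_seconds` threshold and the idea of gating the failover on these checks are illustrative assumptions. The field names are the standard MariaDB status columns.

```python
from typing import List, Mapping, Optional


def replication_problems(
    status: Optional[Mapping[str, object]],
    max_lag_seconds: int = 0,  # hypothetical threshold; pick what suits the window
) -> List[str]:
    """Return human-readable problems found in a SHOW SLAVE STATUS row.

    An empty list means replication looks healthy by these (assumed) criteria.
    """
    if not status:
        return ["no replication status returned (is replication configured?)"]

    problems: List[str] = []

    # Both replication threads must be running.
    for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
        if status.get(thread) != "Yes":
            problems.append(f"{thread} is {status.get(thread)!r}, expected 'Yes'")

    # Seconds_Behind_Master is NULL (None) when replication is not running.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        problems.append("Seconds_Behind_Master is NULL")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replica is {lag}s behind (limit {max_lag_seconds}s)")

    # Any recorded error is worth a look before proceeding.
    for err in ("Last_IO_Error", "Last_SQL_Error"):
        if status.get(err):
            problems.append(f"{err}: {status[err]}")

    return problems


def ready_to_fail_over(
    primary_read_only: bool,
    replica_status: Optional[Mapping[str, object]],
) -> bool:
    """True only if the primary is read-only and the replica has caught up."""
    return primary_read_only and not replication_problems(replica_status)


if __name__ == "__main__":
    # Example row, hand-written for illustration (not real output).
    example = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "No",
        "Seconds_Behind_Master": None,
        "Last_IO_Error": "",
        "Last_SQL_Error": "",
    }
    for problem in replication_problems(example):
        print(problem)
```

Requiring zero lag by default reflects the fact that the primary is already read-only at this point in the list, so the replica should be able to catch up fully before the switch. Loosen the threshold if that assumption does not hold.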
OSMDB -- clouddb1003/4
Instead of failover, let's just upgrade this with the DB down. I don't think there is as strong an expectation of uptime for this service.
- Stop the database and shut down the VM