The current process for a master-replica switchover (e.g. https://phabricator.wikimedia.org/T409818) requires multiple manual steps.
This task is to add automation to:
- Add new safety checks around host and section health
- Replace copypasting during the switchover process
- Leverage data now available in the Zarcillo DB
- Enable extensive unit/functional tests (pytest)
- Enable end-to-end integration tests on the testbed (T400056)
- Have it fully documented on wikitech so it can be used by any op
- Have a dry-run feature that goes over each step without actually changing anything
- Provide timestamps for each step executed
- Report the total read_only time at the MySQL level
- More pre-flight checks, such as:
  - Is pt-heartbeat running on the current master?
- Make heartbeat migration more robust until it is migrated to a systemd service or moved remotely (so it is automatic and etcd-dependent)
- Alter or check some master-related variables automatically (pt-config-diff h=localhost /etc/my.cnf ?):
  - Alter the expire_logs_days variable
  - Alter gtid mode automatically
  - Alter semi-sync automatically
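
The dry-run, per-step timestamp and read_only timing items above could be combined in one small step runner. The following is only a sketch under assumptions: `SwitchoverRun`, `StepRecord`, `step` and `read_only_window` are hypothetical names, not part of any existing tool, and the read_only duration is measured on the orchestrator side, which only approximates the time seen at the MySQL level.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Iterator, List


@dataclass
class StepRecord:
    name: str
    started_at: datetime
    finished_at: datetime
    executed: bool  # False when a mutating step was skipped by dry-run


@dataclass
class SwitchoverRun:
    """Hypothetical runner: records a timestamp per step, honours dry-run."""
    dry_run: bool = True
    records: List[StepRecord] = field(default_factory=list)
    read_only_seconds: float = 0.0

    def step(self, name: str, action: Callable[[], None], mutating: bool = True) -> bool:
        started = datetime.now(timezone.utc)
        # Read-only checks always run; mutating steps run only outside dry-run.
        executed = not (self.dry_run and mutating)
        if executed:
            action()
        self.records.append(
            StepRecord(name, started, datetime.now(timezone.utc), executed)
        )
        return executed

    @contextmanager
    def read_only_window(self) -> Iterator[None]:
        # Accumulates wall-clock time spent inside the read_only phase.
        start = time.monotonic()
        try:
            yield
        finally:
            self.read_only_seconds += time.monotonic() - start

    def report(self) -> List[str]:
        lines = []
        for r in self.records:
            status = "done" if r.executed else "skipped (dry-run)"
            lines.append(f"{r.started_at.isoformat()} {r.name}: {status}")
        lines.append(f"total read_only time: {self.read_only_seconds:.3f}s")
        return lines


# Usage: the lambdas stand in for real actions against the database hosts.
changes = []
run = SwitchoverRun(dry_run=True)
run.step("check pt-heartbeat on current master",
         lambda: changes.append("checked"), mutating=False)
with run.read_only_window():
    run.step("set read_only on old master", lambda: changes.append("read_only=1"))
    run.step("promote new master", lambda: changes.append("promoted"))
print("\n".join(run.report()))
```

In dry-run mode only the non-mutating check is executed, while every step still gets a timestamped entry in the report.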
The goal is to improve testability and confidence in the automation, so that we can then implement switchover when the old master is unreachable (T196366) and, later on, emergency failover (T384810).
Incremental implementation + test progress:
- functional test
- run against test-s4 section
- run against prod on secondary DC
- run against prod primary DC
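
As an illustration of the first stage, a functional test can target a pure function that needs no database, for example the check of master-related variables. This is a minimal sketch under assumptions: `diff_variables` is a hypothetical helper, and the expected values are placeholders rather than the real production settings. The test functions use plain asserts, so pytest would collect them as-is.

```python
from typing import Dict, List, Optional, Tuple

# Placeholder expectations for a master; real values would come from config.
EXPECTED_MASTER_VARIABLES: Dict[str, str] = {
    "expire_logs_days": "30",
    "rpl_semi_sync_master_enabled": "ON",
}

Mismatch = Tuple[str, str, Optional[str]]  # (variable, expected, actual)


def diff_variables(expected: Dict[str, str], actual: Dict[str, str]) -> List[Mismatch]:
    """Return every variable whose actual value differs from the expected one.

    A variable missing from `actual` is reported with an actual value of None.
    """
    mismatches: List[Mismatch] = []
    for name in sorted(expected):
        got = actual.get(name)
        if got != expected[name]:
            mismatches.append((name, expected[name], got))
    return mismatches


def test_no_mismatch_when_values_agree():
    actual = dict(EXPECTED_MASTER_VARIABLES)
    assert diff_variables(EXPECTED_MASTER_VARIABLES, actual) == []


def test_reports_wrong_and_missing_values():
    actual = {"expire_logs_days": "7"}  # semi-sync variable is absent
    assert diff_variables(EXPECTED_MASTER_VARIABLES, actual) == [
        ("expire_logs_days", "30", "7"),
        ("rpl_semi_sync_master_enabled", "ON", None),
    ]
```

Keeping the comparison separate from the code that queries the server means the same function can later be reused unchanged in the runs against test-s4 and production.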