Implement automation to depool and repool replicas as needed, manage role changes between master, candidate master and replica, and ultimately automate master failover.
Requirements:
- Track where and when ground truth comes from
- Hard/soft distributed locks per section / DC to prevent conflicting failover and maintenance activities
- Support multiple candidates and select based on priority
- Support topology-aware host selection (rack, ToR switch, etc.)
- Automated failover/depooling with velocity checks
- The critical path must not depend on lower-tier services
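To make the candidate-selection and velocity-check requirements concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration and does not come from the design doc: the `Candidate` fields, the convention that a higher `priority` value wins, the rule that a candidate on a different rack from the failed master is preferred over priority, and the reading of "velocity check" as a sliding-window limit on automated actions.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate-master record (field names are assumptions)."""
    host: str
    priority: int  # assumed convention: higher value is preferred
    rack: str


def select_candidate(candidates, failed_master_rack):
    """Return the preferred candidate, or None if there are none.

    Assumed ordering: a candidate on a different rack from the failed
    master comes first, then higher priority, then hostname as a
    deterministic tie-breaker.
    """
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda c: (c.rack == failed_master_rack, -c.priority, c.host),
    )


class VelocityCheck:
    """Sliding-window limit on how many automated actions may run."""

    def __init__(self, max_actions, window_seconds):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of allowed actions, oldest first

    def allow(self, now=None):
        """Record and allow the action if the window is not yet full."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self._events and now - self._events[0] >= self.window_seconds:
            self._events.popleft()
        if len(self._events) >= self.max_actions:
            return False
        self._events.append(now)
        return True


if __name__ == "__main__":
    pool = [
        Candidate("db-a1", 10, "A"),
        Candidate("db-a2", 20, "A"),
        Candidate("db-b1", 5, "B"),
    ]
    print(select_candidate(pool, failed_master_rack="A"))
    limiter = VelocityCheck(max_actions=2, window_seconds=60)
    print([limiter.allow(now=t) for t in (0, 1, 2, 60)])
```

Whether topology should outrank priority, and what a refused action should trigger (for example paging a human instead of depooling), are open design choices that this sketch does not settle.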
Initial design doc: https://docs.google.com/document/d/17ApZIOSyGP2kmMbhOgXT-9--_OrRaNCuSpThFPGW6eo/edit?tab=t.0
Roadmap:
- Create initial source of truth (see https://docs.google.com/document/d/1bwS0JMZ2gi6bNTbzt92WU0mxzIaR0ASJ_MVvD49VzaY/edit?tab=t.0)
- Initial dashboard T384212
- Replica depool/repool running on testbed
- Dry-run depool/repool on prod
- Operational status management (provision, decommission, upgrade...) on testbed
- Replica depool/repool running on prod
- Host upgrades (implement T239814) on testbed
- Master failover on testbed
- Operational status management on prod
- Host upgrades on prod
- Master failover on prod
