Steps and checklist:
Preparation
NEW master: db1104
OLD master: db1109
- Check current topology: db-replication-tree db1109
- Check configuration differences between new and old master: pt-config-diff h=db1109.eqiad.wmnet,F=/root/.my.cnf h=db1104.eqiad.wmnet,F=/root/.my.cnf
- Silence alerts on all hosts: cookbook sre.hosts.downtime --hours 1 --reason "switchover to db1104 T239238" '(A:db-section-s8 and A:eqiad) or A:db-labsdb'
- Set NEW master with weight 0: dbctl instance db1104 set-weight 0 && dbctl config commit -m "Set db1104 with weight 0 T239238"
- Topology changes, connect everything to db1104: db-switchover --timeout=15 --only-slave-move db1109.eqiad.wmnet db1104.eqiad.wmnet
- Disable puppet @db1104 and @db1109: cumin 'db110[4,9].eqiad.wmnet' 'disable-puppet "switchover to db1104 T239238"'
- Merge gerrit puppet change to promote db1104: TBD
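The preparation commands above repeat the same two hostnames, section and task ID, so a mistyped host is easy to miss. As a rough sketch only, the commands could be rendered from one set of parameters and reviewed before being pasted. The `Switchover` class and `preparation_commands` function are hypothetical names, not part of any existing tooling, and the sketch assumes FQDNs follow the `<host>.<datacenter>.wmnet` pattern seen in the checklist. The cumin and gerrit steps are left out on purpose.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Switchover:
    """Parameters of one switchover (hypothetical helper, for illustration)."""
    old: str        # current master, short hostname
    new: str        # host to promote, short hostname
    section: str    # e.g. "s8"
    task: str       # Phabricator task ID
    dc: str = "eqiad"

    def fqdn(self, host: str) -> str:
        # Assumption: FQDNs look like <host>.<dc>.wmnet, as in the checklist.
        return f"{host}.{self.dc}.wmnet"


def preparation_commands(s: Switchover) -> List[str]:
    """Return the preparation commands from the checklist, in order."""
    cnf = "F=/root/.my.cnf"
    reason = f'"switchover to {s.new} {s.task}"'
    hosts = f"'(A:db-section-{s.section} and A:{s.dc}) or A:db-labsdb'"
    return [
        f"db-replication-tree {s.old}",
        f"pt-config-diff h={s.fqdn(s.old)},{cnf} h={s.fqdn(s.new)},{cnf}",
        f"cookbook sre.hosts.downtime --hours 1 --reason {reason} {hosts}",
        f"dbctl instance {s.new} set-weight 0",
        f'dbctl config commit -m "Set {s.new} with weight 0 {s.task}"',
        "db-switchover --timeout=15 --only-slave-move "
        f"{s.fqdn(s.old)} {s.fqdn(s.new)}",
    ]


if __name__ == "__main__":
    # Print only; nothing is executed.
    plan = Switchover(old="db1109", new="db1104", section="s8", task="T239238")
    for cmd in preparation_commands(plan):
        print(cmd)
```

The script only prints text. Running the commands stays a manual, reviewed step.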
Failover:
- Start the failover: !log Starting s8 eqiad failover from db1109 to db1104 - T239238
- Topology changes, move old master beneath new master: db-switchover --replicating-master --read-only-master db1109 db1104
- Give weight to db1109 (old master): dbctl instance db1109 set-weight 300
- Promote db1104 as new master: dbctl --scope eqiad section s8 set-master db1104 && dbctl config commit -m "Promote db1104 on s8 eqiad master T239238"
- Restart puppet on old and new masters (for heartbeat): cumin 'db110[4,9].eqiad.wmnet' 'run-puppet-agent -e "switchover to db1104 T239238"'
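The failover steps above are written to be run in sequence. A minimal sketch of a runner that stops at the first failing command, instead of carrying on to the next step, might look like the following. `run_in_order` and `shell_execute` are hypothetical names. The default is a dry run that only prints, and each list entry is assumed to be a single command with no `&&` chaining.

```python
import shlex
import subprocess
from typing import Callable, List, Optional, Sequence


def shell_execute(command: str) -> int:
    """Run one command (no shell operators such as &&) and return its exit status."""
    return subprocess.run(shlex.split(command)).returncode


def run_in_order(
    commands: Sequence[str],
    execute: Optional[Callable[[str], int]] = None,
) -> List[str]:
    """Run commands one by one and stop at the first non-zero exit status.

    Returns the commands that completed. With execute=None this is a dry
    run: every command is printed and treated as successful.
    """
    completed: List[str] = []
    for command in commands:
        print(f"+ {command}")
        status = 0 if execute is None else execute(command)
        if status != 0:
            print(f"stopping: exit status {status}")
            break
        completed.append(command)
    return completed


if __name__ == "__main__":
    # Commands copied from the failover steps; dry run only.
    failover = [
        "db-switchover --replicating-master --read-only-master db1109 db1104",
        "dbctl instance db1109 set-weight 300",
        "dbctl --scope eqiad section s8 set-master db1104",
        'dbctl config commit -m "Promote db1104 on s8 eqiad master T239238"',
    ]
    run_in_order(failover)
```

Passing `execute=shell_execute` would run the commands for real. Whether that is appropriate for a production switchover is a judgment call this sketch does not make.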
Clean up tasks:
- Change events for query killer:
  events_coredb_master.sql on the new master db1104
  events_coredb_slave.sql on the new slave db1109
- Update DNS: TBD
- Update/resolve Phabricator ticket about failover: TBD
- Update candidate master dbctl notes and pick new candidate master: db1109
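For the query killer events step above, the roles are easy to swap by accident, because the old master is now the replica. A small sketch of the mapping, using only the file names from the checklist, is shown here. `events_plan` is a hypothetical helper, and how the SQL files are actually loaded is not covered.

```python
from typing import Dict

# File names as given in the checklist, keyed by the host's role AFTER the switchover.
EVENTS_BY_ROLE: Dict[str, str] = {
    "master": "events_coredb_master.sql",
    "slave": "events_coredb_slave.sql",
}


def events_plan(old_master: str, new_master: str) -> Dict[str, str]:
    """Map each host to the events file matching its new role."""
    if old_master == new_master:
        raise ValueError("old and new master must be different hosts")
    return {
        new_master: EVENTS_BY_ROLE["master"],
        old_master: EVENTS_BY_ROLE["slave"],
    }


if __name__ == "__main__":
    for host, sql_file in events_plan("db1109", "db1104").items():
        print(f"{host}: {sql_file}")
```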