db1115 needs to be decommissioned.
Affected wikis: None
Affected services: Orchestrator, Prometheus, zarcillo
Checklist:
NEW primary: db1215
OLD primary: db1115
- Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1115.eqiad.wmnet h=db1215.eqiad.wmnet
- Reboot db1215 so it picks up the latest kernel
Failover prep:
- Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover db_inventory T335014" 'A:db-inventory'
- Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1115 db1215
- Disable puppet on both nodes
sudo cumin 'db1115* or db1215*' 'disable-puppet "primary switchover T335014"'
- Merge gerrit puppet change to promote NEW primary: https://gerrit.wikimedia.org/r/c/operations/puppet/+/917323/
Failover:
- Log the failover:
!log Starting db-inventory eqiad failover from db1115 to db1215 - T335014
- Switch primaries:
db-mysql db1115 -e "set global read_only=1;"
db1115: systemctl stop pt-heartbeat-wikimedia.service
db-mysql db1215 -e "show master status\G" # grab the position
db-mysql db1215 -e "stop slave; reset slave all;"
db-mysql db1115 -e "change master to master_host='db1215.eqiad.wmnet', master_port=3306, master_ssl=1, master_log_file='XX', master_log_pos=XX, master_user='repl', master_password='XX'; start slave;"
db-mysql db1115 -e "show slave status\G"
db-mysql db1215 -e "set global read_only=0;"
db-mysql db1115 -e "STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;"
- Restart puppet on both hosts:
sudo cumin 'db1115* or db1215*' 'run-puppet-agent -e "primary switchover T335014"'
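The "Switch primaries" step above requires copying the binlog file and position from the `show master status\G` output into the `change master to` statement by hand. The following is a minimal sketch of how that copy could be scripted to reduce transcription errors. The function names `parse_master_status` and `build_change_master` are hypothetical and are not part of any existing tooling; the binlog file name in the usage comment is an invented sample value, and the password is deliberately left as the `XX` placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of existing tooling.

# Read `show master status\G` output on stdin and print "<file> <position>".
# Returns non-zero if either field is missing from the input.
parse_master_status() {
    awk -F': *' '
        /^ *File:/     { file = $2 }
        /^ *Position:/ { pos = $2 }
        END {
            if (file == "" || pos == "") exit 1
            print file, pos
        }
    '
}

# Print the replication statement for the OLD primary, given the NEW
# primary's FQDN, binlog file and binlog position.
# The password is left as the XX placeholder and must be filled in by hand.
build_change_master() {
    local host="$1" file="$2" pos="$3"
    printf "change master to master_host='%s', master_port=3306, master_ssl=1, master_log_file='%s', master_log_pos=%s, master_user='repl', master_password='XX'; start slave;\n" \
        "$host" "$file" "$pos"
}

# Possible usage (sample values are assumptions):
#   coords="$(db-mysql db1215 -e 'show master status\G' | parse_master_status)"
#   build_change_master db1215.eqiad.wmnet ${coords}
```

The generated statement is only printed, never executed, so it can be reviewed before being passed to `db-mysql db1115 -e`.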
Clean up tasks:
- Clean up heartbeat table(s).
sudo db-mysql db1215 heartbeat -e "delete from heartbeat where file like 'db1115%';"
- Check zarcillo was updated
- Needs to be done manually: https://phabricator.wikimedia.org/P13956
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'tendril';"
- Update all the services at: T334455
- Update/resolve this ticket.
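For the heartbeat clean-up step above, the `delete` matches rows with a `like 'db1115%'` pattern, so an empty or mistyped hostname could match far more rows than intended. The following is a minimal sketch of a guard that validates the hostname and prints a preview query alongside the delete. The function name `heartbeat_cleanup_sql` is hypothetical, and the check assumes hosts are named `db` followed by digits only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a preview query and the delete statement for
# heartbeat rows written by the old primary. Nothing is executed here.
heartbeat_cleanup_sql() {
    local old="$1"
    local digits="${old#db}"
    # Assumption: database hosts are named "db" followed by digits only.
    # Rejecting anything else stops an empty argument from yielding "like '%'".
    if [ "$digits" = "$old" ]; then
        echo "refusing hostname without db prefix: '$old'" >&2
        return 1
    fi
    case "$digits" in
        ''|*[!0-9]*)
            echo "refusing unexpected hostname: '$old'" >&2
            return 1
            ;;
    esac
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$old"
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}

# Possible usage: review the count first, then run the delete.
#   heartbeat_cleanup_sql db1115
```

Running the `select count(*)` line first shows how many rows the delete would remove before anything is changed.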