db1164 is in row B, which will undergo a switch maintenance with hard downtime.
When: Monday 27th March at 08:00 AM UTC
Impact: Read-only for a few seconds on the services below:
Services running on m1:
- bacula
- cas (and cas staging)
- backups
- etherpad
- librenms
- pki
- rt
Switchover steps:
OLD MASTER: db1164
NEW MASTER: db1101
- Check configuration differences between new and old master
pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1101.eqiad.wmnet,F=/root/.my.cnf
- Silence alerts on all hosts
- Topology changes: move everything under db1101
db-switchover --timeout=1 --only-slave-move db1164.eqiad.wmnet db1101.eqiad.wmnet
- Disable puppet on db1101 and db1164
sudo cumin 'db1164* or db1101*' 'disable-puppet "primary switchover T331510"'
- Merge gerrit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/902572/
- Run puppet on dbproxy1012 and dbproxy1014 and check the config
run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
- Start the failover
!log Failover m1 from db1164 to db1101 - T331510
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1164 db1101
- Reload haproxies
dbproxy1012: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
- Kill connections on the old master (db1164)
pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
- Restart puppet on the old and new masters (for heartbeat): db1101 and db1164
sudo cumin 'db1164* or db1101*' 'run-puppet-agent -e "primary switchover T331510"'
- Check services affected (librenms, racktables, etherpad...)
- Clean orchestrator heartbeat to remove the old master's entry
sudo db-mysql db1101 heartbeat -e "delete from heartbeat where file like 'db1164%';"
- Merge backup patch: https://gerrit.wikimedia.org/r/903175
- Create a ticket to fail this back to db1164 the day after, once T330165 (eqiad row B switches upgrade) is done. The failback is needed so that T331882 (eqiad row C switches upgrade) can happen, as db1101 is in row C: T333123: Switchover m1 master (db1101 -> db1164)
- Update/resolve Phabricator ticket about failover
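
Optional helper for the "Reload haproxies" step: the CSV printed by "show stat" is wide and hard to scan by eye. The following is a minimal sketch, assuming awk is available on the dbproxy hosts. haproxy_status is a hypothetical helper name, not part of any existing tooling. It looks up the "status" column by name in the CSV header and prints one line per proxy/server pair.

```shell
#!/bin/sh
# Hypothetical helper (not existing tooling): summarise HAProxy "show stat"
# CSV read from stdin as "<proxy>/<server> <status>" lines.
haproxy_status() {
    awk -F, '
        # Header line looks like "# pxname,svname,...,status,...".
        /^# / {
            sub(/^# /, "", $0)            # strip the marker; awk re-splits the fields
            for (i = 1; i <= NF; i++)
                if ($i == "status") col = i   # find the status column by name
            next
        }
        # Data rows: skip blank lines and rows too short to hold a status field.
        col && NF >= col { print $1 "/" $2, $col }
    '
}
```

It could be used with the same socket command as in the step above:
echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status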