Databases on m5:
labsdbaccounts mailman3 mailman3web striker test_labsdbaccounts toolhub
When: Thursday 16th at 08:00 UTC
Impact: Writes will be disabled for around 1 minute.
Failover process
OLD MASTER: db1176
NEW MASTER: db1106
- Check configuration differences between new and old master
$ pt-config-diff h=db1106.eqiad.wmnet,F=/root/.my.cnf h=db1176.eqiad.wmnet,F=/root/.my.cnf
- Silence alerts on all hosts
sudo cookbook sre.hosts.downtime --minutes 60 -r "m5 master switch T331877" 'A:db-section-m5'
- Topology changes: move everything under db1106
db-switchover --timeout=15 --only-slave-move db1176.eqiad.wmnet db1106.eqiad.wmnet
- Disable puppet on db1106 and db1176
sudo cumin 'db1106* or db1176*' 'disable-puppet "primary switchover T331877"'
- Merge gerrit: https://gerrit.wikimedia.org/r/899571
- Run puppet on dbproxy1017 and dbproxy1021 and check the config
run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
- Start the failover: !log Failover m5 from db1176 to db1106 - T331877
- DB switchover
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1176 db1106
- Reload haproxies (dbproxy1021 is the active one)
dbproxy1017: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
- Kill connections on the old master (db1176)
pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
- Restart puppet on old and new masters (for heartbeat): db1106 and db1176
sudo cumin 'db1106* or db1176*' 'run-puppet-agent -e "primary switchover T331877"'
- Check affected services
- Clean orchestrator heartbeat to remove the old master's entry, otherwise Orchestrator will show lag
delete from heartbeat where server_id=171966607
- Upgrade db1176 to MariaDB 10.6 and create another ticket for the switchover back: T322294
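
For the haproxy reload step above, the raw "show stat" output is a wide CSV that is hard to read at a glance. The following is a minimal sketch of a helper, not part of the original procedure, that reduces it to one "proxy/server status" line per entry. The function name haproxy_status is hypothetical. It assumes the CSV header line starts with "# " and contains the pxname, svname and status columns, which it looks up by name instead of by position.

```shell
#!/bin/sh
# Hypothetical helper: summarise haproxy "show stat" CSV read from stdin.
# Assumes the first line is the "# pxname,svname,...,status,..." header.
haproxy_status() {
  awk -F, '
    NR == 1 {
      # Strip the leading "# " and map each column name to its index.
      sub(/^# /, "", $1)
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $0 == "" { next }   # skip the blank line that ends the output
    { print $(col["pxname"]) "/" $(col["svname"]) " " $(col["status"]) }
  '
}

# Usage on a proxy host (same socket path as in the reload step):
#   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status
```

After the reload, the new master should be the server reported as UP on the active proxy. Which status each backend line should show depends on the haproxy configuration merged earlier, so compare the summary against /etc/haproxy/conf.d/db-master.cfg and do not rely on this sketch alone.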