db1164 is in row B, and there will be switch maintenance with hard downtime.
When: Tuesday, 21st March at 07:00 UTC
Impact: read-only for a few seconds for the services below:
Services running on m1:
* bacula
* cas (and cas staging)
* backups
* etherpad
* librenms
* pki
* rt
Switchover steps:
OLD MASTER: db1164
NEW MASTER: db1101
[x] Check configuration differences between the new and old master: `pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1101.eqiad.wmnet,F=/root/.my.cnf`
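`pt-config-diff` exits zero when the two configurations match, so a non-zero exit means differences to review. As an extra sanity check (not in the original checklist), the role of each host can be confirmed with the same `db-mysql` wrapper used later in these steps:
```
# Before the switchover, db1101 should still be a read-only replica
# (read_only = 1) and db1164 the writable master (read_only = 0).
sudo db-mysql db1164 -e "SELECT @@hostname, @@read_only;"
sudo db-mysql db1101 -e "SELECT @@hostname, @@read_only;"
```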
[ ] Silence alerts on all hosts
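A sketch of how this is typically done from a cumin host, assuming the `sre.hosts.downtime` cookbook; the 2-hour window and host query are illustrative:
```
sudo cookbook sre.hosts.downtime --hours 2 \
    -r "m1 primary switchover T331510" 'db1164* or db1101*'
```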
[ ] Topology changes: move everything under db1101
`db-switchover --timeout=1 --only-slave-move db1164.eqiad.wmnet db1101.eqiad.wmnet`
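Once the move completes, the replicas should all hang directly under db1101; a quick check using standard MariaDB commands via `db-mysql`:
```
# Every m1 replica (except db1164 itself, still the master at this point)
# should now appear under db1101.
sudo db-mysql db1101 -e "SHOW SLAVE HOSTS;"
```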
[ ] Disable puppet @db1101 and @db1164
`sudo cumin 'db1164* or db1101*' 'disable-puppet "primary switchover T331510"'`
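`disable-puppet` drops a lock file carrying the given message; to confirm it took effect on both hosts (the lock file path is an assumption based on stock Puppet defaults):
```
# The "primary switchover T331510" message should show up in the lock file.
sudo cumin 'db1164* or db1101*' 'cat /var/lib/puppet/state/agent_disabled.lock'
```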
[ ] Merge gerrit: TBD
[ ] Run puppet on dbproxy1012 and dbproxy1014 and check the config
`run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg`
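With the change merged and puppet run, the proxy config should already reference the new master (haproxy keeps serving the old one until the reload step below); a quick grep of the same file:
```
# Expect db1101.eqiad.wmnet in the backend and no db1164 entries left.
grep 'db1101' /etc/haproxy/conf.d/db-master.cfg
grep 'db1164' /etc/haproxy/conf.d/db-master.cfg || echo "no stale db1164 entries"
```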
[x] Start the failover
`!log Failover m1 from db1164 to db1101 - T331510`
```
root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1164 db1101
```
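After `db-switchover` finishes, the roles should have flipped; a minimal verification via `db-mysql`:
```
# db1101 should now be writable (read_only = 0) and db1164 replicating from it.
sudo db-mysql db1101 -e "SELECT @@hostname, @@read_only;"
sudo db-mysql db1164 -e "SHOW SLAVE STATUS\G" | grep -E 'Master_Host|Seconds_Behind_Master'
```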
[ ] Reload haproxies
```
dbproxy1012: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
```
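`show stat` returns the standard HAProxy CSV, so the same command can be filtered down to proxy name, server name, and status (fields 1, 2, and 18) to confirm the backend is UP and pointing at db1101:
```
echo "show stat" | socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18
```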
[ ] Kill connections on the old master (db1164)
`pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock`
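`--victims all --match-all` kills every connection on the socket, so a cautious extra step (not in the original list) is a dry run with `--print` only, to review what would be killed:
```
# Dry run: print the connections that would be killed, without --kill.
pt-kill --print --victims all --match-all --run-time 10s \
    F=/dev/null,S=/run/mysqld/mysqld.sock
```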
[ ] Re-enable and run puppet on the old and new masters (for heartbeat): db1164 and db1101
`sudo cumin 'db1164* or db1101*' 'run-puppet-agent -e "primary switchover T331510"'`
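The puppet run should leave heartbeat writing from the new master; a follow-up check, assuming the WMF `pt-heartbeat-wikimedia` unit name:
```
# pt-heartbeat should be active on the new master.
sudo cumin 'db1101*' 'systemctl is-active pt-heartbeat-wikimedia'
```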
[ ] Check the affected services (librenms, etherpad, rt, ...)
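A quick smoke test for the HTTP-facing services (public URLs; any response, even a redirect to login, shows the path through the proxy works):
```
curl -sI https://etherpad.wikimedia.org | head -1
curl -sI https://librenms.wikimedia.org | head -1
```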
[ ] Clean orchestrator heartbeat to remove the old master's entry: `sudo db-mysql db1101 heartbeat -e "delete from heartbeat where file like 'db1164%';"`
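And to confirm only the new master's row remains in the same `heartbeat` table:
```
sudo db-mysql db1101 heartbeat -e "SELECT server_id, file FROM heartbeat;"
```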
[ ] Merge backup ticket: TBD
[ ] Decommission db1164: T331381
[ ] Update/resolve the Phabricator ticket about the failover