Change Details

Databases on m3: `phabricator`
When: TBD
**Impact: Writes will be disabled for around 1 minute.**

Failover process

OLD MASTER: db1101
NEW MASTER: db1159

[x] Check configuration differences between new and old master
`$ pt-config-diff h=db1101.eqiad.wmnet,F=/root/.my.cnf h=db1159.eqiad.wmnet,F=/root/.my.cnf`
[x] Silence alerts on all hosts:
`sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T331387" 'A:db-section-m3'`
[] Topology changes: move everything under db1159
`db-switchover --timeout=15 --only-slave-move db1101.eqiad.wmnet db1159.eqiad.wmnet`
[] Disable puppet on db1159 and db1101
`sudo cumin 'db1101* or db1159*' 'disable-puppet "primary switchover T331387"'`
[] Merge gerrit: TO-DO
[] Run puppet on dbproxy1016 and dbproxy1020 and check the config
`run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg`
[] Start the failover:
`!log Failover m3 from db1101 to db1159 - T331387`
[] Set phabricator in RO:
```
ssh phab1004
sudo /srv/phab/phabricator/bin/config set cluster.read-only true
# restart database server
sudo /srv/phab/phabricator/bin/config set cluster.read-only false
```
[] DB switchover
`root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1101 db1159`
[] Reload haproxies
```
dbproxy1016: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
```
[] Kill connections on the old master (db1101)
`pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock`
[] Restart puppet on old and new masters (for heartbeat): db1159 and db1101
`sudo cumin 'db1159* or db1101*' 'run-puppet-agent -e "primary switchover T331387"'`
[] Check services affected: phabricator
[] Clean the orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag:
`delete from heartbeat where server_id=171974854;`
[] Decommission db1101: T329352

Databases on m3: `phabricator`
When: TBD
**Impact: Writes will be disabled for around 1 minute.**

Failover process

OLD MASTER: db1101
NEW MASTER: db1159

[x] Check configuration differences between new and old master
`$ pt-config-diff h=db1101.eqiad.wmnet,F=/root/.my.cnf h=db1159.eqiad.wmnet,F=/root/.my.cnf`
[x] Silence alerts on all hosts:
`sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T331387" 'A:db-section-m3'`
[x] Topology changes: move everything under db1159
`db-switchover --timeout=15 --only-slave-move db1101.eqiad.wmnet db1159.eqiad.wmnet`
[x] Disable puppet on db1159 and db1101
`sudo cumin 'db1101* or db1159*' 'disable-puppet "primary switchover T331387"'`
[x] Merge gerrit: https://gerrit.wikimedia.org/r/c/operations/puppet/+/895299
[x] Run puppet on dbproxy1016 and dbproxy1020 and check the config
`run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg`
[] Start the failover:
`!log Failover m3 from db1101 to db1159 - T331387`
[] Set phabricator in RO:
```
ssh phab1004
sudo /srv/phab/phabricator/bin/config set cluster.read-only true
# restart database server
sudo /srv/phab/phabricator/bin/config set cluster.read-only false
```
[] DB switchover
`root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1101 db1159`
[] Reload haproxies
```
dbproxy1016: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
```
[] Kill connections on the old master (db1101)
`pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock`
[] Restart puppet on old and new masters (for heartbeat): db1159 and db1101
`sudo cumin 'db1159* or db1101*' 'run-puppet-agent -e "primary switchover T331387"'`
[] Check services affected: phabricator
[] Clean the orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag:
`delete from heartbeat where server_id=171974854;`
[] Decommission db1101: T329352
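
The "Reload haproxies" step prints the full `show stat` output, which is a wide CSV and hard to read by eye. A minimal sketch of a filter for it follows, assuming the first line of the output is the usual `# pxname,svname,...` header. The function name `haproxy_server_status` is hypothetical, and the assumption that the proxy's server entries are named after the database hosts (for example `db1159`) is not confirmed by this task.

```shell
#!/usr/bin/env bash
# Sketch only: print the status of one server from HAProxy "show stat" CSV.
# Columns are looked up by header name, so the column order does not matter.
# haproxy_server_status is a hypothetical helper, not part of any existing tool.

# Usage: <show stat output> | haproxy_server_status SVNAME
# Prints "<pxname> <svname> <status>" for each row whose svname matches.
haproxy_server_status() {
    local svname="$1"
    awk -F',' -v want="$svname" '
        NR == 1 {
            sub(/^# /, "", $1)                     # drop the "# " header prefix
            for (i = 1; i <= NF; i++) col[$i] = i  # map column name to index
            next
        }
        $(col["svname"]) == want {
            print $(col["pxname"]), $(col["svname"]), $(col["status"])
        }
    '
}

# Example, run on a dbproxy host (the server name is an assumption):
#   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status db1159
```

If the new master's row does not report `UP` after the reload, that is a cue to re-check `/etc/haproxy/conf.d/db-master.cfg` before continuing with the remaining steps.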
