Switchover m3 master db1159 -> db1101
Closed, Resolved, Public

Description

Databases on m3: phabricator
When: Today ASAP
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1159

NEW MASTER: db1101

  • Check configuration differences between new and old master

$ pt-config-diff h=db1101.eqiad.wmnet,F=/root/.my.cnf h=db1159.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T331384" 'A:db-section-m3'
  • Topology changes: move everything under db1101

db-switchover --timeout=15 --only-slave-move db1159.eqiad.wmnet db1101.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m3 from db1159 to db1101 - T331384
  • Set Phabricator in read-only mode:
    ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # restart database server
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1159 db1101

  • Reload haproxies
dbproxy1016:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1159)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat), db1159 and db1101: sudo cumin 'db1159* or db1101*' 'run-puppet-agent -e "primary switchover T331384"'
  • Check services affected: phabricator
  • Clean the Orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag: delete from heartbeat where server_id=171966512;
  • Close this ticket and create a ticket to promote db1159 back to m3 master once T329073 is finished: T331387
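
The "Reload haproxies" step above dumps the raw "show stat" CSV, which is hard to read at a glance. A minimal sketch of a filter that lists each backend server and its status is shown here. It assumes the standard HAProxy CSV layout (a header line beginning with "# pxname,svname,...") and looks columns up by name rather than by position. The function name haproxy_server_status is hypothetical and is not part of the documented procedure.

```shell
# Sketch: summarise HAProxy "show stat" CSV read from stdin.
# Prints "<proxy> <server> <status>" for every real server row.
haproxy_server_status() {
  awk -F, '
    NR == 1 {
      # The header line is prefixed with "# "; strip it, then index columns by name.
      if (substr($1, 1, 2) == "# ") $1 = substr($1, 3)
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Skip blank lines and the FRONTEND/BACKEND aggregate rows.
    NF > 1 && $col["svname"] != "FRONTEND" && $col["svname"] != "BACKEND" {
      print $col["pxname"], $col["svname"], $col["status"]
    }
  '
}
```

On a proxy host it could be appended to the existing command, for example: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status. After the reload, the new master (db1101) would be expected to appear in the list; the exact proxy and server names depend on the local haproxy configuration.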

Event Timeline

Marostegui updated the task description.

Change 895066 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1101 to m3 master

https://gerrit.wikimedia.org/r/895066

Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-03-07T08:10:42Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384

Mentioned in SAL (#wikimedia-operations) [2023-03-07T08:10:58Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384

Mentioned in SAL (#wikimedia-operations) [2023-03-07T08:16:31Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384

Mentioned in SAL (#wikimedia-operations) [2023-03-07T08:16:37Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384

Change 895066 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1101 to m3 master

https://gerrit.wikimedia.org/r/895066

Mentioned in SAL (#wikimedia-operations) [2023-03-07T08:20:18Z] <marostegui> Failover m3 from db1159 to db1101 - T331384

Marostegui claimed this task.
Marostegui updated the task description.
Marostegui updated the task description.

Done

Mentioned in SAL (#wikimedia-operations) [2023-03-08T06:47:53Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384

Mentioned in SAL (#wikimedia-operations) [2023-03-08T06:48:09Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2134,2160].codfw.wmnet,db[1101,1117,1159].eqiad.wmnet with reason: m3 master switchover T331384