Switchover m2 master db1159 -> db1164
Closed, Resolved, Public

Description

db1159 needs to be rebooted, so let's promote db1164 to m2 master.

When: Monday 29th Aug 2022 at 08:30 UTC
Impact: Read-only for roughly 15-30 seconds on the services below (reads are not affected):

Services running on m2:

Switchover steps:

OLD MASTER: db1159

NEW MASTER: db1164

  • Check configuration differences between new and old master

pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1164.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1164

db-switchover --timeout=1 --only-slave-move db1159.eqiad.wmnet db1164.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m2 from db1159 to db1164 - T316202

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1159 db1164

  • Reload haproxies

dbproxy1013: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1159)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1159 and db1164

puppet agent --enable && run-puppet-agent

  • Check services affected (otrs, debmonitor, etc.)
  • Clean orchestrator heartbeat to remove the old master's entry:
    • server_id=171966512
  • Create floating ticket for db1159 to be moved to m3: T316500
  • Update/resolve Phabricator ticket about failover
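
The haproxy reload step above pipes `show stat` through socat, which prints one long CSV line per frontend, backend and server. As a rough sketch (not part of the actual procedure), the output can be narrowed down to the columns that matter after a switchover. It relies on `status` being the 18th column of HAProxy's CSV stats format. The helper name `haproxy_status`, the proxy name `mariadb` and the sample rows are made up for illustration and are not real output from dbproxy1013 or dbproxy1015.

```shell
# Hypothetical helper: print proxy name, server name and status from
# `show stat` CSV output read on stdin. In HAProxy's CSV stats format the
# columns begin with pxname,svname,... and `status` is the 18th field.
haproxy_status() {
    awk -F, '
        /^#/ || NF < 18 { next }   # skip the header line and blank lines
        { print $1, $2, $18 }
    '
}

# Made-up sample input, for illustration only.
sample='# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight
mariadb,db1164,0,0,3,5,,120,1000,2000,,0,,0,0,0,0,UP,1
mariadb,db1159,0,0,0,1,,10,100,200,,0,,0,0,0,0,DOWN,1'

printf '%s\n' "$sample" | haproxy_status
```

On a proxy host the same filter would sit at the end of the existing pipeline, for example `echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status`, assuming the function has been defined in that shell.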

Event Timeline

Marostegui created this task.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added subscribers: Tgr, Dzahn, Arnoldokoth and 7 others.

I am planning to do this switchover on Monday 29th at 08:30 UTC. The expected impact is around 15-30 seconds of read-only (RO) time. Reads won't be affected.

Change 827177 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1013,dbproxy1015: Add db1164 as standby

https://gerrit.wikimedia.org/r/827177

Change 827177 merged by Marostegui:

[operations/puppet@production] dbproxy1013,dbproxy1015: Add db1164 as standby

https://gerrit.wikimedia.org/r/827177

Change 827398 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1164 to m2 master

https://gerrit.wikimedia.org/r/827398

Mentioned in SAL (#wikimedia-operations) [2022-08-29T07:17:37Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2133,2160].codfw.wmnet,db[1117,1159,1164].eqiad.wmnet with reason: Switchover m2 T316202

Mentioned in SAL (#wikimedia-operations) [2022-08-29T07:17:54Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2133,2160].codfw.wmnet,db[1117,1159,1164].eqiad.wmnet with reason: Switchover m2 T316202

Change 827398 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1164 to m2 master

https://gerrit.wikimedia.org/r/827398

Mentioned in SAL (#wikimedia-operations) [2022-08-29T08:31:32Z] <marostegui> Failover m2 from db1159 to db1164 - T316202

Marostegui updated the task description.
Marostegui updated the task description.

All done