
Switchover m5 master (db1107 -> db1183)
Closed, Resolved (Public)

Description

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub
cxserverdb

When: Thursday 1 September 2022, 14:00 UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1107

NEW MASTER: db1183

  • Check configuration differences between new and old master

$ pt-config-diff h=db1107.eqiad.wmnet,F=/root/.my.cnf h=db1183.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1183

db-switchover --timeout=25 --only-slave-move db1107.eqiad.wmnet db1183.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1107 to db1183 - T316744
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --read-only-master --skip-slave-move db1107 db1183

  • Reload haproxies (dbproxy1021 is the active one)

dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
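
The raw "show stat" output is a wide CSV, so the state of each backend is easy to miss after the reload. A minimal sketch of a filter that keeps only the proxy name, server name and status columns follows. The function name haproxy_status_summary is hypothetical, and which server should report UP depends on the haproxy configuration, which is not shown in this task.

```shell
# Sketch: summarise HAProxy "show stat" CSV output read from stdin.
# The first line is a header of the form "# pxname,svname,...", so the
# columns are looked up by name instead of by fixed position.
# haproxy_status_summary is a hypothetical helper name.
haproxy_status_summary() {
    awk -F',' '
        NR == 1 {
            sub(/^# /, "", $1)                     # drop the leading "# " of the header
            for (i = 1; i <= NF; i++) col[$i] = i  # map column name -> index
            next
        }
        NF > 1 && ("status" in col) {
            # one line per proxy/server pair: "<pxname>/<svname> <status>"
            printf "%s/%s %s\n", $(col["pxname"]), $(col["svname"]), $(col["status"])
        }
    '
}
```

It can be appended to the command used above: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status_summary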
  • Kill connections on the old master (db1107)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
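
pt-kill only prints the matching connections when --print is given without --kill, which allows a preview before anything is terminated. A small sketch that builds both variants of the command above follows. The helper name pt_kill_cmd is hypothetical and not part of Percona Toolkit; the options and DSN are copied from the step above.

```shell
# Sketch: print the pt-kill command line for a preview or for the real kill.
# pt_kill_cmd is a hypothetical helper; it only prints the command and
# does not run pt-kill itself.
pt_kill_cmd() {
    mode="${1:-dry-run}"
    kill_flag=""
    if [ "$mode" = "kill" ]; then
        kill_flag=" --kill"      # add --kill only when explicitly requested
    fi
    printf 'pt-kill --print%s --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock\n' "$kill_flag"
}

# Usage: review the preview first, then run the kill variant.
#   pt_kill_cmd        # prints the --print-only command
#   pt_kill_cmd kill   # prints the command used in this checklist
```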

  • Restart puppet on old and new masters (for heartbeat): db1107 and db1183

puppet agent --enable && puppet agent -tv

  • Check affected services
  • Clean the Orchestrator heartbeat table to remove the old master's row; otherwise Orchestrator will show lag

delete from heartbeat where server_id=171966678;

  • Close this ticket and create a ticket to move db1107 to s1: T316870
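
For the Orchestrator heartbeat cleanup step above, the server_id in the DELETE is easy to mistype. A minimal sketch that validates the id and prints a count query before the delete follows, so the affected rows can be checked first. The helper name heartbeat_cleanup_sql is hypothetical. It assumes the statements are run in the schema that holds the heartbeat table, which this task does not name, and that 171966678 is the old master's server_id, as the checklist implies.

```shell
# Sketch: emit a check query and the DELETE for one heartbeat server_id.
# heartbeat_cleanup_sql is a hypothetical helper; it prints SQL only and
# does not connect to any database.
heartbeat_cleanup_sql() {
    server_id="$1"
    case "$server_id" in
        ''|*[!0-9]*)
            # reject empty or non-numeric input before building any SQL
            echo "server_id must be a positive integer" >&2
            return 1
            ;;
    esac
    printf 'SELECT COUNT(*) FROM heartbeat WHERE server_id=%s;\n' "$server_id"
    printf 'DELETE FROM heartbeat WHERE server_id=%s;\n' "$server_id"
}

# Usage: heartbeat_cleanup_sql 171966678
```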

Event Timeline

Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

@Andrew @bd808 if this hour (14:00 UTC) doesn't work for you, just let me know so I can adjust it!

I have a meeting conflict, but at least I'll be awake; I can duck out of the meeting if something breaks.

Should work fine for me. I will set my alarm clock a bit earlier.

Change 828903 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1183 to m5 master

https://gerrit.wikimedia.org/r/828903

Mentioned in SAL (#wikimedia-operations) [2022-09-01T13:31:44Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1107,1117,1183].eqiad.wmnet with reason: switchover m5 T316744

Mentioned in SAL (#wikimedia-operations) [2022-09-01T13:32:00Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1107,1117,1183].eqiad.wmnet with reason: switchover m5 T316744

Change 828903 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1183 to m5 master

https://gerrit.wikimedia.org/r/828903

Mentioned in SAL (#wikimedia-operations) [2022-09-01T14:00:09Z] <marostegui> Failover m5 from db1107 to db1183 - T316744

Marostegui updated the task description.