
Switchover m5 master db1119 -> db1176
Closed, Resolved (Public)

Description

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1119

NEW MASTER: db1176

  • Check configuration differences between new and old master

$ pt-config-diff h=db1176.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --minutes 60 -r "m5 master switch T352631" 'A:db-section-m5'
  • Topology changes: move everything under db1176

db-switchover --timeout=15 --only-slave-move db1119.eqiad.wmnet db1176.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1119 to db1176 - T352631
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1119 db1176

  • Reload haproxies (dbproxy1021 is the active one)

dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1119)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
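The raw "show stat" output from the haproxy reload step above is a wide CSV, which is hard to read at a glance. A minimal sketch of a filter that keeps only the per-server status follows. The function name haproxy_server_status is hypothetical, and it assumes the standard HAProxy stats CSV layout, where the status is the 18th field:

```shell
#!/bin/sh
# Hypothetical helper, not part of the original procedure.
# Reads HAProxy "show stat" CSV on stdin and prints "proxy/server STATUS"
# for each real server, assuming the status is the 18th CSV field.
haproxy_server_status() {
    awk -F, '
        /^#/ || NF < 18 { next }                      # skip the header and blank lines
        $2 == "FRONTEND" || $2 == "BACKEND" { next }  # skip aggregate rows
        { print $1 "/" $2 " " $18 }
    '
}

# Possible usage on a proxy host (socket path taken from the step above):
#   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status
```

After the reload, the new master would be expected to show as UP on both proxies; the exact proxy and server names in the output depend on the local haproxy configuration.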

  • Restart puppet on the old and new masters, db1119 and db1176 (for heartbeat): sudo cumin 'db1176* or db1119*' 'run-puppet-agent -e "primary switchover T352631"'
  • Check affected services
  • Clean up the orchestrator heartbeat to remove the old master's entry; otherwise Orchestrator will show lag: delete from heartbeat where server_id = 171970573;
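
For the heartbeat cleanup in the last step, it may help to look at the row before deleting it. A minimal sketch of a helper that only prints the statements for review, without connecting to any database, follows. The function name heartbeat_cleanup_sql is hypothetical; the table name heartbeat is taken from the step above, and the schema it lives in is not stated in this task:

```shell
#!/bin/sh
# Hypothetical helper, not part of the original procedure.
# Prints a SELECT (to review the row) followed by the DELETE for one
# heartbeat entry. Rejects anything that is not a plain decimal server_id.
heartbeat_cleanup_sql() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: heartbeat_cleanup_sql <numeric server_id>" >&2
            return 1
            ;;
    esac
    printf 'SELECT * FROM heartbeat WHERE server_id = %s;\n' "$1"
    printf 'DELETE FROM heartbeat WHERE server_id = %s;\n' "$1"
}

# Possible usage with the server_id from the step above:
#   heartbeat_cleanup_sql 171970573
```

The output is meant to be pasted into a client session by hand, so the SELECT result can be checked before the DELETE is run.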

Event Timeline

Marostegui triaged this task as Medium priority. Dec 4 2023, 7:05 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

This will be done early tomorrow morning, EU time.

Mentioned in SAL (#wikimedia-operations) [2023-12-05T06:17:05Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352631

Mentioned in SAL (#wikimedia-operations) [2023-12-05T06:17:22Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352631

Mentioned in SAL (#wikimedia-operations) [2023-12-05T06:23:50Z] <marostegui> Failover m5 from db1119 to db1176 - T352631

Marostegui updated the task description.
Marostegui updated the task description.

This was done; read-only time was around 10 seconds.