
Switchover m5 master (db1106 -> db1176)
Closed, Resolved, Public

Description

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

When: Thursday 16th March 2023 at 08:00 UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1106

NEW MASTER: db1176

  • Check configuration differences between new and old master

$ pt-config-diff h=db1176.eqiad.wmnet,F=/root/.my.cnf h=db1106.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --minutes 60 -r "m5 master switch T332155" 'A:db-section-m5'
  • Topology changes: move everything under db1176

db-switchover --timeout=15 --only-slave-move db1106.eqiad.wmnet db1176.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1106 to db1176 - T332155
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1106 db1176

  • Reload haproxies (dbproxy1021 is the active one)

dbproxy1017: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1106)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1176 and db1106

sudo cumin 'db1176* or db1106*' 'run-puppet-agent -e "primary switchover T332155"'

  • Check affected services
  • Clean the Orchestrator heartbeat table to remove the old master's row, otherwise Orchestrator will show lag: delete from heartbeat where server_id=171978765;
  • Create a ticket to move db1106 back to s1: T332270
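
The heartbeat cleanup step above deletes a row by server_id, so it may be worth looking at the row before removing it. The following is only a sketch: the helper name heartbeat_cleanup_sql is hypothetical, and the ts column is assumed from the standard pt-heartbeat table layout, not confirmed for this setup. It prints the statements for review and does not connect to any database.

```shell
#!/bin/sh
# Sketch only: print the SQL to review before cleaning a stale heartbeat row.
# heartbeat_cleanup_sql is a hypothetical helper, not part of any WMF tooling.
# The `ts` column is an assumption based on the stock pt-heartbeat table.
heartbeat_cleanup_sql() {
    server_id="$1"
    # Refuse anything that is not a plain non-empty string of digits.
    case "$server_id" in
        ''|*[!0-9]*) echo "server_id must be numeric" >&2; return 1 ;;
    esac
    # First statement shows the row, second one removes it.
    printf 'SELECT server_id, ts FROM heartbeat WHERE server_id=%s;\n' "$server_id"
    printf 'DELETE FROM heartbeat WHERE server_id=%s;\n' "$server_id"
}

# server_id taken from the checklist step above.
heartbeat_cleanup_sql 171978765
```

Printing the statements instead of executing them keeps the operator in the loop: the SELECT can be run first to confirm the row really belongs to the old master before the DELETE is pasted.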

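The haproxy reload step in the checklist pipes "show stat" through socat, which returns a wide CSV that is hard to read at a glance. As a rough sketch, the output could be reduced to one line per server with its status. The function name haproxy_status is hypothetical; the column names pxname, svname and status come from the CSV header that HAProxy prints as the first line.

```shell
#!/bin/sh
# Sketch only: condense HAProxy "show stat" CSV (read from stdin) into
# "proxy/server STATUS" lines. haproxy_status is a hypothetical helper.
haproxy_status() {
    awk -F, '
        NR == 1 {
            # The header line starts with "# "; strip it, then map names to indexes.
            sub(/^# /, "", $1)
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Skip blank lines and the aggregate FRONTEND/BACKEND rows.
        NF > 1 && $(col["svname"]) != "FRONTEND" && $(col["svname"]) != "BACKEND" {
            print $(col["pxname"]) "/" $(col["svname"]) " " $(col["status"])
        }
    '
}

# Possible usage on a proxy host, reusing the command from the checklist:
#   echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status
```

Looking columns up by header name rather than by fixed position is a deliberate choice here, since the number of CSV fields differs between HAProxy versions.
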
Event Timeline

Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-03-16T05:59:28Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: m5 master switch T332155

Mentioned in SAL (#wikimedia-operations) [2023-03-16T05:59:55Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: m5 master switch T332155

Mentioned in SAL (#wikimedia-operations) [2023-03-16T06:03:50Z] <marostegui> Failover m5 from db1106 to db1176 - T332155

Marostegui updated the task description.
Marostegui updated the task description.

This was done; read-only (RO) time was around 12 seconds.

Mentioned in SAL (#wikimedia-operations) [2023-12-04T06:57:20Z] <marostegui> Failover m5 from db1176 to db1119 - T332155