
Switchover m5 master db1176 -> db1119
Closed, Resolved (Public)

Description

db1176 needs to be reimaged

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1176

NEW MASTER: db1119

  • Check configuration differences between new and old master

$ pt-config-diff h=db1119.eqiad.wmnet,F=/root/.my.cnf h=db1176.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --minutes 60 -r "m5 master switch T352505" 'A:db-section-m5'
  • Topology changes: move everything under db1119

db-switchover --timeout=15 --only-slave-move db1176.eqiad.wmnet db1119.eqiad.wmnet
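
A quick verification sketch (not part of the original checklist; assumes root socket access on the new master): once the replicas have been moved, db1119 should list every m5 replica.

sudo mysql -e "SHOW SLAVE HOSTS"    # run on db1119.eqiad.wmnet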

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
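
As a sanity check after the puppet run (illustrative; assumes the new master's hostname appears verbatim in the generated config), the proxy config can simply be grepped for db1119:

grep db1119 /etc/haproxy/conf.d/db-master.cfg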

  • Start the failover: !log Failover m5 from db1176 to db1119 - T352505
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1176 db1119
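
A verification sketch (not part of the original runbook): on the old master, replication should now point at db1119 and the server should be read-only.

sudo mysql -e "SHOW SLAVE STATUS\G" | grep Master_Host    # expect db1119.eqiad.wmnet
sudo mysql -e "SELECT @@read_only"                        # expect 1 on db1176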

  • Reload haproxies (dbproxy1021 is the active one)
dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
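
To confirm both proxies see the new backend, the stat output can be filtered (a sketch; assumes the backend server entry is named after the host and that field 18 of haproxy's CSV "show stat" output is the server status):

echo "show stat" | socat /run/haproxy/haproxy.sock stdio | awk -F, '{print $1, $2, $18}' | grep db1119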
  • Kill connections on the old master (db1176)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
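
Afterwards, a quick check (illustrative, assuming root socket access) that nothing besides replication and monitoring is still connected to db1176:

sudo mysql -e "SHOW PROCESSLIST"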

  • Restart puppet on old and new masters (for heartbeat): db1119 and db1176

sudo cumin 'db1119* or db1176*' 'run-puppet-agent -e "primary switchover T352505"'
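
A heartbeat sanity check (a sketch, assuming the standard pt-heartbeat table in the heartbeat database): after the puppet runs, only the new master's row should keep advancing.

sudo mysql heartbeat -e "SELECT server_id, ts FROM heartbeat ORDER BY ts DESC"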
  • Check affected services
  • Clean the Orchestrator heartbeat to remove the old master's entry, otherwise Orchestrator will show lag: delete from heartbeat where server_id = 171966607;
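
If in doubt about which server_id to remove, it can be cross-checked on the old master first (illustrative; the value should match the one used in the delete above):

sudo mysql -e "SELECT @@server_id"    # run on db1176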

Event Timeline

Marostegui triaged this task as Medium priority. Dec 1 2023, 6:47 AM
Marostegui created this task.
Marostegui updated the task description.
Marostegui updated the task description.

@taavi given that you are in the EU timezone, I am planning to do this early on Monday morning. During the past m5 switchovers nothing was needed from WMCS in terms of service restarts/reloads, so this is just a heads-up.

Marostegui updated the task description.

Change 979198 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1021,27: Test db1119

https://gerrit.wikimedia.org/r/979198

Change 979198 merged by Marostegui:

[operations/puppet@production] dbproxy1021,27: Test db1119

https://gerrit.wikimedia.org/r/979198

Mentioned in SAL (#wikimedia-operations) [2023-12-04T06:46:57Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352505

Mentioned in SAL (#wikimedia-operations) [2023-12-04T06:47:15Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352505

Change 979489 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1119 to m5 master

https://gerrit.wikimedia.org/r/979489

Change 979489 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1119 to m5 master

https://gerrit.wikimedia.org/r/979489

Marostegui updated the task description.
Marostegui updated the task description.

This was done; read-only (RO) time was around 10 seconds.

I will switch back tomorrow once db1176 has been reimaged.