Switchover m2 master db1164 -> db1195
Closed, Resolved (Public)

Description

db1164 needs to be rebooted, so let's promote db1195 to master.

When: TBD
Impact: Read only for a few seconds on the services below:

Services running on m2:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • sockpuppet
  • mwaddlink

Switchover steps:

OLD MASTER: db1164

NEW MASTER: db1195

  • Check configuration differences between new and old master

pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1195.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1195

db-switchover --timeout=1 --only-slave-move db1164.eqiad.wmnet db1195.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg
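The topology move above and the failover step further down call the same tool with different flags. As a rough sketch, the two command lines can be printed for review before anything is run. The function name `switchover_cmds` and the default domain argument are hypothetical, introduced only for illustration; the flags are exactly the ones used in this task, and nothing is executed.

```shell
#!/bin/sh
# Hypothetical helper (not part of wmfmariadbpy): print, but do not run,
# the two db-switchover invocations used in this checklist.
switchover_cmds() {
    old="$1"
    new="$2"
    domain="${3:-eqiad.wmnet}"   # assumption: both hosts share one domain
    # Invocation used under "Topology changes" (fully qualified host names).
    printf 'db-switchover --timeout=1 --only-slave-move %s.%s %s.%s\n' \
        "$old" "$domain" "$new" "$domain"
    # Invocation used under "Start the failover" (short host names).
    printf 'db-switchover --skip-slave-move %s %s\n' "$old" "$new"
}

switchover_cmds db1164 db1195
```

Printing first keeps the old and new master in one place, which makes it harder to swap the two arguments by accident.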

  • Start the failover

!log Failover m2 from db1164 to db1195 - T328253

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1164 db1195

  • Reload haproxies

dbproxy1013: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
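The `show stat` output from the reload step above is a wide CSV, so the server status is easy to miss. A minimal sketch of a filter that keeps only the proxy name, server name and status columns (fields 1, 2 and 18 in HAProxy's CSV layout) is shown here. The function name `haproxy_status` and the sample rows are hypothetical and are not output from these hosts.

```shell
#!/bin/sh
# Hypothetical helper: reduce HAProxy "show stat" CSV (read from stdin)
# to "pxname svname status". Header lines start with "#" and are skipped,
# as are blank or short lines.
haproxy_status() {
    awk -F, '!/^#/ && NF >= 18 { print $1, $2, $18 }'
}

# Made-up sample input for illustration; real output has more columns and rows.
sample='# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,
mariadb,db1195,0,0,3,9,,120,0,0,,0,,0,0,0,0,UP,
mariadb,db1164,0,0,0,0,,0,0,0,,0,,0,0,0,0,MAINT,'

printf '%s\n' "$sample" | haproxy_status
```

On a proxy host this could be appended to the existing command, for example `echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status`.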

  • Kill connections on the old master (db1164)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1164 and db1195

puppet agent --enable && run-puppet-agent

  • Check services affected (otrs, debmonitor etc)
  • Clean up the orchestrator heartbeat to remove the old master's entry:
    • server_id=171966512
  • Create floating ticket for db1164 to be moved to m3: T328402
  • Update/resolve Phabricator ticket about failover
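The heartbeat cleanup step above only lists the `server_id` to remove. One possible way to prepare the statement is sketched below. It assumes the rows live in a `heartbeat.heartbeat` table with a `server_id` column, following the usual pt-heartbeat layout; this task does not name the database or table, so verify both before running anything. The function name `heartbeat_cleanup_sql` is hypothetical, and the sketch only prints SQL.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that would delete one server's
# heartbeat rows. ASSUMPTION: table heartbeat.heartbeat with a server_id
# column (pt-heartbeat layout); not confirmed by this task.
heartbeat_cleanup_sql() {
    server_id="$1"
    # Accept only a non-empty string of digits.
    case "$server_id" in
        ''|*[!0-9]*)
            echo "invalid server_id: $server_id" >&2
            return 1
            ;;
    esac
    printf 'DELETE FROM heartbeat.heartbeat WHERE server_id = %s;\n' "$server_id"
}

heartbeat_cleanup_sql 171966512
```

Validating the argument first avoids emitting a `DELETE` with an empty or malformed condition.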

Event Timeline

Marostegui triaged this task as Medium priority. (Jan 30 2023, 7:39 AM)
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Change 885264 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1195: Enable notifications

https://gerrit.wikimedia.org/r/885264

Change 885264 merged by Marostegui:

[operations/puppet@production] db1195: Enable notifications

https://gerrit.wikimedia.org/r/885264

Change 885265 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1195 to m2 master

https://gerrit.wikimedia.org/r/885265

Mentioned in SAL (#wikimedia-operations) [2023-01-31T07:06:28Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2133,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Primary switchover m2 T328253

Mentioned in SAL (#wikimedia-operations) [2023-01-31T07:06:44Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2133,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Primary switchover m2 T328253

Change 885265 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1195 to m2 master

https://gerrit.wikimedia.org/r/885265

Mentioned in SAL (#wikimedia-operations) [2023-01-31T07:10:23Z] <marostegui> Failover m2 from db1164 to db1195 - T328253