
Switchover m2 master db1119 -> db1195
Closed, Resolved, Public

Description

db1119 needs to be upgraded to Bookworm

When: TBD
Impact: Read only for a few seconds on the services below:

Services running on m2:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • sockpuppet
  • mwaddlink

Switchover steps:

OLD MASTER: db1119

NEW MASTER: db1195

  • Check configuration differences between new and old master

pt-config-diff h=db1119.eqiad.wmnet,F=/root/.my.cnf h=db1195.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1195

db-switchover --timeout=15 --only-slave-move db1119.eqiad.wmnet db1195.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m2 from db1119 to db1195 - T351863

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1119 db1195

  • Reload haproxies

dbproxy1013:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1119)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1119 and db1195

sudo cumin 'db1119* or db1195*' 'run-puppet-agent "primary switchover T351863"'

  • Check services affected (otrs, debmonitor etc)
  • Clean orchestrator heartbeat to remove the old master's rows:
    • sudo db-mysql db1195 heartbeat -e "delete from heartbeat where file like 'db1119%';"
  • Update/resolve phabricator ticket about failover
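
The checklist above repeats the same two hostnames and the task ID in several commands, so a dry-run helper can be a convenient way to review the exact command lines before running them by hand. The following is only a sketch: the function name `print_switchover_plan`, its argument order, and the default domain are assumptions made for illustration and are not part of this task or of any existing tooling. It only prints the commands listed in the steps above and executes none of them.

```shell
#!/bin/sh
# Dry-run helper (hypothetical, not part of wmfmariadbpy): prints the
# switchover commands from the checklist for a given old/new master pair.
# Nothing is executed; each command line is only written to stdout.

print_switchover_plan() {
    old="$1"                      # old master short name, e.g. db1119
    new="$2"                      # new master short name, e.g. db1195
    task="$3"                     # task ID used in log messages, e.g. T351863
    domain="${4:-eqiad.wmnet}"    # assumed default; override for other sites

    if [ -z "$old" ] || [ -z "$new" ] || [ -z "$task" ]; then
        echo "usage: print_switchover_plan OLD NEW TASK [DOMAIN]" >&2
        return 1
    fi

    # Unquoted heredoc: variables expand, while quotes, * and % stay literal.
    cat <<EOF
pt-config-diff h=${old}.${domain},F=/root/.my.cnf h=${new}.${domain},F=/root/.my.cnf
db-switchover --timeout=15 --only-slave-move ${old}.${domain} ${new}.${domain}
db-switchover --skip-slave-move ${old} ${new}
pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
sudo cumin '${old}* or ${new}*' 'run-puppet-agent "primary switchover ${task}"'
sudo db-mysql ${new} heartbeat -e "delete from heartbeat where file like '${old}%';"
EOF
}

# Example: print the plan for this task's hosts.
print_switchover_plan db1119 db1195 T351863
```

The printed lines still have to be run on the right hosts (for example, pt-kill on the old master), and the alert silencing, haproxy reload, and service checks remain manual steps that this sketch does not cover.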

Event Timeline

Marostegui triaged this task as Medium priority. Nov 23 2023, 6:54 AM
Marostegui created this task.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Icinga downtime and Alertmanager silence (ID=2e22f7dd-e11a-479b-9b69-723c75e5fba0) set by marostegui@cumin1001 for 2:00:00 on 6 host(s) and their services with reason: Switch

db[2133,2160].codfw.wmnet,db[1118-1119,1195,1217].eqiad.wmnet

Change 977319 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1195 to m2 master

https://gerrit.wikimedia.org/r/977319

Change 977319 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1195 to m2 master

https://gerrit.wikimedia.org/r/977319

Mentioned in SAL (#wikimedia-operations) [2023-11-27T06:40:16Z] <marostegui> Failover m2 from db1119 to db1195 - T351863

Marostegui updated the task description.
Marostegui updated the task description.

This was done; read-only (RO) time was around 7 seconds.