
Switchover m2 master db1119 -> db1195
Closed, Resolved (Public)

Description

db1119 needs to be upgraded to Bookworm

When: TBD
Impact: Read only for a few seconds on the services below:

Services running on m2:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • sockpuppet
  • mwaddlink

Switchover steps:

OLD MASTER: db1119

NEW MASTER: db1195

  • Check configuration differences between new and old master

pt-config-diff h=db1119.eqiad.wmnet,F=/root/.my.cnf h=db1195.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1195

db-switchover --timeout=15 --only-slave-move db1119.eqiad.wmnet db1195.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m2 from db1119 to db1195 - T351863

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1119 db1195

  • Reload haproxies

dbproxy1013:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
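
The `show stat` output printed after each reload is a wide CSV, so the server states can be hard to spot by eye. A minimal sketch of a filter follows, assuming the usual HAProxy CSV layout where the header line starts with `# pxname,svname,`. The function name `haproxy_server_status` is hypothetical and is not part of any existing tooling.

```shell
# Hypothetical helper: reads HAProxy "show stat" CSV on stdin and prints
# "proxy server status" for each real server line.
haproxy_server_status() {
  awk -F, '
    # Header line: strip the leading "# " and remember each column position.
    /^# / {
      sub(/^# /, "")
      for (i = 1; i <= NF; i++) idx[$i] = i
      next
    }
    # Data lines: skip the aggregate FRONTEND/BACKEND rows and blank lines.
    ("svname" in idx) && NF > 1 {
      sv = $(idx["svname"])
      if (sv != "FRONTEND" && sv != "BACKEND")
        print $(idx["pxname"]), sv, $(idx["status"])
    }'
}
```

It could be appended to the command above, for example: `echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status`. Columns are looked up by header name rather than by fixed position, since the number of CSV fields differs between HAProxy versions.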

  • Kill connections on the old master (db1119)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1119 and db1195

sudo cumin 'db1119* or db1195*' 'run-puppet-agent "primary switchover T351863"'

  • Check services affected (otrs, debmonitor etc)
  • Clean up the orchestrator heartbeat table to remove the old master's rows:
    • sudo db-mysql db1195 heartbeat -e "delete from heartbeat where file like 'db1119%';"
  • Update/resolve phabricator ticket about failover
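
The heartbeat cleanup above deletes rows by a `LIKE` prefix, so an empty or mistyped host name would widen the match (an empty prefix becomes `'%'` and matches every row). A small sketch of a guard that builds the statement only for a plain host name follows. The function name `heartbeat_cleanup_sql` is hypothetical, and the lowercase-alphanumeric check is an assumption about how the database hosts are named.

```shell
# Hypothetical helper: print the heartbeat cleanup statement for one old master.
# Refuses empty names and anything other than lowercase letters and digits,
# so wildcards such as "%" or "_" cannot reach the LIKE pattern.
heartbeat_cleanup_sql() {
  old_master="$1"
  case "$old_master" in
    ''|*[!a-z0-9]*)
      echo "refusing unsafe host name: '$old_master'" >&2
      return 1
      ;;
  esac
  printf "delete from heartbeat where file like '%s%%';\n" "$old_master"
}
```

The output can be reviewed first and then passed to the same command used in the step above, for example: `sudo db-mysql db1195 heartbeat -e "$(heartbeat_cleanup_sql db1119)"`.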

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Icinga downtime and Alertmanager silence (ID=2e22f7dd-e11a-479b-9b69-723c75e5fba0) set by marostegui@cumin1001 for 2:00:00 on 6 host(s) and their services with reason: Switch

db[2133,2160].codfw.wmnet,db[1118-1119,1195,1217].eqiad.wmnet

Change 977319 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1195 to m2 master

https://gerrit.wikimedia.org/r/977319

Change 977319 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1195 to m2 master

https://gerrit.wikimedia.org/r/977319

Mentioned in SAL (#wikimedia-operations) [2023-11-27T06:40:16Z] <marostegui> Failover m2 from db1119 to db1195 - T351863

Marostegui updated the task description.
Marostegui updated the task description.

This was done; read-only time was around 7 seconds.