Switchover m2 master db1195 -> db1119
Closed, Resolved (Public)

Description

db1195 needs to be upgraded to Bookworm

When: TBD
Impact: Read-only for a few seconds on the services below:

Services running on m2:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • sockpuppet
  • mwaddlink

Switchover steps:

OLD MASTER: db1195

NEW MASTER: db1119

  • Check configuration differences between new and old master

pt-config-diff h=db1195.eqiad.wmnet,F=/root/.my.cnf h=db1119.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1119

db-switchover --timeout=15 --only-slave-move db1195.eqiad.wmnet db1119.eqiad.wmnet

  • Disable puppet on db1195 and db1119

sudo cumin 'db1195* or db1119*' 'disable-puppet "primary switchover T351638"'

  • Merge gerrit: https://gerrit.wikimedia.org/r/976884
  • Run puppet on dbproxy1023 and dbproxy1025 and check the config

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
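Checking the config by eye after the puppet run can be backed up with a small filter. The sketch below is only an illustration: `list_haproxy_servers` is a hypothetical helper, and it assumes db-master.cfg uses standard HAProxy `server <name> <address> [options]` lines, with the standby host possibly marked by the `backup` keyword. Neither assumption is confirmed by this task.

```shell
# list_haproxy_servers FILE
# Hypothetical helper: prints "<name> <address> <role>" for each `server`
# line in an HAProxy config file. role is "backup" when the line carries
# the `backup` keyword, otherwise "active".
list_haproxy_servers() {
    awk '$1 == "server" {
        role = "active"
        # Options start at field 4: "server <name> <address> [options...]"
        for (i = 4; i <= NF; i++) if ($i == "backup") role = "backup"
        print $2, $3, role
    }' "$1"
}
```

On each proxy this could be run as `list_haproxy_servers /etc/haproxy/conf.d/db-master.cfg`; after the merged change one would expect db1119 to be the active server, but that depends on how the file is actually laid out.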

  • Start the failover

!log Failover m2 from db1195 to db1119 - T351638

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1195 db1119

  • Reload haproxies

dbproxy1013: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1015: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1195)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1195 and db1119

sudo cumin 'db1195* or db1119*' 'run-puppet-agent "primary switchover T351638"'

  • Check services affected (otrs, debmonitor etc)
  • Clean orchestrator heartbeat to remove the old master's entry:
    • sudo db-mysql db1119 heartbeat -e "delete from heartbeat where file like 'db1195%';"
  • Update/resolve phabricator ticket about failover
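For the heartbeat cleanup step above, it may be worth previewing how many rows match before deleting them. The following is a minimal sketch, not part of the original procedure: `heartbeat_cleanup_sql` is a hypothetical helper that only prints SQL, reusing the same `file like 'db1195%'` filter as the delete statement in the checklist.

```shell
# heartbeat_cleanup_sql MODE OLD_MASTER
# MODE is "count" (dry run) or "delete"; OLD_MASTER is a short hostname
# such as db1195. Hypothetical helper: it prints a statement and never
# connects to a database itself.
heartbeat_cleanup_sql() {
    mode=$1
    host=$2
    # Accept only lowercase letters, digits and hyphens so the value is
    # safe to embed inside the quoted LIKE pattern.
    case "$host" in
        ''|*[!a-z0-9-]*) echo "invalid hostname: $host" >&2; return 1 ;;
    esac
    where="where file like '${host}%'"
    case "$mode" in
        count)  printf 'select count(*) from heartbeat %s;\n' "$where" ;;
        delete) printf 'delete from heartbeat %s;\n' "$where" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}
```

A possible usage, following the `db-mysql` invocation from the checklist: run `sudo db-mysql db1119 heartbeat -e "$(heartbeat_cleanup_sql count db1195)"` first, and switch `count` to `delete` only once the count looks plausible.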

Event Timeline

Marostegui updated the task description.

Icinga downtime and Alertmanager silence (ID=947e8f0b-b0a7-4a94-bbb7-08d742280fb0) set by marostegui@cumin1001 for 2:00:00 on 6 host(s) and their services with reason: Switch

db[2133,2160].codfw.wmnet,db[1118-1119,1195,1217].eqiad.wmnet

Change 976884 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1119 to m2 master

https://gerrit.wikimedia.org/r/976884

Change 976884 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1119 to m2 master

https://gerrit.wikimedia.org/r/976884

Mentioned in SAL (#wikimedia-operations) [2023-11-23T06:44:23Z] <marostegui> Failover m2 from db1195 to db1119 - T351638

Marostegui updated the task description.

This was done; the read-only period lasted just 5 seconds.