
Switchover m2 master db1195 -> db1228
Closed, Resolved · Public

Description

db1195 needs a reboot

When: TBD
Impact: read-only mode for a few seconds for the services below:

Services running on m2:

  • otrs
  • debmonitor
  • xhgui
  • recommendationapi
  • iegreview
  • sockpuppet
  • mwaddlink

Switchover steps:

OLD MASTER: db1195

NEW MASTER: db1228

Check configuration differences between the new and old master:

  • pt-config-diff h=db1195.eqiad.wmnet,F=/root/.my.cnf h=db1228.eqiad.wmnet,F=/root/.my.cnf
  • Silence alerts on all hosts
  • Topology changes: move everything under db1228 (a verification sketch follows the command below)

db-switchover --timeout=15 --only-slave-move db1195.eqiad.wmnet db1228.eqiad.wmnet
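
A quick sanity check for the two steps above (a sketch: it assumes pt-config-diff's documented exit codes, and that db-mysql passes -e through to the mysql client as in the heartbeat cleanup step further down):

# pt-config-diff exits 0 when both configs match; review any reported differences first
pt-config-diff h=db1195.eqiad.wmnet,F=/root/.my.cnf h=db1228.eqiad.wmnet,F=/root/.my.cnf \
  && echo "configs match" || echo "differences found, review before proceeding"

# After the topology move, every replica except db1195 should report to db1228
# (SHOW SLAVE HOSTS relies on replicas having report_host set)
sudo db-mysql db1228 -e "SHOW SLAVE HOSTS;"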

  • Disable puppet on db1195 and db1228:

sudo cumin 'db1195* or db1228*' 'disable-puppet "primary switchover T368494"'

  • Merge gerrit: https://gerrit.wikimedia.org/r/1050814
  • Run puppet on dbproxy1023 and dbproxy1025 and check the config:

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg
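
A minimal check that the rendered proxy config now points at the new master (assuming the host names appear literally in db-master.cfg):

grep -c 'db1228' /etc/haproxy/conf.d/db-master.cfg   # expect >= 1
grep -c 'db1195' /etc/haproxy/conf.d/db-master.cfg   # expect 0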

  • Start the failover

!log Failover m2 from db1195 to db1228 - T368494

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1195 db1228
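
db-switchover should leave the old master read-only and the new master writable; a quick sketch to confirm (same db-mysql -e assumption as above):

sudo db-mysql db1195 -e "SELECT @@global.read_only;"   # expect 1 (read-only)
sudo db-mysql db1228 -e "SELECT @@global.read_only;"   # expect 0 (writable)
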
  • Reload the haproxies and check backend status:
dbproxy1023:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1025:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
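
The raw "show stat" output is CSV; filtering it down to proxy, server and status makes the check quicker (field positions follow haproxy's stats CSV layout: 1=pxname, 2=svname, 18=status):

echo "show stat" | socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18   # the db1228 backend should be UP
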
  • Kill connections on the old master (db1195):

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
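
After pt-kill, it is worth confirming that only system and replication threads remain on the old master (a sketch, same db-mysql assumption as above):

sudo db-mysql db1195 -e "SHOW PROCESSLIST;"   # no application connections should remain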

  • Restart puppet on old and new masters (for heartbeat): db1195 and db1228

sudo cumin 'db1195* or db1228*' 'run-puppet-agent -e "primary switchover T368494"'
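
Once puppet is back on, pt-heartbeat on the new master should be writing fresh rows under its own binlog file name. A sketch mirroring the cleanup query below (the file column is taken from that step; ts is assumed from the standard pt-heartbeat schema):

sudo db-mysql db1228 heartbeat -e "SELECT file, ts FROM heartbeat ORDER BY ts DESC LIMIT 5;"
# expect recent rows with file like 'db1228%'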

  • Check the affected services (otrs, debmonitor, etc.)
  • Clean up orchestrator heartbeat to remove the old master's entries:
    • sudo db-mysql db1228 heartbeat -e "delete from heartbeat where file like 'db1195%';"
  • Update/resolve the Phabricator ticket about the failover
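
After the heartbeat cleanup, a last look at the same table should show no rows left from the old master:

sudo db-mysql db1228 heartbeat -e "SELECT DISTINCT file FROM heartbeat;"   # no db1195% files should remain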

Event Timeline

Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Change #1049664 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] m2 proxies: Test db1228

https://gerrit.wikimedia.org/r/1049664

Change #1049664 merged by Marostegui:

[operations/puppet@production] m2 proxies: Test db1228

https://gerrit.wikimedia.org/r/1049664

I have tested this patch and everything is okay with db1228

Mentioned in SAL (#wikimedia-operations) [2024-07-01T04:51:53Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2133,2160].codfw.wmnet,db[1195,1217,1228].eqiad.wmnet with reason: m2 switchover T368494

Mentioned in SAL (#wikimedia-operations) [2024-07-01T04:52:09Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2133,2160].codfw.wmnet,db[1195,1217,1228].eqiad.wmnet with reason: m2 switchover T368494

Change #1050814 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1228 to m2 master

https://gerrit.wikimedia.org/r/1050814

Change #1050814 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1228 to m2 master

https://gerrit.wikimedia.org/r/1050814

Mentioned in SAL (#wikimedia-operations) [2024-07-01T04:56:42Z] <marostegui> Failover m2 from db1195 to db1228 - T368494

Icinga downtime and Alertmanager silence (ID=08096cfd-66bb-4f20-bf65-7daae1100319) set by marostegui@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Reboot

db1195.eqiad.wmnet