
Switchover m1 master (db1195 -> db1176)
Closed, Resolved, Public

Description

db1195 needs to be rebooted.
Let's promote db1176 to master

When: Thursday 26th Jan
Impact: Read only for a few seconds on the services below:

Services running on m1:

  • bacula
  • cas (and cas staging)
  • backups
  • etherpad
  • librenms
  • pki
  • rt

Switchover steps:

OLD MASTER: db1195

NEW MASTER: db1176

  • Check configuration differences between new and old master

pt-config-diff h=db1195.eqiad.wmnet,F=/root/.my.cnf h=db1176.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1176

db-switchover --timeout=1 --only-slave-move db1195.eqiad.wmnet db1176.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m1 from db1195 to db1176 - T327800

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1195 db1176

  • Reload haproxies

dbproxy1012: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014: systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1195)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1195 and db1176

sudo cumin 'db1195* or db1176*' 'run-puppet-agent -e "primary switchover T327800"'

  • Check services affected (librenms, racktables, etherpad...)
  • Clean orchestrator heartbeat to remove the old master's entry: sudo db-mysql db1176 heartbeat -e "delete from heartbeat where file like 'db1195%';"
  • Merge backup patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/883705/
  • Create floating ticket for db1195 to be moved to m2: T327995
  • Update/resolve phabricator ticket about failover
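
For the "Reload haproxies" step above, the `show stat` output is a long CSV that is easy to misread by eye. The following is a minimal sketch, not part of the original procedure, of a filter that prints only the servers whose status is not UP. The function name `haproxy_not_up` is hypothetical; the column names `pxname`, `svname` and `status` are the ones HAProxy prints in the CSV header line, and the script looks them up there rather than assuming fixed positions.

```shell
#!/bin/sh
# Hypothetical helper (not from the runbook): read HAProxy "show stat" CSV on
# stdin and print every server whose status does not start with "UP".
# Exit status: 0 = all servers UP, 1 = at least one not UP, 2 = bad header.
haproxy_not_up() {
    awk -F, '
        NR == 1 {
            # The header line starts with "# pxname"; strip the "# " prefix.
            sub(/^# */, "", $1)
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("pxname" in col) || !("svname" in col) || !("status" in col)) {
                print "unexpected show stat header" > "/dev/stderr"
                err = 2
                exit 2
            }
            next
        }
        # Skip blank lines (the output ends with one).
        $0 == "" { next }
        # Skip the per-proxy FRONTEND/BACKEND summary rows.
        $col["svname"] == "FRONTEND" || $col["svname"] == "BACKEND" { next }
        # Anything else that is not UP (DOWN, MAINT, "no check", ...) is reported.
        $col["status"] !~ /^UP/ {
            print $col["pxname"] "/" $col["svname"] " " $col["status"]
            bad = 1
        }
        END {
            if (err) exit err
            exit (bad ? 1 : 0)
        }
    '
}
```

It could be chained onto the command already used on the proxies, for example `echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_not_up`. This only reports health-check status; it does not confirm which server the proxy is actually sending traffic to, so the config check (`cat /etc/haproxy/conf.d/db-master.cfg`) is still needed.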

Related Objects

Status: Resolved
Assigned: Marostegui

Event Timeline

Marostegui triaged this task as Medium priority. Tue, Jan 24, 4:29 PM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 883499 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1176: Enable notifications

https://gerrit.wikimedia.org/r/883499

Change 883499 merged by Marostegui:

[operations/puppet@production] db1176: Enable notifications

https://gerrit.wikimedia.org/r/883499

Mentioned in SAL (#wikimedia-operations) [2023-01-26T07:12:49Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1176,1195].eqiad.wmnet with reason: Primary switchover m1 T327800

Mentioned in SAL (#wikimedia-operations) [2023-01-26T07:12:54Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1176,1195].eqiad.wmnet with reason: Primary switchover m1 T327800

Change 883703 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1176 to m1 master

https://gerrit.wikimedia.org/r/883703

Change 883703 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1176 to m1 master

https://gerrit.wikimedia.org/r/883703

Mentioned in SAL (#wikimedia-operations) [2023-01-26T07:23:04Z] <marostegui> Failover m1 from db1195 to db1176 - T327800

Change 883705 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] monitoring.yaml: Change master for m1

https://gerrit.wikimedia.org/r/883705

Change 883705 merged by Jcrespo:

[operations/puppet@production] monitoring.yaml: Change master for m1

https://gerrit.wikimedia.org/r/883705

Marostegui updated the task description.

All done

Change 883727 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switchover m1 primary at which stats are pointing

https://gerrit.wikimedia.org/r/883727

Change 883727 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switchover m1 primary at which stats are pointing

https://gerrit.wikimedia.org/r/883727