Switchover m1 master (db1176 -> db1164)
Closed, ResolvedPublic

Description

db1176 is in row A, which will have a switch maintenance with hard downtime, so the m1 master needs to be switched over to db1164 first.

When: Monday 13th Feb - 11AM UTC
Impact: Read-only for a few seconds on the services below:

Services running on m1:

  • bacula
  • cas (and cas staging)
  • backups
  • etherpad
  • librenms
  • pki
  • rt

Switchover steps:

OLD MASTER: db1176

NEW MASTER: db1164

  • Check configuration differences between new and old master

$ pt-config-diff h=db1176.eqiad.wmnet,F=/root/.my.cnf h=db1164.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1164

db-switchover --timeout=1 --only-slave-move db1176.eqiad.wmnet db1164.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m1 from db1176 to db1164 - T329259

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1176 db1164

  • Reload haproxies

dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
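
The raw "show stat" output is a wide CSV, so a small filter can make the post-reload check easier to read. This is only a sketch: the function name haproxy_status_summary is made up for illustration, and it assumes the CSV header contains a column named "status", which it looks up by name instead of by position.

```shell
# Summarise HAProxy "show stat" CSV read from stdin.
# Prints one "proxy/server status" line per data row.
# Assumption: the header line names a column "status"; it is located by name.
haproxy_status_summary() {
    awk -F, '
        NR == 1 {
            sub(/^# /, "", $1)          # the header line starts with "# "
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            next
        }
        col && NF >= col { print $1 "/" $2 " " $col }
    '
}
```

On each dbproxy it could be chained onto the command above: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status_summary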

  • Kill connections on the old master (db1176)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
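
Before killing anything it can be useful to see what would be matched. A minimal sketch, assuming a hypothetical helper named pt_kill_cmd: it only builds the command line, and in dry-run mode it leaves out --kill so that pt-kill prints the matching connections without killing them. The options and DSN are the ones from the command above; the helper itself and its mode names are not part of any existing tooling.

```shell
# Build a pt-kill command line for the old master.
# "dry-run" passes only --print (connections are listed, not killed);
# "kill" reproduces the command from the checklist.
pt_kill_cmd() {
    mode=${1:-dry-run}
    socket=${2:-/run/mysqld/mysqld.sock}
    case "$mode" in
        dry-run) action="--print" ;;
        kill)    action="--print --kill" ;;
        *) echo "usage: pt_kill_cmd [dry-run|kill] [socket]" >&2; return 1 ;;
    esac
    echo "pt-kill $action --victims all --match-all F=/dev/null,S=$socket"
}
```

Running pt_kill_cmd on its own prints the dry-run variant; pt_kill_cmd kill prints the full command, which can then be reviewed and run on db1176.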

  • Restart puppet on the old and new masters (for heartbeat): db1164 and db1176

sudo cumin 'db1176* or db1164*' 'run-puppet-agent -e "primary switchover T329259"'

  • Check services affected (librenms, racktables, etherpad...)
  • Clean the orchestrator heartbeat table to remove the old master's entry: sudo db-mysql db1164 heartbeat -e "delete from heartbeat where file like 'db1176%';"
  • Merge backup patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/887885/
  • Create floating ticket for db1176 to be moved to m5: T329478
  • Update/resolve phabricator ticket about failover
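
For the heartbeat clean-up step in the list above, the statement can be generated from the old master's name and previewed as a SELECT before deleting. This is a sketch only: heartbeat_cleanup_sql and heartbeat_preview_sql are hypothetical helper names, and the dbNNNN host-name check is an assumption based on the hosts in this task.

```shell
# Print the heartbeat clean-up statement for a given old master.
# Assumption: host names look like dbNNNN; anything else is rejected so
# that nothing unexpected ends up inside the LIKE pattern.
heartbeat_cleanup_sql() {
    old_master=$1
    case "$old_master" in
        db[0-9][0-9][0-9][0-9]) ;;
        *) echo "unexpected host name: $old_master" >&2; return 1 ;;
    esac
    printf "delete from heartbeat where file like '%s%%';\n" "$old_master"
}

# Same WHERE clause as a SELECT, to preview the rows before deleting them.
heartbeat_preview_sql() {
    sql=$(heartbeat_cleanup_sql "$1") || return 1
    echo "select * from${sql#delete from}"
}
```

Usage would follow the command already in the checklist, for example: sudo db-mysql db1164 heartbeat -e "$(heartbeat_preview_sql db1176)" and, once the rows look right, the same with heartbeat_cleanup_sql.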

Event Timeline

Marostegui triaged this task as Medium priority. (Feb 9 2023, 8:30 AM)
Marostegui added a project: DBA.
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

Change 887885 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] monitoring.yaml: Replace m1 master

https://gerrit.wikimedia.org/r/887885

Change 888359 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1164 to m1 master

https://gerrit.wikimedia.org/r/888359

Change 888395 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Test db1164 in m1

https://gerrit.wikimedia.org/r/888395

Change 888395 merged by Marostegui:

[operations/puppet@production] mariadb: Test db1164 in m1

https://gerrit.wikimedia.org/r/888395

Mentioned in SAL (#wikimedia-operations) [2023-02-13T06:59:25Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259

Mentioned in SAL (#wikimedia-operations) [2023-02-13T06:59:41Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259

Change 888359 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1164 to m1 master

https://gerrit.wikimedia.org/r/888359

Mentioned in SAL (#wikimedia-operations) [2023-02-13T10:06:40Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259

Mentioned in SAL (#wikimedia-operations) [2023-02-13T10:06:56Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1176].eqiad.wmnet with reason: Primary switchover m1 T329259

Mentioned in SAL (#wikimedia-operations) [2023-02-13T10:16:28Z] <jynus> stopping bacula and disabling puppet at backup1001 for m1 switchover T329259

Mentioned in SAL (#wikimedia-operations) [2023-02-13T11:00:01Z] <marostegui> Failover m1 from db1176 to db1164 - T329259

Change 887885 merged by Marostegui:

[operations/puppet@production] dbbackups: Replace m1 master

https://gerrit.wikimedia.org/r/887885

Marostegui updated the task description.

This was done