
Switchover m1 master (db1164 -> db1101)
Closed, Resolved (Public)

Description

db1164 is in row B, where a switch maintenance with hard downtime is scheduled.

When: Monday 27th March at 08:00 AM UTC
Impact: Read only for a few seconds on the services below:

Services running on m1:

  • bacula
  • cas (and cas staging)
  • backups
  • etherpad
  • librenms
  • pki
  • rt

Switchover steps:

OLD MASTER: db1164

NEW MASTER: db1101

  • Check configuration differences between new and old master:

pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1101.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1101

db-switchover --timeout=1 --only-slave-move db1164.eqiad.wmnet db1101.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m1 from db1164 to db1101 - T331510

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1164 db1101

  • Reload haproxies

dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1164)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1101 and db1164

sudo cumin 'db1164* or db1101*' 'run-puppet-agent -e "primary switchover T331510"'
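
The "Reload haproxies" step above pipes `show stat` into the haproxy socket, which returns a wide CSV table that is hard to read by eye. The following is a minimal sketch, not part of the original procedure, of a filter that reduces that CSV to one line per backend server with its status. The function name `haproxy_server_status` is hypothetical, and the sketch assumes the output starts with the usual `# pxname,svname,...` header line so that columns can be looked up by name instead of by position.

```shell
#!/usr/bin/env bash
# Sketch (assumption: not part of the original runbook).
# Reads haproxy "show stat" CSV on stdin and prints "<proxy>/<server> <status>"
# for every real server row. Column positions are taken from the header line.
haproxy_server_status() {
  awk -F, '
    /^# / {
      # Header line: strip the leading "# " and remember each column index.
      sub(/^# /, "", $1)
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Only act once the header has been seen; skip blank lines and the
    # FRONTEND/BACKEND aggregate rows so that only individual servers remain.
    ("svname" in col) && NF > 1 &&
    $(col["svname"]) != "FRONTEND" && $(col["svname"]) != "BACKEND" {
      print $(col["pxname"]) "/" $(col["svname"]) " " $(col["status"])
    }
  '
}
```

On a proxy it could be used as `echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status`, reusing the socket path from the step above. Which proxy and server names appear depends on the local haproxy configuration.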

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui added a project: DBA.
Marostegui updated the task description.
Marostegui updated the task description.

When: Tuesday 14th March at 07:00 AM UTC

Given that the switch maintenance has been moved to the 28th, I am going to move this switchover to the 21st, so that @jcrespo is around to handle the bacula database (which lives on m1).

Once this ticket is done, the following day we need to create a "revert" to go back to db1164, as db1101 would be affected by the row C maintenance (T331882) the following week.

When: Monday 27th March at 09:00 AM UTC

Change 902572 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1101 to m1 master

https://gerrit.wikimedia.org/r/902572

Change 902574 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1012,dbproxy1014: Test db1101 on proxies

https://gerrit.wikimedia.org/r/902574

Change 902574 merged by Marostegui:

[operations/puppet@production] dbproxy1012,dbproxy1014: Test db1101 on proxies

https://gerrit.wikimedia.org/r/902574

Mentioned in SAL (#wikimedia-operations) [2023-03-27T05:14:21Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510

Mentioned in SAL (#wikimedia-operations) [2023-03-27T05:14:37Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1101,1117,1164].eqiad.wmnet with reason: m1 master switch T331510

Marostegui updated the task description.

Topology changes made.

Change 903175 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] backups: Replace db1164 with db1101

https://gerrit.wikimedia.org/r/903175

Change 902572 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1101 to m1 master

https://gerrit.wikimedia.org/r/902572

Mentioned in SAL (#wikimedia-operations) [2023-03-27T07:39:50Z] <jynus> disabling puppet and shutding down bacula at backup1001 T331510

Mentioned in SAL (#wikimedia-operations) [2023-03-27T08:03:43Z] <marostegui> Failover m1 from db1164 to db1101 - T331510

Change 903175 merged by Marostegui:

[operations/puppet@production] backups: Replace db1164 with db1101

https://gerrit.wikimedia.org/r/903175

Change 903181 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1101: Make it master

https://gerrit.wikimedia.org/r/903181

Change 903181 merged by Marostegui:

[operations/puppet@production] db1101: Make it master

https://gerrit.wikimedia.org/r/903181

Mentioned in SAL (#wikimedia-operations) [2023-03-27T08:28:14Z] <jynus> restarting bacula at backup1001 T331510