
Switchover m1 master (db1159 -> db1128)
Closed, Resolved (Public)

Description

db1159 needs to be reimaged to Bullseye.
Let's promote db1128 to master.

When: Thursday 27th at 10AM UTC
Impact: Read-only for a few seconds for the services below:

Services running on m1:

  • bacula
  • cas (and cas staging)
  • backups
  • etherpad
  • librenms
  • pki
  • rt

Switchover steps:

OLD MASTER: db1159

NEW MASTER: db1128

Check configuration differences between new and old master

  • $ pt-config-diff h=db1159.eqiad.wmnet,F=/root/.my.cnf h=db1128.eqiad.wmnet,F=/root/.my.cnf
  • Silence alerts on all hosts
  • Topology changes: move everything under db1128

db-switchover --timeout=1 --only-slave-move db1159.eqiad.wmnet db1128.eqiad.wmnet
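
Once the slave move is done, a quick sanity check that the topology looks as expected (a sketch; assumes report_host is set on the replicas so SHOW SLAVE HOSTS is populated):

# On db1128: the other replicas should now be attached here
mysql -h db1128.eqiad.wmnet -e "SHOW SLAVE HOSTS"
# On db1159: only db1128 should remain as a direct replica
mysql -h db1159.eqiad.wmnet -e "SHOW SLAVE HOSTS"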

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg
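
For the "Silence alerts on all hosts" step, a hedged sketch using the sre.hosts.downtime cookbook (the SAL entries further down show it being used for this switchover; the exact flags here are an assumption):

# Downtime the m1 hosts for an hour from a cumin host (host list taken from the SAL entry below)
cookbook sre.hosts.downtime --hours 1 --reason "Primary switchover m1 T299624" 'db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet'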

  • Start the failover

!log Failover m1 from db1159 to db1128 - T299624

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1159 db1128
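
For reference, a hedged sketch of roughly what the db-switchover step boils down to if done by hand (simplified; the tool also handles timeouts, replication coordinate checks and the topology/heartbeat updates):

# 1. Close writes on the old master
mysql -h db1159.eqiad.wmnet -e "SET GLOBAL read_only=1"
# 2. Wait for the new master to apply everything, then detach it
mysql -h db1128.eqiad.wmnet -e "SELECT MASTER_POS_WAIT('<file>', <pos>)"   # placeholders, not real coordinates
mysql -h db1128.eqiad.wmnet -e "STOP SLAVE; RESET SLAVE ALL"
# 3. Open writes on the new master
mysql -h db1128.eqiad.wmnet -e "SET GLOBAL read_only=0"
# 4. Make the old master replicate from the new one (credentials/coordinates elided)
mysql -h db1159.eqiad.wmnet -e "CHANGE MASTER TO MASTER_HOST='db1128.eqiad.wmnet', ...; START SLAVE"
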
  • Reload haproxies
dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
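
The "show stat" output is raw CSV; a hedged one-liner to make the backend state readable (field positions assume the standard haproxy stats CSV layout, where status is field 18):

echo "show stat" | socat /run/haproxy/haproxy.sock stdio | cut -d, -f1,2,18 | column -s, -t
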
  • Kill connections on the old master (db1159)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock
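
Afterwards, a quick check that nothing besides system and replication threads is still connected to the old master:

# On db1159
mysql -S /run/mysqld/mysqld.sock -e "SHOW PROCESSLIST"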

  • Restart puppet on old and new masters (for heartbeat): db1128 and db1159

puppet agent --enable && run-puppet-agent
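
To confirm pt-heartbeat is writing from the new master again, a hedged check (assumes the standard heartbeat.heartbeat schema; column names may differ slightly in our setup):

# The freshest row should carry db1128's server_id and a current timestamp
mysql -h db1128.eqiad.wmnet -e "SELECT server_id, ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 5"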

  • Check services affected (librenms, racktables, etherpad...)
  • Clean the orchestrator heartbeat to remove the old master's entry (see the sketch after this list).
  • Merge: https://gerrit.wikimedia.org/r/755960
  • Create floating ticket for db1159 to be moved to m2: T300243
  • Update/resolve phabricator ticket about failover
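
For the orchestrator heartbeat cleanup step above, a hedged sketch of what is usually meant: removing the old master's stale row from the heartbeat table so it does not linger after the switch (the server_id value is a placeholder; double-check against the runbook before deleting anything):

# Hypothetical cleanup of the stale heartbeat row written by db1159
mysql -h db1128.eqiad.wmnet -e "DELETE FROM heartbeat.heartbeat WHERE server_id = <db1159_server_id>"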

Event Timeline

Marostegui changed the task status from Open to Stalled. Jan 20 2022, 7:56 AM
Marostegui moved this task from Triage to Blocked on the DBA board.

Stalling this until db1128 is installed and populated with data.
Once this is done, I will add service owners to the task so we can arrange a date for everyone.

Marostegui triaged this task as Medium priority. Jan 20 2022, 7:56 AM

@jcrespo @MoritzMuehlenhoff @jbond @akosiaris @ayounsi I would like to do this master switchover on Thursday 27th at 10AM UTC. I expect just a few seconds of read-only time.
Would this date work for you all?

Ack, sounds good!

+1 as owner of the dbbackups and bacula9 databases (and possibly something else there that I'm not sure about).

With my backup hat on, I believe I already have the answer, but just to double-confirm: the end state will be the same secondary database (no need to change anything regarding backup sources, right?).

Correct Jaime, nothing will change on the backups front!

As this date works for everyone, I will do this on Thursday 27th at 10AM UTC.

Marostegui changed the task status from Stalled to Open. Jan 21 2022, 11:28 AM
Marostegui updated the task description.

Change 755960 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Manually switchover primary stats db db1159 -> db1128

https://gerrit.wikimedia.org/r/755960

^@Marostegui I just remembered that dbbackups points to db1159, and not the proxy, due to the current TLS certificate limitation and the worry about sensitive data being accessed cross-datacenter. It will have to be deployed after the switchover. I can do it, but I'm involving you in case I don't happen to be around.

This is an exception we discussed in the past, and one that should eventually be solved with a different TLS certificate workflow (there is nothing other than that preventing the use of the proxy).

No problem, I can take care of that; I will add it to the list of steps.

Change 757388 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1128: Enable notifications

https://gerrit.wikimedia.org/r/757388

Change 757388 merged by Marostegui:

[operations/puppet@production] db1128: Enable notifications

https://gerrit.wikimedia.org/r/757388

Change 757389 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1128 to m1 master

https://gerrit.wikimedia.org/r/757389

Mentioned in SAL (#wikimedia-operations) [2022-01-27T09:23:38Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet with reason: Primary switchover m1 T299624

Mentioned in SAL (#wikimedia-operations) [2022-01-27T09:23:43Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2078,2132].codfw.wmnet,db[1117,1128,1159].eqiad.wmnet with reason: Primary switchover m1 T299624

Change 757389 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1128 to m1 master

https://gerrit.wikimedia.org/r/757389

Mentioned in SAL (#wikimedia-operations) [2022-01-27T09:57:38Z] <jynus> Stopped Bacula Director Daemon service at backup1001 T299624

Mentioned in SAL (#wikimedia-operations) [2022-01-27T10:00:02Z] <marostegui> Failover m1 from db1159 to db1128 - T299624

Change 755960 merged by Marostegui:

[operations/puppet@production] dbbackups: Manually switchover primary stats db db1159 -> db1128

https://gerrit.wikimedia.org/r/755960

Change 757619 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1159: Disable notifications

https://gerrit.wikimedia.org/r/757619

etherpad needed to be restarted.

Change 757619 merged by Marostegui:

[operations/puppet@production] db1159: Disable notifications

https://gerrit.wikimedia.org/r/757619

Mentioned in SAL (#wikimedia-operations) [2022-01-27T10:17:48Z] <jynus> Started Bacula Director Daemon service at backup1001 T299624

Marostegui updated the task description.
Marostegui updated the task description.

All done