
Switchover m1 master (db1164 -> db1195)
Closed, ResolvedPublic

Description

db1164 needs to be rebooted.
Let's promote db1195 to master.

When: Thursday 25th at 08:30AM UTC
Impact: Read only for a few seconds on the services below:

Services running on m1:

  • bacula
  • cas (and cas staging)
  • backups
  • etherpad
  • librenms
  • pki
  • rt

Switchover steps:

OLD MASTER: db1164

NEW MASTER: db1195

Check configuration differences between new and old master

  • $ pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1195.eqiad.wmnet,F=/root/.my.cnf
  • Silence alerts on all hosts
  • Topology changes: move everything under db1195

db-switchover --timeout=1 --only-slave-move db1164.eqiad.wmnet db1195.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover

!log Failover m1 from db1164 to db1195 - T315864

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1164 db1195
  • Reload haproxies
dbproxy1012:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1014:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1164)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1164 and db1195

puppet agent --enable && run-puppet-agent
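Taken together, the plan above is a fixed sequence. A minimal dry-run sketch that only prints the ordered commands for review (commands copied from the steps above; nothing is executed):

```python
# Dry-run checklist for the m1 switchover plan above.
# This only prints the steps in order; it runs nothing.
STEPS = [
    ("diff configs", "pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf "
                     "h=db1195.eqiad.wmnet,F=/root/.my.cnf"),
    ("move replicas", "db-switchover --timeout=1 --only-slave-move "
                      "db1164.eqiad.wmnet db1195.eqiad.wmnet"),
    ("failover", "db-switchover --skip-slave-move db1164 db1195"),
    ("reload haproxies", "systemctl reload haproxy"),
    ("kill old-master connections", "pt-kill --print --kill --victims all "
                                    "--match-all F=/dev/null,S=/run/mysqld/mysqld.sock"),
    ("re-enable puppet", "puppet agent --enable && run-puppet-agent"),
]

for i, (name, cmd) in enumerate(STEPS, 1):
    print(f"{i}. {name}: {cmd}")
```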

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui added subscribers: jbond, ayounsi, jcrespo and 2 others.

@akosiaris @jcrespo @ayounsi @MoritzMuehlenhoff @jbond I would like to switchover m1 master on Thursday at 08:30 AM UTC. Expected impact would be a few seconds of RO time. Reads would remain unaffected.
Would that date/time work for you all?

For the IDPs any time is fine, the database in question is currently unused (and will only start to get used in the next weeks).

I am guessing that is "dbbackups" instead of "backups"?

I will disable ES cross-dc long-term backups to bacula now; they normally run on Thursdays from 02:05 until around 18:00, and will start once the maintenance is completed. This will allow me to shut down bacula with the least impact and risk.

I am guessing that is "dbbackups" instead of "backups"?

Yes, the backups service, which uses dbbackups.

Good for me (librenms) CCing @fgiunchedi and @andrea.denisse for visibility too

Mentioned in SAL (#wikimedia-operations) [2022-08-22T13:03:19Z] <jynus> disabled backup scheduling for backup1002, backup2002 T315864

Change 826220 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Add db1195 as sby host for m1

https://gerrit.wikimedia.org/r/826220

Change 826220 merged by Marostegui:

[operations/puppet@production] mariadb: Add db1195 as sby host for m1

https://gerrit.wikimedia.org/r/826220

Added db1195 as a standby host in haproxy, reloaded both of them and the host gets correctly marked as UP.
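The `show stat` command used when reloading the proxies returns haproxy's statistics as CSV over the admin socket, which makes the UP check scriptable. A small sketch of reading a server's state from that output (the sample lines and column subset are illustrative, not the real m1 stats):

```python
import csv
import io

# Illustrative "show stat" output as returned over the haproxy admin
# socket; real output has many more columns. The header line is
# prefixed with "# ".
SAMPLE_STAT = """\
# pxname,svname,status,weight
mariadb,db1164.eqiad.wmnet,UP,100
mariadb,db1195.eqiad.wmnet,UP,100
"""

def server_status(stat_csv, server):
    """Return the status column for a given backend server name, or None."""
    reader = csv.DictReader(io.StringIO(stat_csv.lstrip("# ")))
    for row in reader:
        if row["svname"] == server:
            return row["status"]
    return None

print(server_status(SAMPLE_STAT, "db1195.eqiad.wmnet"))  # prints "UP"
```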

Change 826222 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1195 to m1 master

https://gerrit.wikimedia.org/r/826222

Change 826223 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] backups: Replace m1 master

https://gerrit.wikimedia.org/r/826223

Mentioned in SAL (#wikimedia-operations) [2022-08-25T07:51:19Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Switchover m1 T315864

Mentioned in SAL (#wikimedia-operations) [2022-08-25T07:51:35Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Switchover m1 T315864

Change 826222 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1195 to m1 master

https://gerrit.wikimedia.org/r/826222

Mentioned in SAL (#wikimedia-operations) [2022-08-25T08:09:58Z] <marostegui> Reboot db1195 for kernel upgrade T315864

Mentioned in SAL (#wikimedia-operations) [2022-08-25T08:13:01Z] <jynus> stopping bacula services on backup1001 T315864

Mentioned in SAL (#wikimedia-operations) [2022-08-25T08:30:01Z] <marostegui> Failover m1 from db1164 to db1195 - T315864

Change 826223 merged by Marostegui:

[operations/puppet@production] dbbackups: Replace m1 master

https://gerrit.wikimedia.org/r/826223

Marostegui updated the task description.

All done.

dbbackups tested with x1 and s5 runs:

name                              status    type      dc     section  start date                 end date                   duration  total size
snapshot.s5.2022-08-25--12-43-53  finished  snapshot  codfw  s5       Aug. 25, 2022, 1:44 p.m.   Aug. 25, 2022, 2:35 p.m.   50m 57s   636.3 GB
snapshot.x1.2022-08-25--09-37-54  finished  snapshot  codfw  x1       Aug. 25, 2022, 10:14 a.m.  Aug. 25, 2022, 10:54 a.m.  40m 26s   377.9 GB
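For a rough sense of scale, the two test runs above imply the following throughput (sizes and durations taken from the table; decimal megabytes):

```python
# Snapshot throughput from the dbbackups test runs above:
# size in GB, duration converted to seconds.
runs = {
    "s5": (636.3, 50 * 60 + 57),   # 636.3 GB in 50m 57s
    "x1": (377.9, 40 * 60 + 26),   # 377.9 GB in 40m 26s
}

for section, (gb, seconds) in runs.items():
    # GB/s * 1000 = MB/s (decimal units)
    print(f"{section}: {gb / seconds * 1000:.0f} MB/s")
```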

bacula delayed jobs now running:

Running Jobs:
Console connected using TLS at 25-Aug-22 16:01
Console connected using TLS at 01-Jan-70 00:00
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
471408  Back Full          0         0  backup1002.eqiad.wmnet-Weekly-Thu-EsRwCodfw-mysql-srv-backups-dumps-latest is running
471409  Back Full          0         0  backup2002.codfw.wmnet-Weekly-Thu-EsRwEqiad-mysql-srv-backups-dumps-latest is running