
Switchover m5 master (db1183 -> db1176)
Closed, Resolved (Public)

Description

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

When: Thursday 9th March 2023 at 16:00 UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1183

NEW MASTER: db1176

  • Check configuration differences between new and old master

$ pt-config-diff h=db1176.eqiad.wmnet,F=/root/.my.cnf h=db1183.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1176

db-switchover --timeout=15 --only-slave-move db1183.eqiad.wmnet db1176.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1183 to db1176 - T330847
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1183 db1176

  • Reload haproxies (dbproxy1021 is the active one)

dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio

  • Kill connections on the old master (db1183)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1176 and db1183

sudo cumin 'db1176* or db1183*' 'run-puppet-agent -e "primary switchover T330847"'

  • Check affected services
  • Clean the Orchestrator heartbeat to remove the old master's row, otherwise Orchestrator will show lag

delete from heartbeat where server_id=171970778

  • Close this ticket and create a ticket to move db1183 somewhere else: db1183 will be moved to m1: T330977
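
For the haproxy reload step above, a small filter can make the "show stat" output easier to check at a glance. This is only a sketch and was not part of the procedure as run: haproxy_server_status is a hypothetical helper name, and it assumes the CSV layout haproxy prints for "show stat" (a header row prefixed with "# " that includes a "status" column).

```shell
# Hypothetical helper (not part of the original procedure): reads the CSV
# produced by haproxy's "show stat" on stdin and prints
# "<proxy>/<server> <status>" for each real server row.
haproxy_server_status() {
    awk -F, '
        # The header line starts with "# "; locate the "status" column by name
        # instead of hard-coding its position.
        /^# / {
            sub(/^# /, "", $1)
            for (i = 1; i <= NF; i++) if ($i == "status") col = i
            next
        }
        # Skip blank lines and the FRONTEND/BACKEND aggregate rows.
        col && NF > 1 && $2 != "FRONTEND" && $2 != "BACKEND" {
            print $1 "/" $2, $col
        }
    '
}
```

On a proxy it could be chained onto the command already used in this task: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status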

Event Timeline

Marostegui changed the task status from Open to Stalled. Mar 1 2023, 10:31 AM
Marostegui triaged this task as Medium priority.

This needs to happen after 7th March once T329073 is finished

Marostegui moved this task from Triage to Blocked on the DBA board.

@bd808 @Andrew @Legoktm would Thursday 9th at 16:00 UTC work for you all?

> would Thursday 9th at 16:00 UTC work for you all?

That date and time work for me.

cc. @Raymond_Ndibe in case you want to try maintaindbusers at that time (uses labsdbaccounts)

I am going to schedule this on Thursday 9th at 16:00 UTC - if someone has objections, please let me know!

Marostegui updated the task description.
Marostegui changed the task status from Stalled to Open. Mar 8 2023, 8:43 AM
Marostegui moved this task from Ready to In progress on the DBA board.

Change 895908 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] m5-proxies: Add db1176 for testing

https://gerrit.wikimedia.org/r/895908

Change 895908 merged by Marostegui:

[operations/puppet@production] m5-proxies: Add db1176 for testing

https://gerrit.wikimedia.org/r/895908

Checked that haproxy sees db1176 just fine

Change 895910 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1176 to m5 master

https://gerrit.wikimedia.org/r/895910

Mentioned in SAL (#wikimedia-operations) [2023-03-09T15:02:00Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:30:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: m5 master switch T330847

Mentioned in SAL (#wikimedia-operations) [2023-03-09T15:02:16Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db[2135,2160].codfw.wmnet,db[1117,1176,1183].eqiad.wmnet with reason: m5 master switch T330847

Change 895910 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1176 to m5 master

https://gerrit.wikimedia.org/r/895910

All the pre-failover steps are done. Waiting for 16:00 UTC to perform the actual switch.

Mentioned in SAL (#wikimedia-operations) [2023-03-09T16:00:09Z] <marostegui> Failover m5 from db1183 to db1176 - T330847

This is done. The read-only (RO) time was around 15 seconds.
Thanks @bd808 for the support!