
Switchover m5 master (db1176 -> db1106)
Closed, Resolved (Public)

Description

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

When: Thursday 16th at 08:00 UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1176

NEW MASTER: db1106

  • Check configuration differences between new and old master

$ pt-config-diff h=db1106.eqiad.wmnet,F=/root/.my.cnf h=db1176.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --minutes 60 -r "m5 master switch T331877" 'A:db-section-m5'
  • Topology changes: move everything under db1106

db-switchover --timeout=15 --only-slave-move db1176.eqiad.wmnet db1106.eqiad.wmnet

  • Disable puppet on db1106 and db1176: sudo cumin 'db1106* or db1176*' 'disable-puppet "primary switchover T331877"'
  • Merge gerrit: https://gerrit.wikimedia.org/r/899571
  • Run puppet on dbproxy1017 and dbproxy1021 and check the config

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1176 to db1106 - T331877
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1176 db1106

  • Reload haproxies (dbproxy1021 is the active one)
dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1176)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on the old and new masters (for heartbeat), db1106 and db1176: sudo cumin 'db1106* or db1176*' 'run-puppet-agent -e "primary switchover T331877"'
  • Check affected services
  • Clean the Orchestrator heartbeat to remove the old master's row, otherwise Orchestrator will show lag: delete from heartbeat where server_id=171966607
  • Upgrade db1176 to mariadb 10.6 and create another ticket for the switchover back: T322294
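
For the haproxy reload step above, the raw "show stat" output is a wide CSV that is hard to read at a glance. The following is a minimal sketch, not part of the original checklist, that reduces it to one "proxy/server status" line per entry. The function name haproxy_status is hypothetical; the only assumption about the input is that it starts with HAProxy's usual "# pxname,svname,..." header line, from which the status column is located by name.

```shell
# Sketch only: summarize HAProxy "show stat" CSV read from stdin.
# haproxy_status is a hypothetical helper name, not an existing tool.
haproxy_status() {
  awk -F, '
    # Header line looks like "# pxname,svname,...,status,...".
    /^# / {
      sub(/^# /, "")                      # strip the leading "# " so $1 is "pxname"
      for (i = 1; i <= NF; i++)
        if ($i == "status") col = i       # remember which column holds the status
      next
    }
    # Data lines: print "pxname/svname status" once the column is known.
    col && NF >= col { print $1 "/" $2, $col }
  '
}
```

On a proxy it could be used as: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status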

Event Timeline

Marostegui triaged this task as Medium priority. (Mar 13 2023, 2:07 PM)
Marostegui created this task.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a subscriber: Legoktm.

@bd808 given that all the previous switchovers went well, if you are ok with this, I will do this at 08:00 UTC so you don't have to be around.
If needed, I can also restart mailman. cc @Legoktm

Change 899402 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] m5: Replace the standby host

https://gerrit.wikimedia.org/r/899402

Change 899402 merged by Marostegui:

[operations/puppet@production] m5: Replace the standby host

https://gerrit.wikimedia.org/r/899402

I have tested the future master (db1106), which runs 10.6, against the proxies. It gets detected just fine.

@bd808 I am going to do this now, so tomorrow morning I can revert it.

Mentioned in SAL (#wikimedia-operations) [2023-03-15T12:11:58Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: m5 master switch T331877

Mentioned in SAL (#wikimedia-operations) [2023-03-15T12:12:15Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: m5 master switch T331877

Change 899570 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1106: Enable notifications

https://gerrit.wikimedia.org/r/899570

Change 899570 merged by Marostegui:

[operations/puppet@production] db1106: Enable notifications

https://gerrit.wikimedia.org/r/899570

Change 899571 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1106 to m5 master

https://gerrit.wikimedia.org/r/899571

Change 899571 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1106 to m5 master

https://gerrit.wikimedia.org/r/899571

Mentioned in SAL (#wikimedia-operations) [2023-03-15T12:18:05Z] <marostegui> Failover m5 from db1176 to db1106 - T331877

Marostegui updated the task description.

This was done; read-only (RO) time was around 15 seconds.