
Switchover m5 master (db1132 -> db1107)
Closed, Resolved (Public)

Description

Databases on m5:

labsdbaccounts
mailman3
mailman3web
striker
test_labsdbaccounts
toolhub

When: 9th March - 14:00 UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1132

NEW MASTER: db1107

  • Check configuration differences between new and old master

$ pt-config-diff h=db1107.eqiad.wmnet,F=/root/.my.cnf h=db1132.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1107

db-switchover --timeout=15 --only-slave-move db1132.eqiad.wmnet db1107.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m5 from db1132 to db1107 - T302190
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --read-only-master --skip-slave-move db1132 db1107

  • Reload haproxies (dbproxy1021 is the active one)
dbproxy1017:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1021:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1132)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1107 and db1132

puppet agent --enable && puppet agent -tv

  • Check affected services
  • Clean the orchestrator heartbeat to remove the old master's entry, otherwise Orchestrator will show lag:

delete from heartbeat where server_id=171970595

  • Close this ticket and create a ticket to move db1132 somewhere else: db1132 will be moved to s1: T301879#7759181
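
The haproxy reload step above pipes "show stat" through socat, which prints a wide CSV that is hard to read at a glance. As a rough sketch (not part of the original procedure), a small filter can reduce it to one line per proxy/server pair with its status. The function name haproxy_status is hypothetical, and the filter assumes the first line of output is the "# pxname,svname,..." header that HAProxy emits for "show stat".

```shell
# Hypothetical helper: summarise HAProxy "show stat" CSV read from stdin.
# Locates the "status" column by name from the header line, so it does not
# depend on a fixed column position.
haproxy_status() {
  awk -F, '
    NR == 1 {
      sub(/^# /, "", $1)                     # header line starts with "# "
      for (i = 1; i <= NF; i++) if ($i == "status") col = i
      next
    }
    # Print "proxy/server status" for every data row that has the column.
    col && NF >= col { print $1 "/" $2, $col }
  '
}
```

On a proxy it could be chained onto the command already used in this task, for example: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status. After the reload, the row for db1107 would be the one to check.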

Event Timeline

Marostegui triaged this task as Medium priority. Feb 21 2022, 6:54 AM
Marostegui moved this task from Triage to Ready on the DBA board.

@Legoktm @bd808 @Andrew @Ladsgroup - ok to schedule this for 9th March at 08:00 AM UTC?
I am adding you all here as service owners for toolhub, striker/labsdbaccounts and mailman

ok to schedule this for 9th March at 08:00 AM UTC?

08:00 UTC is 01:00 in my local timezone. I don't think I can commit to being awake and able to help debug anything at that time. 14:00 UTC is probably as early as I can comfortably commit to being around for (2 hours before my normal start time). That being said, I don't think there is really much I'm good for in these switchovers other than trying to use the apps to see if they throw database errors.

I am sorry @bd808, I had the "default" time for other switchovers and didn't think about toolhub. I can definitely do 14:00 UTC (cc @Legoktm, @Andrew, @Ladsgroup)
Thanks!

I can live with 14:00 UTC, thanks for adjusting

Change 768954 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1107 to m5 master

https://gerrit.wikimedia.org/r/768954

Change 768955 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1107: Enable notifications

https://gerrit.wikimedia.org/r/768955

Change 768955 merged by Marostegui:

[operations/puppet@production] db1107: Enable notifications

https://gerrit.wikimedia.org/r/768955

Marostegui updated the task description.

Change 768954 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1107 to m5 master

https://gerrit.wikimedia.org/r/768954

Mentioned in SAL (#wikimedia-operations) [2022-03-09T14:01:33Z] <marostegui> Failover m5 from db1132 to db1107 - T302190

Marostegui updated the task description.

This was done, thanks @bd808 @Andrew and @Legoktm for being around to check and test the affected services.