Switchover m3 (phabricator) master db1213 -> db1250
Closed, Resolved, Public

Description

db1213 needs to be rebooted and upgraded

Databases on m3: phabricator
When: TBD
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1213

NEW MASTER: db1250

  • Check configuration differences between new and old master

$ pt-config-diff h=db1213.eqiad.wmnet,F=/root/.my.cnf h=db1250.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts: sudo cookbook sre.hosts.downtime --hours 1 -r "m3 master switchover T398818" 'A:db-section-m3'
  • Topology changes: move everything under db1250

db-switchover --timeout=15 --only-slave-move db1213.eqiad.wmnet db1250.eqiad.wmnet

run-puppet-agent && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m3 from db1213 to db1250 - T398818
  • Set phabricator in RO:
ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # restart database server
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --skip-slave-move db1213 db1250

  • Reload haproxies
dbproxy1026:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1028:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1213)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1250 and db1213
sudo cumin 'db1250* or db1213*' 'run-puppet-agent -e "primary switchover T398818"'
  • Check services affected: phabricator
  • Clean up the Orchestrator heartbeat by removing the old master's row, otherwise Orchestrator will show lag: delete from heartbeat.heartbeat where server_id=171970643;
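
After the "Reload haproxies" step, the raw "show stat" output is a wide CSV that is awkward to read by eye. A minimal sketch of a filter that prints one line per server with its status is below. The function name haproxy_server_status is hypothetical and not part of any existing tooling; it assumes only the standard HAProxy CSV layout, where the header line starts with "# " and names the pxname, svname and status columns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise HAProxy "show stat" CSV read from stdin.
# Columns are located by header name rather than by fixed position.
haproxy_server_status() {
  awk -F',' '
    # Header line looks like: "# pxname,svname,...,status,..."
    /^# / {
      sub(/^# /, "", $0)
      n = split($0, hdr, ",")
      for (i = 1; i <= n; i++) col[hdr[i]] = i
      next
    }
    # Skip blank lines.
    NF == 0 { next }
    # Ignore any data that arrives before a usable header.
    !("pxname" in col) || !("svname" in col) || !("status" in col) { next }
    {
      sv = $(col["svname"])
      # FRONTEND/BACKEND rows are aggregates, not individual servers.
      if (sv == "FRONTEND" || sv == "BACKEND") next
      printf "%s/%s %s\n", $(col["pxname"]), sv, $(col["status"])
    }
  '
}
```

On a proxy it could be chained onto the existing command, for example: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_server_status. The new master should then show as UP in the backend.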

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

@Aklapper is there any time when we should avoid doing this switchover?
It would require around 1 minute of read-only time.

Marostegui added a parent task: Restricted Task. Jul 7 2025, 12:17 PM

Not sure when we'll attempt T370266: Update to Phorge upstream 2024.35 release, which will mean 1h-2h of downtime; @brennen should know best (it might be in the SRE window on Tuesdays starting at 15:00 UTC, or at some other time).
Apart from that, I'm not aware of any timing restrictions or conflicts.

Thanks @Aklapper - my switchover will probably take place early in the European morning (somewhere between 05:00 and 06:00 UTC). I will wait for @brennen so that we avoid migrating and switching over on the same day; if something goes wrong, that reduces the number of variables to investigate.

I'd like to aim for 2025-07-14 for the migration.

Gah, sorry, correction: 2025-07-15. The 14th is a Monday; I'm aiming for Tuesday the 15th.

Mentioned in SAL (#wikimedia-operations) [2025-07-09T06:14:33Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1213,1217,1250].eqiad.wmnet with reason: m3 master switchover T398818

Change #1167329 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] m3 proxies: Add db1250

https://gerrit.wikimedia.org/r/1167329

Change #1167329 merged by Marostegui:

[operations/puppet@production] m3 proxies: Add db1250

https://gerrit.wikimedia.org/r/1167329

Mentioned in SAL (#wikimedia-operations) [2025-07-09T06:20:26Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2160,2234].codfw.wmnet,db[1213,1217,1250].eqiad.wmnet with reason: m3 master switchover T398818

Change #1167375 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1250 to m3 master

https://gerrit.wikimedia.org/r/1167375

Change #1167375 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1250 to m3 master

https://gerrit.wikimedia.org/r/1167375

Mentioned in SAL (#wikimedia-operations) [2025-07-09T06:29:19Z] <marostegui> Failover m3 from db1213 to db1250 - T398818

Marostegui raised the priority of this task from Medium to Needs Triage. Jul 9 2025, 6:32 AM

test

Marostegui triaged this task as Medium priority. Jul 9 2025, 6:32 AM
Marostegui moved this task from In progress to Done on the DBA board.

test

Marostegui updated the task description.

This was done; the read-only period lasted 30 seconds.