Switchover m3 master db1164 -> db1159
Closed, Resolved, Public

Description

Databases on m3: phabricator
When: Thursday 9th Feb 07:00 AM UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1164

NEW MASTER: db1159

  • Check configuration differences between new and old master

$ pt-config-diff h=db1164.eqiad.wmnet,F=/root/.my.cnf h=db1159.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1159

db-switchover --timeout=15 --only-slave-move db1164.eqiad.wmnet db1159.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m3 from db1164 to db1159 - T329141
  • Set phabricator in RO:
ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # restart database server
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --read-only-master --skip-slave-move db1164 db1159

  • Reload haproxies
dbproxy1016:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1164)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1159 and db1164

sudo cumin 'db1159* or db1164*' 'run-puppet-agent -e "primary switchover T329141"'

  • Check services affected: phabricator
  • Clean the Orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag:

delete from heartbeat where server_id=171970746

  • Close this ticket and create a ticket to move db1164 to m1: T329143
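
For the "Reload haproxies" step, the raw "show stat" output is a wide CSV that is hard to read at a glance. A minimal sketch of a filter that keeps only the proxy name, server name and status columns is shown here. The function name haproxy_status is hypothetical and not part of the original procedure; the column names (pxname, svname, status) are read from the CSV header line that HAProxy prints, so no column positions are hard-coded.

```shell
#!/bin/sh
# haproxy_status (hypothetical helper): reduce HAProxy "show stat" CSV
# output to "<pxname> <svname> <status>" per line.
# Column positions are looked up from the "# pxname,svname,..." header.
haproxy_status() {
    awk -F',' '
        /^# / {
            # Header line: strip the leading "# " and map column name -> index.
            sub(/^# /, "")
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data lines: only print once the header has been seen; skip blanks.
        ("status" in col) && NF > 1 {
            print $(col["pxname"]), $(col["svname"]), $(col["status"])
        }
    '
}
```

Assuming the same socket path as in the step above, it could be used on each proxy as: echo "show stat" | socat /run/haproxy/haproxy.sock stdio | haproxy_status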

Event Timeline

Marostegui triaged this task as Medium priority. Feb 8 2023, 8:39 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

The reason for this: T329013#8596654

Change 887724 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1159 to m3 mater

https://gerrit.wikimedia.org/r/887724

Change 887725 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1159: Enable notifications

https://gerrit.wikimedia.org/r/887725

Change 887725 merged by Marostegui:

[operations/puppet@production] db1159: Enable notifications

https://gerrit.wikimedia.org/r/887725

Change 887724 abandoned by Marostegui:

[operations/puppet@production] mariadb: Promote db1159 to m3 mater

Reason:

I need to do more changes before

https://gerrit.wikimedia.org/r/887724

Change 887727 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1159 to m3 mater

https://gerrit.wikimedia.org/r/887727

Change 887727 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1159 to m3 mater

https://gerrit.wikimedia.org/r/887727

Mentioned in SAL (#wikimedia-operations) [2023-02-09T07:00:06Z] <marostegui> Failover m3 from db1164 to db1159 - T329141

Marostegui moved this task from In progress to Done on the DBA board.
Marostegui updated the task description.

This was done