Switchover m3 master (db1107 -> db1183)
Closed, Resolved (Public)

Description

Databases on m3: phabricator
When: Tuesday, 15 February 2022 at 08:00 UTC
Impact: Writes will be disabled for around 1 minute.

Failover process

OLD MASTER: db1107

NEW MASTER: db1183

  • Check configuration differences between new and old master

$ pt-config-diff h=db1107.eqiad.wmnet,F=/root/.my.cnf h=db1183.eqiad.wmnet,F=/root/.my.cnf

  • Silence alerts on all hosts
  • Topology changes: move everything under db1183

db-switchover --timeout=15 --only-slave-move db1107.eqiad.wmnet db1183.eqiad.wmnet

puppet agent -tv && cat /etc/haproxy/conf.d/db-master.cfg

  • Start the failover: !log Failover m3 from db1107 to db1183 - T301219
  • Set phabricator in RO:

ssh phab1001
sudo /srv/phab/phabricator/bin/config set cluster.read-only true
# restart database server
sudo /srv/phab/phabricator/bin/config set cluster.read-only false

  • DB switchover

root@cumin1001:~/wmfmariadbpy/wmfmariadbpy# db-switchover --read-only-master --skip-slave-move db1107 db1183

  • Reload haproxies (dbproxy1020 is the active one)
dbproxy1016:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
dbproxy1020:   systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio
  • Kill connections on the old master (db1107)

pt-kill --print --kill --victims all --match-all F=/dev/null,S=/run/mysqld/mysqld.sock

  • Restart puppet on old and new masters (for heartbeat): db1107 and db1183

puppet agent --enable && puppet agent -tv

  • Check services affected: phabricator
  • Clean the orchestrator heartbeat table to remove the old master's entry, otherwise Orchestrator will show lag: T301219#7710051
  • Close this ticket and create a ticket to update m5: T301654
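
For the "Reload haproxies" step, the `show stat` output is CSV, so a quick check of which servers each proxy considers up can be scripted. The following is only a sketch: the sample output is hypothetical and heavily abbreviated (real output has many more columns), and the proxy name `mariadb` and the server `example-standby` are placeholders, not taken from the actual dbproxy configuration.

```python
import csv
import io

# Hypothetical, abbreviated "show stat" output. The proxy name "mariadb" and
# the server "example-standby" are placeholders for illustration only.
SAMPLE = """\
# pxname,svname,scur,status
mariadb,FRONTEND,12,OPEN
mariadb,db1183,12,UP
mariadb,example-standby,0,DOWN
mariadb,BACKEND,12,UP
"""


def server_status(stat_csv):
    """Map (proxy, server) -> status, skipping FRONTEND/BACKEND summary rows."""
    text = stat_csv.lstrip()
    # HAProxy prefixes the CSV header with "# "; strip it so DictReader
    # can use the header line for column names.
    if text.startswith("# "):
        text = text[2:]
    rows = csv.DictReader(io.StringIO(text))
    return {
        (row["pxname"], row["svname"]): row["status"]
        for row in rows
        if row["svname"] not in ("FRONTEND", "BACKEND")
    }


statuses = server_status(SAMPLE)
for (proxy, server), status in sorted(statuses.items()):
    print(f"{proxy}/{server}: {status}")
```

In practice the input would be the text captured from the `socat` command on each dbproxy host rather than the inline sample.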

Event Timeline

Marostegui triaged this task as Medium priority. (Feb 8 2022, 7:58 AM)
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui added a project: User-notice.
Marostegui updated the task description.

Hallo! For Tech News entry context, how many/which wikis will this affect?
I need to know whether to use either: (A) a specific wiki-name, (B) a number (e.g. "11" with a link to this task for a listing), or (C) "all" in the message.
I.e., the two standard messages we try to reuse are:

  • You will be able to read but not edit [name/number of wikis] for a few minutes on [$date]. This will happen around [$time]. This is for database maintenance.
  • All wikis will be read-only for a few minutes on [$date]. This is planned at [$time].

This only affects Phabricator; no wikis will be affected.

Thanks! I'll leave it out of Tech News then, as 1 minute of phab write-downtime won't affect most editors, and technical folks will presumably be notified by you/someone via wikitech-l@.

Yeah, that makes sense!
And indeed, I was planning to send an email to wikitech-l 24h before the maintenance window.

Thank you!

Change 762145 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1183: Enable notifications

https://gerrit.wikimedia.org/r/762145

Change 762145 merged by Marostegui:

[operations/puppet@production] db1183: Enable notifications

https://gerrit.wikimedia.org/r/762145

Change 762146 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1183 to m3 master

https://gerrit.wikimedia.org/r/762146

Marostegui updated the task description.

To clean up orchestrator:

delete from heartbeat where server_id=171966678
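
As a rough illustration of what this cleanup does, here is a sketch that runs the same delete against an in-memory SQLite table instead of the real MariaDB heartbeat table. The two-column schema, the timestamps, and the new master's `server_id` are assumptions made up for the example; only the old master's `server_id` comes from this task. Presumably the old master's row stops advancing after the switchover, which is why it would show up as lag.

```python
import sqlite3

OLD_MASTER_SERVER_ID = 171966678  # value from this task
NEW_MASTER_SERVER_ID = 1          # placeholder, NOT the real db1183 server_id

# SQLite stands in for MariaDB here; the schema is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat (server_id INTEGER PRIMARY KEY, ts TEXT)")
conn.executemany(
    "INSERT INTO heartbeat (server_id, ts) VALUES (?, ?)",
    [
        # Hypothetical timestamps: the old row is stale, the new one is fresh.
        (OLD_MASTER_SERVER_ID, "2022-02-15T08:00:03"),
        (NEW_MASTER_SERVER_ID, "2022-02-15T08:05:00"),
    ],
)


def remove_stale_heartbeat(db, server_id):
    """Delete one master's heartbeat row and return how many rows were removed."""
    cur = db.execute("DELETE FROM heartbeat WHERE server_id = ?", (server_id,))
    db.commit()
    return cur.rowcount


removed = remove_stale_heartbeat(conn, OLD_MASTER_SERVER_ID)
remaining = [row[0] for row in conn.execute("SELECT server_id FROM heartbeat")]
print(f"removed={removed} remaining={remaining}")
```

Checking the returned row count before moving on is a cheap guard against a mistyped `server_id`, since a count of 0 means nothing matched.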

Change 762146 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1183 to m3 master

https://gerrit.wikimedia.org/r/762146

All pre-failover steps are done

Mentioned in SAL (#wikimedia-operations) [2022-02-15T08:00:03Z] <marostegui> Failover m3 from db1107 to db1183 - T301219

Marostegui raised the priority of this task from Medium to Needs Triage. (Feb 15 2022, 8:01 AM)

testing priority change

Marostegui triaged this task as Medium priority. (Feb 15 2022, 8:01 AM)
Marostegui updated the task description.
Marostegui updated the task description.

All done. Read-only time was around 50 seconds.