Switchover x1 master (db1103 -> db1120)
Closed, Resolved, Public

Description

When: Thursday 30th June at 06:00 AM UTC

  • Team calendar invite

Affected wikis:
The x1 cluster is used by MediaWiki at WMF for databases that are "global" or "cross-wiki" in nature and are typically associated with a MediaWiki extension.

Affected features:

BounceHandler
Cognate
ContentTranslation
Echo
Flow
GrowthExperiments
ReadingLists
UrlShortener
WikimediaEditorTasks

Checklist:

NEW primary: db1120
OLD primary: db1103

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1103.eqiad.wmnet h=db1120.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x1 T300472" 'A:db-section-x1'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1120 set-weight 0
sudo dbctl config commit -m "Set db1120 with weight 0 T300472"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1103 db1120
  • Disable puppet on both nodes
sudo cumin 'db1103* or db1120*' 'disable-puppet "primary switchover T300472"'
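
The prep commands above differ between switchovers only in the host names, the section and the task ID, so they can be generated from variables for review before anything is run. A minimal sketch, assuming it is only used to print the commands (the function name `print_prep_commands` is hypothetical and not part of any WMF tooling):

```shell
# Hypothetical helper: prints the failover prep commands for review.
# It runs nothing; copy the output into a root shell once checked.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    old=$1 new=$2 section=$3 task=$4
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

For this task: `print_prep_commands db1103 db1120 x1 T300472`.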

Failover:

  • Log the failover:
!log Starting x1 eqiad failover from db1103 to db1120 - T300472
  • Switch primaries:
db-mysql db1103 -e "set global read_only=1"
sudo db-switchover --read-only-master --skip-slave-move db1103 db1120
echo "===== db1103 (OLD)"; sudo db-mysql db1103 -e 'show slave status\G'
echo "===== db1120 (NEW)"; sudo db-mysql db1120 -e 'show slave status\G'
db-mysql db1120 -e "set global read_only=0"
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section x1 set-master db1120
sudo dbctl config commit -m "Promote db1120 to x1 primary and set section read-write T300472"
  • Restart puppet on both hosts:
sudo cumin 'db1103* or db1120*' 'run-puppet-agent -e "primary switchover T300472"'
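
Since read_only is toggled by hand with `set global read_only` in the steps above, it can help to confirm the state on both hosts before and after the switch. A rough sketch, assuming `db-mysql` forwards the standard mysql client options `-B` (batch) and `-N` (skip column names); the `DB_MYSQL` variable and both function names are made up for illustration:

```shell
# Assumption: DB_MYSQL is an illustrative override so the helper can be
# tried without touching a real host. By default it calls db-mysql.
DB_MYSQL="${DB_MYSQL:-sudo db-mysql}"

# Print the global read_only value (0 or 1) for a host.
read_only_state() {
    $DB_MYSQL "$1" -BN -e 'SELECT @@global.read_only'
}

# Succeed only if the host's read_only matches the expected value.
expect_read_only() {
    actual=$(read_only_state "$1") || return 1
    if [ "$actual" = "$2" ]; then
        echo "OK: $1 read_only=$actual"
    else
        echo "MISMATCH: $1 read_only=$actual, expected $2" >&2
        return 1
    fi
}
```

After the last step of "Switch primaries", `expect_read_only db1103 1 && expect_read_only db1120 0` should pass.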

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1120 heartbeat -e "delete from heartbeat where file like 'db1103%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1120
events_coredb_slave.sql on the new slave db1103
sudo dbctl instance db1103 set-candidate-master --section x1 true
sudo dbctl instance db1120 set-candidate-master --section x1 false
(dborch1001): sudo orchestrator-client -c untag -i db1120 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1103 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 'x1';"
  • (If needed): Depool db1103 for maintenance.
sudo dbctl instance db1103 depool
sudo dbctl config commit -m "Depool db1103 T300472"
  • Apply outstanding schema changes to db1103 (if any) (None pending)
  • Update/resolve this ticket.
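
For the heartbeat clean-up step in the list above, the rows that the `delete` would remove can be counted first, using the same `WHERE` clause. A small sketch under the same assumptions as the checklist command (a `heartbeat` database with a `heartbeat` table and a `file` column); the `DB_MYSQL` variable and the function name are hypothetical, and `-BN` are the standard mysql client batch and skip-column-names options, assumed to be passed through by `db-mysql`:

```shell
# Assumption: DB_MYSQL is an illustrative override; by default it calls db-mysql.
DB_MYSQL="${DB_MYSQL:-sudo db-mysql}"

# Count heartbeat rows on the NEW primary that still reference the OLD one.
# Arguments: NEW primary host, OLD primary host name.
stale_heartbeat_rows() {
    $DB_MYSQL "$1" heartbeat -BN -e "SELECT COUNT(*) FROM heartbeat WHERE file LIKE '$2%'"
}
```

For this task: `stale_heartbeat_rows db1120 db1103`, run before and after the delete.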

Event Timeline

Marostegui updated the task description.
Marostegui added a project: DBA.
Marostegui moved this task from Triage to In progress on the DBA board.

Reminder that x1 cannot be set to RO from the MW side yet, so this needs to be done directly at the MySQL level: T298876

Marostegui renamed this task from Switchover x1 master (db1103 -> db1137) to Switchover x1 master (db1103 -> db1120). Jun 7 2022, 8:40 AM
Marostegui updated the task description.

db1120 needs to be rebooted for {T310485}

Done

This will be done Thursday 30th at 06:00 AM UTC

This month, I guess? :)

Heh, yeah sorry

Change 809605 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update x1-master CNAME

https://gerrit.wikimedia.org/r/809605

Change 809607 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db1120 to x1 master

https://gerrit.wikimedia.org/r/809607

Mentioned in SAL (#wikimedia-operations) [2022-06-30T05:17:02Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: Primary switchover x1 T300472

Mentioned in SAL (#wikimedia-operations) [2022-06-30T05:17:21Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: Primary switchover x1 T300472

Mentioned in SAL (#wikimedia-operations) [2022-06-30T05:17:31Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db1120 with weight 0 T300472', diff saved to https://phabricator.wikimedia.org/P30632 and previous config saved to /var/cache/conftool/dbconfig/20220630-051730-root.json

Change 809607 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1120 to x1 master

https://gerrit.wikimedia.org/r/809607

Mentioned in SAL (#wikimedia-operations) [2022-06-30T06:03:24Z] <marostegui> Starting x1 eqiad failover from db1103 to db1120 - T300472

Mentioned in SAL (#wikimedia-operations) [2022-06-30T06:06:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db1120 to x1 primary and set section read-write T300472', diff saved to https://phabricator.wikimedia.org/P30633 and previous config saved to /var/cache/conftool/dbconfig/20220630-060601-root.json

Change 809605 merged by Marostegui:

[operations/dns@master] wmnet: Update x1-master CNAME

https://gerrit.wikimedia.org/r/809605

This was all done
RO start: 06:04:57
RO stop: 06:06:02

Total RO time: 65 seconds
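
The read-only window can be cross-checked from the two timestamps. A minimal sketch, assuming both times are HH:MM:SS on the same UTC day (the function names are illustrative):

```shell
# Convert HH:MM:SS to seconds since midnight.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    # Split the argument on ':' into the positional parameters.
    set -- $1
    IFS=$old_ifs
    # Strip one leading zero so values such as 08 are not read as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# Print the number of seconds between RO start and RO stop.
ro_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

Usage: `ro_seconds 06:04:57 06:06:02`.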