
Switchover x3 master (db1255 -> db1258)
Closed, Resolved · Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: TO-DO

Checklist:

NEW primary: db1258
OLD primary: db1255

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1255.eqiad.wmnet h=db1258.eqiad.wmnet
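
A minimal sketch of gating on the result of the configuration check above. It assumes pt-config-diff exits 0 when the two configurations match and non-zero when they differ or the tool fails. The helper name `config_diff_gate` is hypothetical and not part of any WMF tooling.

```shell
# Hypothetical helper: run the given command and summarise its exit status.
# Assumption: the wrapped command exits 0 only when no differences exist.
config_diff_gate() {
    if "$@"; then
        echo "configs match"
    else
        status=$?
        echo "review needed (exit $status)"
        return "$status"
    fi
}

# Intended use (not executed here):
# config_diff_gate sudo pt-config-diff --defaults-file /root/.my.cnf \
#     h=db1255.eqiad.wmnet h=db1258.eqiad.wmnet
```

A non-zero result only means the output needs a human look; some differences between primaries may be expected.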

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x3 T399699" 'A:db-section-x3'
  • Set NEW primary with weight 0
sudo dbctl instance db1258 set-weight 0
sudo dbctl config commit -m "Set db1258 with weight 0 T399699"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1258 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1258 from API/vslow/dump T399699"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1255 db1258
  • Disable puppet on both nodes
sudo cumin 'db1255* or db1258*' 'disable-puppet "primary switchover T399699"'
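
The prep steps above repeat the same four values (old primary, new primary, section, task). As a sketch only, they can be printed from variables so that a wrong hostname is caught before anything runs. The function name `print_prep_commands` is hypothetical; the commands themselves are copied from the checklist, and the interactive `dbctl instance ... edit` step is left out on purpose.

```shell
# Values taken from this task; adjust for another switchover.
OLD=db1255
NEW=db1258
SECTION=x3
TASK=T399699

# Print (never execute) the non-interactive prep commands with values filled in.
print_prep_commands() {
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${SECTION} ${TASK}" 'A:db-section-${SECTION}'
sudo dbctl instance ${NEW} set-weight 0
sudo dbctl config commit -m "Set ${NEW} with weight 0 ${TASK}"
sudo db-switchover --timeout=25 --only-slave-move ${OLD} ${NEW}
sudo cumin '${OLD}* or ${NEW}*' 'disable-puppet "primary switchover ${TASK}"'
EOF
}

print_prep_commands
```

Printing instead of executing keeps the operator in the loop: each line is still pasted and run by hand, as the checklist intends.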

Failover:

  • Log the failover:
!log Starting x3 eqiad failover from db1255 to db1258 - T399699
  • Set section read-only:
sudo dbctl --scope eqiad section x3 ro "Maintenance until 06:15 UTC - T399699"
sudo dbctl --scope codfw section x3 ro "Maintenance until 06:15 UTC - T399699"
sudo dbctl config commit -m "Set x3 eqiad as read-only for maintenance - T399699"
  • Check x3 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1255 db1258
echo "===== db1255 (OLD)"; sudo db-mysql db1255 -e 'show slave status\G'
echo "===== db1258 (NEW)"; sudo db-mysql db1258 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section x3 set-master db1258
sudo dbctl --scope eqiad section x3 rw
sudo dbctl --scope codfw section x3 rw
sudo dbctl config commit -m "Promote db1258 to x3 primary and set section read-write T399699"
  • Clean up heartbeat table(s).
sudo db-mysql db1258 heartbeat -e "delete from heartbeat where file like 'db1255%';"
  • Restart puppet on both hosts:
sudo cumin 'db1255* or db1258*' 'run-puppet-agent -e "primary switchover T399699"'
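
The two `show slave status\G` commands in the "Switch primaries" step produce long output. A rough sketch of summarising it follows. It assumes the standard MariaDB field names `Slave_IO_Running` and `Slave_SQL_Running`; the function name `classify_slave_status` is hypothetical. Which result is expected for each host depends on the section's topology, so the helper only reports and does not judge.

```shell
# Hypothetical helper: read `show slave status\G` output on stdin and print
#   none         no replication fields present (empty result set)
#   running      both the IO and SQL threads report Yes
#   not-running  anything else (stopped, Connecting, ...)
classify_slave_status() {
    awk '
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END {
            if (io == "" && sql == "")          print "none"
            else if (io == "Yes" && sql == "Yes") print "running"
            else                                 print "not-running"
        }'
}

# Intended use (not executed here):
# sudo db-mysql db1255 -e 'show slave status\G' | classify_slave_status
# sudo db-mysql db1258 -e 'show slave status\G' | classify_slave_status
```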

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1258
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1255
  • Update DNS (https://gerrit.wikimedia.org/r/1170099):
sudo authdns-update
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1255 set-candidate-master --section x3 true
sudo dbctl instance db1258 set-candidate-master --section x3 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1258 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1255 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'x3';"
  • (If needed): Depool db1255 for maintenance.
sudo dbctl instance db1255 depool
sudo dbctl config commit -m "Depool db1255 T399699"
  • Change db1255 weight to mimic the previous weight of db1258 (main/api/vslow/dumps):
sudo dbctl instance db1255 edit
  • Update/resolve this ticket.
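
On the query killer step above: Gitiles returns file contents base64-encoded when `?format=TEXT` is requested, which is why the pipeline includes `base64 -d`. If the download fails, that pipeline feeds empty input to the database client without any warning. A sketch of a guard, with a hypothetical function name `decode_gitiles`:

```shell
# Hypothetical helper: decode a Gitiles ?format=TEXT payload from stdin.
# Fails (non-zero, nothing on stdout) if decoding fails or the result is empty.
decode_gitiles() {
    decoded=$(base64 -d) || return 1
    if [ -z "$decoded" ]; then
        echo "empty payload, refusing to continue" >&2
        return 1
    fi
    printf '%s\n' "$decoded"
}

# Intended use (not executed here); URL elided, same as in the checklist:
# curl -sS '<gitiles URL>?format=TEXT' | decode_gitiles > /tmp/events.sql \
#     && sudo db-mysql db1258 < /tmp/events.sql
```

Writing to a file first also gives a chance to read the SQL before applying it.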

Event Timeline

Change #1170098 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1258 to x3 master

https://gerrit.wikimedia.org/r/1170098

Change #1170099 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update x3-master alias

https://gerrit.wikimedia.org/r/1170099

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.

Installed the patched version on the current master

Mentioned in SAL (#wikimedia-operations) [2025-07-17T06:06:30Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1258 with weight 0 T399699', diff saved to https://phabricator.wikimedia.org/P79286 and previous config saved to /var/cache/conftool/dbconfig/20250717-060629-root.json

Mentioned in SAL (#wikimedia-operations) [2025-07-17T06:06:33Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 15 hosts with reason: Primary switchover x3 T399699

Change #1170098 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1258 to x3 master

https://gerrit.wikimedia.org/r/1170098

Mentioned in SAL (#wikimedia-operations) [2025-07-17T06:09:29Z] <marostegui> Starting x3 eqiad failover from db1255 to db1258 - T399699

Change #1170225 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbconfig.schema: Add x3

https://gerrit.wikimedia.org/r/1170225

Change #1170225 merged by Marostegui:

[operations/puppet@production] dbconfig.schema: Add x3

https://gerrit.wikimedia.org/r/1170225

Mentioned in SAL (#wikimedia-operations) [2025-07-17T06:18:01Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set x3 eqiad as read-only for maintenance - T399699', diff saved to https://phabricator.wikimedia.org/P79287 and previous config saved to /var/cache/conftool/dbconfig/20250717-061800-root.json

Mentioned in SAL (#wikimedia-operations) [2025-07-17T06:18:33Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1258 to x3 primary and set section read-write T399699', diff saved to https://phabricator.wikimedia.org/P79288 and previous config saved to /var/cache/conftool/dbconfig/20250717-061832-marostegui.json

Change #1170099 merged by Marostegui:

[operations/dns@master] wmnet: Update x3-master alias

https://gerrit.wikimedia.org/r/1170099

Mentioned in SAL (#wikimedia-operations) [2025-07-17T06:19:43Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1255 T399699', diff saved to https://phabricator.wikimedia.org/P79289 and previous config saved to /var/cache/conftool/dbconfig/20250717-061943-marostegui.json
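
For reference, the time between the read-only commit (06:18:01Z) and the read-write commit (06:18:33Z) logged above can be computed as follows. This is a sketch with a hypothetical helper name, `seconds_between`. The SAL timestamps mark when each dbctl commit was logged, so the result only approximates the actual read-only window.

```shell
# Hypothetical helper: seconds from the first HH:MM:SS to the second,
# assuming both fall on the same UTC day.
seconds_between() {
    printf '%s %s\n' "$1" "$2" | awk '{
        split($1, a, ":"); split($2, b, ":")
        print (b[1] * 3600 + b[2] * 60 + b[3]) - (a[1] * 3600 + a[2] * 60 + a[3])
    }'
}

seconds_between 06:18:01 06:18:33   # prints 32
```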