
Switchover s3 master (db2209 -> db2205)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

Checklist:

NEW primary: db2205
OLD primary: db2209

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2209.codfw.wmnet h=db2205.codfw.wmnet
  • Patch primary master with the new version

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T399930" 'A:db-section-s3'
  • Set NEW primary with weight 0
sudo dbctl instance db2205 set-weight 0
sudo dbctl config commit -m "Set db2205 with weight 0 T399930"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2205 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2205 from API/vslow/dump T399930"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2209 db2205
  • Disable puppet on both nodes
sudo cumin 'db2209* or db2205*' 'disable-puppet "primary switchover T399930"'
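The prep commands above differ from one switchover to the next only in the OLD host, the NEW host, the section, and the task ID. A minimal sketch of templating them follows. The function name `print_prep_commands` is hypothetical and not part of any existing tooling; it only prints the commands for review and runs nothing. The interactive `dbctl instance ... edit` step is left out because it cannot be templated.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not existing tooling): print the
# non-interactive failover-prep commands for review. Nothing is executed.
set -eu

print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quotes and '*' stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Values taken from this task.
print_prep_commands db2209 db2205 s3 T399930
```

Printing rather than executing keeps a human in the loop: each line can be compared against the checklist before it is pasted into a shell.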

Failover:

  • Log the failover:
!log Starting s3 codfw failover from db2209 to db2205 - T399930
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2209 db2205
echo "===== db2209 (OLD)"; sudo db-mysql db2209 -e 'show slave status\G'
echo "===== db2205 (NEW)"; sudo db-mysql db2205 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s3 set-master db2205
sudo dbctl config commit -m "Promote db2205 to s3 primary T399930"
  • Clean up heartbeat table(s).
sudo db-mysql db2205 heartbeat -e "delete from heartbeat where file like 'db2209%';"
  • Restart puppet on both hosts:
sudo cumin 'db2209* or db2205*' 'run-puppet-agent -e "primary switchover T399930"'
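The `show slave status\G` output printed during the switch is long, and only a few fields are usually needed at a glance. A small sketch that condenses it to one line is shown here. The function name `replication_summary` is hypothetical; the field names (`Master_Host`, `Slave_IO_Running`, `Slave_SQL_Running`) are standard columns of `SHOW SLAVE STATUS`. It makes no judgment about which values are correct for either host, since that depends on the topology.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not existing tooling): condense
# `show slave status\G` output read from stdin into a single line.
set -eu

replication_summary() {
    awk -F': ' '
        /^ *Master_Host:/       { host = $2 }
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END {
            if (host == "") print "no Master_Host found in input"
            else printf "master=%s io=%s sql=%s\n", host, io, sql
        }'
}

# Illustrative input only; real input would come from a pipeline such as:
#   sudo db-mysql db2209 -e 'show slave status\G' | replication_summary
printf '%s\n' \
    '                  Master_Host: db2205.codfw.wmnet' \
    '             Slave_IO_Running: Yes' \
    '            Slave_SQL_Running: Yes' | replication_summary
```

The patterns require a colon directly after the field name, so `Slave_SQL_Running_State` is not mistaken for `Slave_SQL_Running`.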

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2205
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2209
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2209 set-candidate-master --section s3 true
sudo dbctl instance db2205 set-candidate-master --section s3 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2205 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2209 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's3';"
  • (If needed): Depool db2209 for maintenance.
sudo dbctl instance db2209 depool
sudo dbctl config commit -m "Depool db2209 T399930"
  • Change db2209's weights to mimic db2205's previous weights (main/api/vslow/dumps):
sudo dbctl instance db2209 edit
  • Update/resolve this ticket.
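In the query killer step above, Gitiles returns the file base64-encoded when `?format=TEXT` is requested, which is why each pipeline includes `base64 -d`. A hedged variation, for anyone who prefers to read the SQL before applying it, is to decode into a temporary file first. The function name `stage_sql` is hypothetical, and the input below is a stand-in for the `curl` output, not the real events file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not existing tooling): decode base64
# from stdin into a temp file and print the file's path for review.
set -eu

stage_sql() {
    local out
    out="$(mktemp)"
    base64 -d > "$out"
    printf '%s\n' "$out"
}

# Stand-in for the curl output: the base64 encoding of "SELECT 1;".
staged="$(printf 'U0VMRUNUIDE7Cg==' | stage_sql)"
cat "$staged"     # review the decoded SQL here
rm -f "$staged"   # the real workflow would feed the file to db-mysql first
```

After review, the staged file could be fed to the same `sudo db-mysql <host>` command the checklist uses, for example via input redirection.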

Event Timeline

Change #1170499 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2205 to s3 master

https://gerrit.wikimedia.org/r/1170499

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-07-21T08:05:29Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2205 with weight 0 T399930', diff saved to https://phabricator.wikimedia.org/P79476 and previous config saved to /var/cache/conftool/dbconfig/20250721-080528-root.json

Mentioned in SAL (#wikimedia-operations) [2025-07-21T08:05:39Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T399930

Patched the master; ready for the switch.
Old weight: 400

Change #1170499 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2205 to s3 master

https://gerrit.wikimedia.org/r/1170499

Mentioned in SAL (#wikimedia-operations) [2025-07-21T08:08:36Z] <marostegui> Starting s3 codfw failover from db2209 to db2205 - T399930

Mentioned in SAL (#wikimedia-operations) [2025-07-21T08:09:08Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2205 to s3 primary T399930', diff saved to https://phabricator.wikimedia.org/P79478 and previous config saved to /var/cache/conftool/dbconfig/20250721-080907-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-07-21T08:09:51Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2209 T399930', diff saved to https://phabricator.wikimedia.org/P79479 and previous config saved to /var/cache/conftool/dbconfig/20250721-080951-marostegui.json

Marostegui updated the task description.

Done