Switchover s6 master (db2214 -> db2229)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist

Checklist:

NEW primary: db2229
OLD primary: db2214

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2214.codfw.wmnet h=db2229.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s6 T399533" 'A:db-section-s6'
  • Set NEW primary with weight 0
sudo dbctl instance db2229 set-weight 0
sudo dbctl config commit -m "Set db2229 with weight 0 T399533"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2229 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2229 from API/vslow/dump T399533"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2214 db2229
  • Disable puppet on both nodes
sudo cumin 'db2214* or db2229*' 'disable-puppet "primary switchover T399533"'
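
The prep steps above differ between switchovers only in the host names, section, and task ID. As a minimal sketch, assuming a hypothetical helper named `prep_commands` (not part of any existing tooling), the non-interactive commands can be printed from those four values for review before being run by hand. It executes nothing and skips the interactive `dbctl instance ... edit` step.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the non-interactive failover-prep
# commands from the checklist for a given host pair. It runs nothing.
prep_commands() {
  local old="$1" new="$2" section="$3" task="$4"
  # Unquoted heredoc delimiter: variables expand, quotes are kept literally.
  cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Print the commands for this task's hosts.
prep_commands db2214 db2229 s6 T399533
```

Printing rather than executing keeps the operator in control of each step, which matters because every command here changes production state.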

Failover:

  • Log the failover:
!log Starting s6 codfw failover from db2214 to db2229 - T399533
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2214 db2229
echo "===== db2214 (OLD)"; sudo db-mysql db2214 -e 'show slave status\G'
echo "===== db2229 (NEW)"; sudo db-mysql db2229 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s6 set-master db2229
sudo dbctl config commit -m "Promote db2229 to s6 primary T399533"
  • Clean up heartbeat table(s).
sudo db-mysql db2229 heartbeat -e "delete from heartbeat where file like 'db2214%';"
  • Restart puppet on both hosts:
sudo cumin 'db2214* or db2229*' 'run-puppet-agent -e "primary switchover T399533"'
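
The two `show slave status\G` commands in the switch step print many fields per host. As a rough sketch, assuming that healthy replication is indicated by `Yes` in both the `Slave_IO_Running` and `Slave_SQL_Running` fields, a small filter can reduce that output to a pass/fail exit code. The function name `replication_ok` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical filter: reads 'show slave status\G' output on stdin and
# exits 0 only if both replication threads report "Yes".
replication_ok() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }            # strip the \G left padding
    $1 == "Slave_IO_Running"  { io = $2 }
    $1 == "Slave_SQL_Running" { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}
```

A possible use, reusing the command from the checklist: `sudo db-mysql db2214 -e 'show slave status\G' | replication_ok && echo "db2214 replicating"`. Empty output (no replication configured) counts as a failure here, so the full status should still be read when the check does not pass.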

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2229
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2214
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2214 set-candidate-master --section s6 true
sudo dbctl instance db2229 set-candidate-master --section s6 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2229 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2214 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's6';"
  • (If needed): Depool db2214 for maintenance.
sudo dbctl instance db2214 depool
sudo dbctl config commit -m "Depool db2214 T399533"
  • Change db2214 weight to mimic the previous weight of db2229 (main/api/vslow/dumps):
sudo dbctl instance db2214 edit
  • Update/resolve this ticket.
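
The query-killer step pipes `curl` through `base64 -d` because Gitiles returns file contents base64-encoded when `?format=TEXT` is requested. As a sketch under that assumption, the two hypothetical helpers below (`events_url` and `fetch_events`) build the same URLs used above and decode the SQL so it can be read before it is piped into `db-mysql`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reviewing the event definitions before applying them.

# Build the Gitiles URL for a role: "master" or "slave".
events_url() {
  local role="$1"
  printf '%s\n' "https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_${role}.sql?format=TEXT"
}

# Fetch the base64-encoded file and decode it to plain SQL on stdout.
fetch_events() {
  curl -sS "$(events_url "$1")" | base64 -d
}
```

For example, `fetch_events master | less` shows the SQL for review, and `fetch_events master | sudo db-mysql db2229` is equivalent to the first command in the checklist.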

Event Timeline

Change #1169306 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2229 to s6 master

https://gerrit.wikimedia.org/r/1169306

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-07-16T08:06:39Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2229 with weight 0 T399533', diff saved to https://phabricator.wikimedia.org/P79185 and previous config saved to /var/cache/conftool/dbconfig/20250716-080639-root.json

Mentioned in SAL (#wikimedia-operations) [2025-07-16T08:06:50Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s6 T399533

Change #1169306 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2229 to s6 master

https://gerrit.wikimedia.org/r/1169306

Mentioned in SAL (#wikimedia-operations) [2025-07-16T08:12:40Z] <marostegui> Starting s6 codfw failover from db2214 to db2229 - T399533

Mentioned in SAL (#wikimedia-operations) [2025-07-16T08:13:03Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2229 to s6 primary T399533', diff saved to https://phabricator.wikimedia.org/P79186 and previous config saved to /var/cache/conftool/dbconfig/20250716-081302-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-07-16T08:13:52Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2214 T399533', diff saved to https://phabricator.wikimedia.org/P79187 and previous config saved to /var/cache/conftool/dbconfig/20250716-081350-marostegui.json

Marostegui updated the task description.

Done