Switchover s5 master (db1183 -> db1230)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist

Checklist:

NEW primary: db1230
OLD primary: db1183

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1183.eqiad.wmnet h=db1230.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T385147" 'A:db-section-s5'
  • Set NEW primary with weight 0
sudo dbctl instance db1230 set-weight 0
sudo dbctl config commit -m "Set db1230 with weight 0 T385147"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1230 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1230 from API/vslow/dump T385147"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db1183 db1230
  • Disable puppet on both nodes
sudo cumin 'db1183* or db1230*' 'disable-puppet "primary switchover T385147"'
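The prep commands above differ between switchovers only in the two hostnames, the section and the task ID. A minimal sketch of printing them for review from those four values, without running anything; the function name print_prep_commands is hypothetical and not part of the WMF tooling, and the interactive "dbctl instance ... edit" step is deliberately left out:

```shell
# Hypothetical helper: print (do NOT execute) the non-interactive failover-prep
# commands from the checklist for a given OLD primary, NEW primary, section and task.
# The body runs in a subshell so the variables do not leak into the caller.
print_prep_commands() (
    old="$1"; new="$2"; section="$3"; task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $section $task" 'A:db-section-$section'
sudo dbctl instance $new set-weight 0
sudo dbctl config commit -m "Set $new with weight 0 $task"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move $old $new
sudo cumin '$old* or $new*' 'disable-puppet "primary switchover $task"'
EOF
)
```

Running print_prep_commands db1183 db1230 s5 T385147 reproduces the commands listed in this task, which makes it easy to diff them against the checklist before pasting.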

Failover:

  • Log the failover:
!log Starting s5 eqiad failover from db1183 to db1230 - T385147
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db1183 db1230
echo "===== db1183 (OLD)"; sudo db-mysql db1183 -e 'show slave status\G'
echo "===== db1230 (NEW)"; sudo db-mysql db1230 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope eqiad section s5 set-master db1230
sudo dbctl config commit -m "Promote db1230 to s5 primary T385147"
  • Clean up heartbeat table(s).
sudo db-mysql db1230 heartbeat -e "delete from heartbeat where file like 'db1183%';"
  • Restart puppet on both hosts:
sudo cumin 'db1183* or db1230*' 'run-puppet-agent -e "primary switchover T385147"'
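The two "show slave status\G" checks in the switch step are read by eye. A minimal sketch of automating that read, assuming the standard MariaDB field names Master_Host, Slave_IO_Running and Slave_SQL_Running; the function name replication_ok is hypothetical and not part of the WMF tooling:

```shell
# Hypothetical helper: read `show slave status\G` output on stdin, print a
# one-line summary, and return 0 only if both replication threads report "Yes".
replication_ok() {
    awk -F': ' '
        /^ *Master_Host:/       { host = $2 }
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END {
            printf "master=%s io=%s sql=%s\n", host, io, sql
            exit !(io == "Yes" && sql == "Yes")
        }'
}

# Possible usage (not executed here):
#   sudo db-mysql db1183 -e 'show slave status\G' | replication_ok
```

After a successful switch, the OLD primary would presumably report the NEW primary as its Master_Host; that expectation is an assumption about db-switchover behaviour and should be confirmed against the full command output.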

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1230
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1183
  • Update candidate primary in dbctl and orchestrator notes
sudo dbctl instance db1183 set-candidate-master --section s5 true
sudo dbctl instance db1230 set-candidate-master --section s5 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1230 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1183 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's5';"
  • (If needed): Depool db1183 for maintenance.
sudo dbctl instance db1183 depool
sudo dbctl config commit -m "Depool db1183 T385147"
  • Change db1183 weight to mimic the previous weight of db1230:
sudo dbctl instance db1183 edit
  • Update/resolve this ticket.
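The query killer step pipes through "base64 -d" because Gitiles serves file contents base64-encoded when "?format=TEXT" is requested. A minimal sketch of staging the decoded SQL in a file first, so it can be reviewed and so an empty download is caught before anything reaches the database; the function names decode_to_file and fetch_gitiles_sql are hypothetical, and this is an alternative to the direct pipe, not the procedure used in this task:

```shell
# Hypothetical helper: decode a base64 stream from stdin into the file "$1".
# Returns non-zero if decoding fails or the decoded file is empty.
decode_to_file() {
    base64 -d > "$1" && [ -s "$1" ]
}

# Hypothetical wrapper: download a Gitiles "?format=TEXT" URL ($1) and stage
# the decoded content in the file $2. With -f, curl prints nothing on an HTTP
# error, so the empty-file check above also catches a failed download.
fetch_gitiles_sql() {
    curl -sSf "$1" | decode_to_file "$2"
}

# Possible usage (not executed here; the URL is the one from the checklist):
#   fetch_gitiles_sql 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' /tmp/events_master.sql \
#     && sudo db-mysql db1230 < /tmp/events_master.sql
```

The staging path /tmp/events_master.sql is only an illustration; any writable location works.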

Event Timeline

Change #1115325 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1230 to s5 master

https://gerrit.wikimedia.org/r/1115325

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-01-30T09:32:21Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1230 with weight 0 T385147', diff saved to https://phabricator.wikimedia.org/P72839 and previous config saved to /var/cache/conftool/dbconfig/20250130-093221-root.json

Mentioned in SAL (#wikimedia-operations) [2025-01-30T09:32:49Z] <marostegui@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T385147

Change #1115325 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1230 to s5 master

https://gerrit.wikimedia.org/r/1115325

Mentioned in SAL (#wikimedia-operations) [2025-01-30T09:38:06Z] <marostegui> Starting s5 eqiad failover from db1183 to db1230 - T385147

Mentioned in SAL (#wikimedia-operations) [2025-01-30T09:38:49Z] <marostegui@cumin2002> dbctl commit (dc=all): 'Promote db1230 to s5 primary T385147', diff saved to https://phabricator.wikimedia.org/P72840 and previous config saved to /var/cache/conftool/dbconfig/20250130-093845-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-01-30T09:39:27Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1183 T385147', diff saved to https://phabricator.wikimedia.org/P72841 and previous config saved to /var/cache/conftool/dbconfig/20250130-093927-marostegui.json