Switchover x1 master (db1220 -> db1237)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: TO-DO

Checklist:

NEW primary: db1237
OLD primary: db1220

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1220.eqiad.wmnet h=db1237.eqiad.wmnet
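
A minimal sketch of how this check could gate the rest of the procedure, assuming pt-config-diff returns a non-zero exit status when it finds differences (as its documentation describes). The helper name check_config_diff and the DIFF_CMD override are hypothetical, added only so the logic can be dry-run without touching a database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the checklist's config diff and stop on any difference.
# DIFF_CMD is an assumed override for dry runs; by default the real command is used.
check_config_diff() {
    local old="$1" new="$2"
    # Same invocation as in the checklist, with the hostnames parameterized.
    if ${DIFF_CMD:-sudo pt-config-diff} --defaults-file /root/.my.cnf \
        "h=${old}.eqiad.wmnet" "h=${new}.eqiad.wmnet"; then
        echo "no differences between ${old} and ${new}"
    else
        # A non-zero status can also mean the tool itself failed to connect.
        echo "differences found (or the check failed); review before continuing" >&2
        return 1
    fi
}

# Usage (not executed here):
#   check_config_diff db1220 db1237 || exit 1
```

Treating every non-zero status as a stop condition is deliberately conservative: a connection failure and a real configuration difference both deserve a look before the switchover proceeds.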

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover x1 T387557" 'A:db-section-x1'
  • Set NEW primary with weight 0
sudo dbctl instance db1237 set-weight 0
sudo dbctl config commit -m "Set db1237 with weight 0 T387557"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1237 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1237 from API/vslow/dump T387557"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db1220 db1237
  • Disable puppet on both nodes
sudo cumin 'db1220* or db1237*' 'disable-puppet "primary switchover T387557"'
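
The prep steps above differ between switchovers only in the host names, the section and the task ID, so they can be generated from a template. The following is a sketch, not an existing tool: print_prep_commands is a hypothetical function, and it assumes the 'A:db-section-<section>' alias pattern holds for other sections as it does for x1. It only prints the commands for review; it does not run them, and the conditional commit after "dbctl instance ... edit" is left to the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the failover prep commands for a given switchover.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quote characters are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo dbctl instance ${new} edit
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Usage (prints the commands, runs nothing):
#   print_prep_commands db1220 db1237 x1 T387557
```

Printing rather than executing keeps a human in the loop, which matches how the checklist is meant to be followed step by step.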

Failover:

  • Log the failover:
!log Starting x1 eqiad failover from db1220 to db1237 - T387557
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db1220 db1237
echo "===== db1220 (OLD)"; sudo db-mysql db1220 -e 'show slave status\G'
echo "===== db1237 (NEW)"; sudo db-mysql db1237 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope eqiad section x1 set-master db1237
sudo dbctl config commit -m "Promote db1237 to x1 primary T387557"
  • Clean up heartbeat table(s).
sudo db-mysql db1237 heartbeat -e "delete from heartbeat where file like 'db1220%';"
  • Restart puppet on both hosts:
sudo cumin 'db1220* or db1237*' 'run-puppet-agent -e "primary switchover T387557"'
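
The two "show slave status\G" calls above are read by eye. As a sketch of how that check could be scripted, the functions below extract the Master_Host field from the vertical output and compare its short hostname with the expected one. Both function names are hypothetical, and the check assumes that after the switch the OLD primary replicates from the NEW one.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking "show slave status\G" output read from stdin.

# Print the value of the Master_Host field, or nothing if the field is absent.
master_host_from_status() {
    awk -F': ' '/^[[:space:]]*Master_Host:/ { print $2; exit }'
}

# Succeed only if the status on stdin reports the expected host as its master.
# The comparison uses the short hostname, so db1237 matches db1237.eqiad.wmnet.
replicates_from() {
    local expected="$1" actual
    actual="$(master_host_from_status)"
    [ "${actual%%.*}" = "$expected" ]
}

# Usage (not executed here):
#   sudo db-mysql db1220 -e 'show slave status\G' | replicates_from db1237 \
#       && echo "db1220 now replicates from db1237"
```

An empty status (no replication configured) makes replicates_from fail, which is the safe outcome for this check.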

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1237
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1220
  • Update candidate primary in dbctl and orchestrator notes
sudo dbctl instance db1220 set-candidate-master --section x1 true
sudo dbctl instance db1237 set-candidate-master --section x1 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1237 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1220 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'x1';"
  • (If needed): Depool db1220 for maintenance.
sudo dbctl instance db1220 depool
sudo dbctl config commit -m "Depool db1220 T387557"
  • Change db1220 weight to mimic the previous weight of db1237:
sudo dbctl instance db1220 edit
  • Update/resolve this ticket.
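
The query killer step pipes SQL straight from Gitiles into the database; the "?format=TEXT" endpoint returns the file base64-encoded, which is why the pipeline includes "base64 -d". A more cautious variant, sketched below with two hypothetical helper names, fetches and decodes the file first so it can be reviewed, and fails instead of passing an HTTP error page along.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for fetching the event definitions used in the clean up step.

# Build the Gitiles URL for a file under dbtools/ (same URL shape as in the checklist).
gitiles_text_url() {
    local file="$1"
    echo "https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/${file}?format=TEXT"
}

# Download and decode one file; returns non-zero if the HTTP request fails.
fetch_event_sql() {
    local file="$1" encoded
    encoded="$(curl -sS --fail "$(gitiles_text_url "$file")")" || return 1
    printf '%s' "$encoded" | base64 -d
}

# Usage (not executed here):
#   fetch_event_sql events_coredb_master.sql > /tmp/events_master.sql
#   less /tmp/events_master.sql          # review before applying
#   sudo db-mysql db1237 < /tmp/events_master.sql
```

Feeding the reviewed file to db-mysql on stdin is equivalent to the pipe used in the checklist, with the download and the apply separated into two steps.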

Event Timeline

Change #1123615 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1237 to x1 master

https://gerrit.wikimedia.org/r/1123615

Marostegui triaged this task as Medium priority.
Marostegui added a parent task: Restricted Task.
Marostegui updated the task description.
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2025-03-03T12:16:24Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1237 with weight 0 T387557', diff saved to https://phabricator.wikimedia.org/P73946 and previous config saved to /var/cache/conftool/dbconfig/20250303-121623-root.json

Mentioned in SAL (#wikimedia-operations) [2025-03-03T12:16:40Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 13 hosts with reason: Primary switchover x1 T387557

Change #1123615 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1237 to x1 master

https://gerrit.wikimedia.org/r/1123615

Mentioned in SAL (#wikimedia-operations) [2025-03-03T12:22:47Z] <marostegui> Starting x1 eqiad failover from db1220 to db1237 - T387557

Mentioned in SAL (#wikimedia-operations) [2025-03-03T12:23:05Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1237 to x1 primary T387557', diff saved to https://phabricator.wikimedia.org/P73947 and previous config saved to /var/cache/conftool/dbconfig/20250303-122304-root.json

Mentioned in SAL (#wikimedia-operations) [2025-03-03T12:24:38Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1220 T387557', diff saved to https://phabricator.wikimedia.org/P73948 and previous config saved to /var/cache/conftool/dbconfig/20250303-122437-marostegui.json

Marostegui updated the task description.

Done