Switchover s1 master (db1184 -> db1163)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s1.dblist

Checklist:

NEW primary: db1163
OLD primary: db1184

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1184.eqiad.wmnet h=db1163.eqiad.wmnet
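A minimal sketch of how the result of the configuration check could be made explicit before moving on. The helper name `check_config_diff` is hypothetical, and it assumes the wrapped command exits 0 when there are no differences and non-zero otherwise, which is how pt-config-diff documents its exit status.

```shell
# Hypothetical helper: run a config-diff command and report its outcome.
# Assumption: exit status 0 means "no differences"; anything else means
# differences were found or the command itself failed.
check_config_diff() {
    if "$@"; then
        echo "no differences"
    else
        echo "differences found or command failed: review before continuing"
        return 1
    fi
}

# Example call (not executed here):
# check_config_diff sudo pt-config-diff --defaults-file /root/.my.cnf \
#     h=db1184.eqiad.wmnet h=db1163.eqiad.wmnet
```

Any reported difference still needs a human decision; the wrapper only makes it harder to overlook a non-zero exit.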

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s1 T416480" 'A:db-section-s1'
  • Set NEW primary with weight 0
sudo dbctl instance db1163 set-weight 0
sudo dbctl config commit -m "Set db1163 with weight 0 T416480"
  • Topology changes: move all replicas under the NEW primary. Open orchestrator to monitor the process and check it at the end.
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db1184 db1163
  • Disable puppet on both nodes
sudo cumin 'db1184* or db1163*' 'disable-puppet "primary switchover T416480"'
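The prep steps above are order-sensitive, so it can help to rehearse them before the window. The following is only a sketch of a dry-run wrapper; `run_step` and the `DRY_RUN` variable are hypothetical names and not part of any tool used in this checklist.

```shell
# Hypothetical dry-run wrapper: always print the step, execute it only
# when DRY_RUN is set to 0. Defaults to dry-run for safety.
DRY_RUN=${DRY_RUN:-1}

run_step() {
    # "$*" is for display only; it does not preserve the original quoting.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Example call (prints the command, runs nothing while DRY_RUN=1):
# run_step sudo dbctl instance db1163 set-weight 0
```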

Failover:

  • Log the failover:
!log Starting s1 eqiad failover from db1184 to db1163 - T416480
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db1184 db1163
echo "===== db1184 (OLD)"; sudo db-mysql db1184 -e 'show slave status\G'
echo "===== db1163 (NEW)"; sudo db-mysql db1163 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope eqiad section s1 set-master db1163
sudo dbctl config commit -m "Promote db1163 to s1 primary T416480"
  • Clean up heartbeat table(s).
sudo db-mysql db1163 heartbeat -e "delete from heartbeat where file like 'db1184%';"
  • Restart puppet on both hosts:
sudo cumin 'db1184* or db1163*' 'run-puppet-agent -e "primary switchover T416480"'
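The two `show slave status\G` calls in the switch step produce long output. A sketch of a filter that reduces it to the three fields most relevant after a switchover follows; `summarize_slave_status` is a hypothetical helper. It assumes the standard `\G` layout with `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields. What each host should show depends on the topology (for example, whether the primary itself replicates from another datacenter), so the helper only reports and does not judge.

```shell
# Hypothetical helper: summarize "show slave status\G" output read from stdin.
summarize_slave_status() {
    awk -F': ' '
        { gsub(/^[ \t]+/, "", $1) }            # drop the left padding of the field name
        $1 == "Master_Host"       { host = $2 }
        $1 == "Slave_IO_Running"  { io = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END {
            if (io == "" && sql == "") {
                print "not replicating (empty status)"
                exit
            }
            printf "master=%s io=%s sql=%s\n", host, io, sql
        }'
}

# Example call (not executed here):
# sudo db-mysql db1184 -e 'show slave status\G' | summarize_slave_status
```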

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1163
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1184
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1184 set-candidate-master --section s1 true
sudo dbctl instance db1163 set-candidate-master --section s1 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1163 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1184 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's1';"
  • Depool db1184 for maintenance.
sudo dbctl instance db1184 depool
sudo dbctl config commit -m "Depool db1184 T416480"
  • Change db1184 weight to mimic the previous weight of db1163:
sudo dbctl instance db1184 edit
  • Update/resolve this ticket.
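The query-killer step pipes `curl` through `base64 -d` because Gitiles returns file contents base64-encoded when `?format=TEXT` is used. Without `pipefail`, a failed download would feed empty input straight into the database client. One possible safeguard, sketched here with the hypothetical helper `decode_gitiles`, is to decode into a file first and refuse to continue if the result is empty.

```shell
# Hypothetical helper: decode base64 from stdin into the file named by $1.
# Fails if decoding fails or if the decoded file is empty.
decode_gitiles() {
    base64 -d > "$1" && [ -s "$1" ]
}

# Example call (not executed here); <URL> stands for one of the Gitiles
# URLs from the checklist:
# curl -sS '<URL>' | decode_gitiles /tmp/events.sql \
#     && sudo db-mysql db1163 < /tmp/events.sql
```

Decoding to a file also leaves a copy of the exact SQL that was applied, which can be inspected before or after the run.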

Event Timeline

Change #1236753 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/1236753

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2026-02-05T09:01:45Z] <marostegui@cumin1003> dbctl commit (dc=all): 'Set db1163 with weight 0 T416480', diff saved to https://phabricator.wikimedia.org/P88701 and previous config saved to /var/cache/conftool/dbconfig/20260205-090145-marostegui.json

Change #1236753 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/1236753

Mentioned in SAL (#wikimedia-operations) [2026-02-05T09:02:09Z] <marostegui@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T416480

Mentioned in SAL (#wikimedia-operations) [2026-02-05T09:02:44Z] <marostegui> Starting s1 eqiad failover from db1184 to db1163 - T416480

Mentioned in SAL (#wikimedia-operations) [2026-02-05T09:06:24Z] <marostegui@cumin1003> dbctl commit (dc=all): 'Promote db1163 to s1 primary T416480', diff saved to https://phabricator.wikimedia.org/P88702 and previous config saved to /var/cache/conftool/dbconfig/20260205-090623-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2026-02-05T09:07:03Z] <marostegui@cumin1003> dbctl commit (dc=all): 'Depool db1184 T416480', diff saved to https://phabricator.wikimedia.org/P88703 and previous config saved to /var/cache/conftool/dbconfig/20260205-090702-marostegui.json

Marostegui updated the task description.

Leaving db1184 depooled for schema change.
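The SAL timestamps in the timeline give a rough measure of how long the switchover took. As a sketch, the interval between the `!log` entry (09:02:44Z) and the dbctl promotion commit (09:06:24Z) can be computed as below. `elapsed_seconds` is a hypothetical helper and relies on GNU `date -d`. The result is the gap between two log entries, not a measured read-only time.

```shell
# Hypothetical helper: seconds between two ISO 8601 UTC timestamps.
# Assumption: GNU date, which accepts the "T" separator and "Z" suffix.
elapsed_seconds() (
    start=$(date -u -d "$1" +%s) || exit 1
    end=$(date -u -d "$2" +%s) || exit 1
    echo $(( end - start ))
)

elapsed_seconds 2026-02-05T09:02:44Z 2026-02-05T09:06:24Z   # prints 220 (3 min 40 s)
```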