Switchover s1 master (db1184 -> db1163)
Closed, Resolved, Public

Description

When: Emergency for T399540: Upgrade masters to 10.6.22 and 10.11.13 .2 update

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s1.dblist

Checklist:

NEW primary: db1163
OLD primary: db1184

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1184.eqiad.wmnet h=db1163.eqiad.wmnet
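
The result of the comparison above can be read from the exit status as well as from the printed diff. A minimal sketch, assuming the behaviour documented for pt-config-diff (exit 0 when the configurations match, 1 when they differ); the helper name `interpret_config_diff` is hypothetical and not part of any WMF tooling:

```shell
# Hypothetical helper: turn a pt-config-diff exit status into a short verdict.
# Assumes 0 = no differences, 1 = differences found; anything else is treated
# as a failure of the tool itself (for example a connection error).
interpret_config_diff() {
    case "$1" in
        0) echo "no differences" ;;
        1) echo "differences found, review before switching" ;;
        *) echo "pt-config-diff failed (exit $1)" ;;
    esac
}

# Intended use on the host where the check is run (left commented out here):
# sudo pt-config-diff --defaults-file /root/.my.cnf h=db1184.eqiad.wmnet h=db1163.eqiad.wmnet
# interpret_config_diff "$?"
```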

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s1 T404326" 'A:db-section-s1'
  • Set NEW primary with weight 0
sudo dbctl instance db1163 set-weight 0
sudo dbctl config commit -m "Set db1163 with weight 0 T404326"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1163 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1163 from API/vslow/dump T404326"
  • Set section read-only:
sudo dbctl --scope eqiad section s1 ro "Emergency maintenance until 11:15 UTC - T404326"
sudo dbctl --scope codfw section s1 ro "Emergency maintenance until 11:15 UTC - T404326"
sudo dbctl config commit -m "Set s1 eqiad as read-only for maintenance - T404326"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1184 db1163
  • Disable puppet on both nodes
sudo cumin 'db1184* or db1163*' 'disable-puppet "primary switchover T404326"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
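
The prep steps above are the same for every switchover apart from the host names, section, and task ID. The following is a rough dry-run sketch that prints the non-interactive commands in checklist order so they can be reviewed before anything is run. The `run` and `prep_steps` names and the `DRY_RUN` switch are assumptions made for illustration; only the printed commands come from the checklist. The interactive `dbctl instance ... edit` step and the gerrit merge are left out.

```shell
# Values taken from this task.
OLD=db1184
NEW=db1163
SECTION=s1
TASK=T404326
REASON="Emergency maintenance until 11:15 UTC - $TASK"
DRY_RUN=1   # hypothetical switch: 1 prints the commands, 0 would execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"     # print only; the original argument quoting is not shown
    else
        "$@"
    fi
}

prep_steps() {
    run sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $SECTION $TASK" "A:db-section-$SECTION"
    run sudo dbctl instance "$NEW" set-weight 0
    run sudo dbctl config commit -m "Set $NEW with weight 0 $TASK"
    for dc in eqiad codfw; do
        run sudo dbctl --scope "$dc" section "$SECTION" ro "$REASON"
    done
    run sudo dbctl config commit -m "Set $SECTION eqiad as read-only for maintenance - $TASK"
    run sudo db-switchover --timeout=25 --only-slave-move "$OLD" "$NEW"
    run sudo cumin "$OLD* or $NEW*" "disable-puppet \"primary switchover $TASK\""
}

prep_steps
```

Because the dry run loses shell quoting, the printed lines are meant for review, not for copy and paste.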

Failover:

  • Log the failover:
!log Starting s1 eqiad failover from db1184 to db1163 - T404326
  • Check s1 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1184 db1163
echo "===== db1184 (OLD)"; sudo db-mysql db1184 -e 'show slave status\G'
echo "===== db1163 (NEW)"; sudo db-mysql db1163 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s1 set-master db1163
sudo dbctl --scope eqiad section s1 rw
sudo dbctl --scope codfw section s1 rw
sudo dbctl config commit -m "Promote db1163 to s1 primary and set section read-write T404326"
  • Clean up heartbeat table(s).
sudo db-mysql db1163 heartbeat -e "delete from heartbeat where file like 'db1184%';"
  • Restart puppet on both hosts:
sudo cumin 'db1184* or db1163*' 'run-puppet-agent -e "primary switchover T404326"'
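
The two `show slave status\G` calls in the switch step produce long output, of which only a few fields matter for a quick sanity check. A small sketch that condenses such output, assuming the standard MariaDB field names `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running`; the function name `replication_summary` is hypothetical. After the switch one would typically expect db1184 to report db1163 as its master. What db1163 itself reports depends on the cross-datacenter topology, which this task does not state.

```shell
# Hypothetical helper: summarise `show slave status\G` output read from stdin.
# Prints "not replicating" when no Master_Host is present (empty output).
replication_summary() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }          # drop the left padding used by \G output
        $1 == "Master_Host"       { host = $2 }
        $1 == "Slave_IO_Running"  { io = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END {
            if (host == "") print "not replicating"
            else printf "master=%s io=%s sql=%s\n", host, io, sql
        }'
}

# Intended use (left commented out here):
# sudo db-mysql db1184 -e 'show slave status\G' | replication_summary
```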

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1163
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1184
  • Merge DNS change: FIXME
sudo authdns-update
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1184 set-candidate-master --section s1 true
sudo dbctl instance db1163 set-candidate-master --section s1 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1163 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1184 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's1';"
  • (If needed): Depool db1184 for maintenance.
sudo dbctl instance db1184 depool
sudo dbctl config commit -m "Depool db1184 T404326"
  • Change db1184 weight to mimic the previous weight of db1163 (main/api/vslow/dumps):
sudo dbctl instance db1184 edit
  • Apply outstanding schema changes to db1184 (if any)
  • Update/resolve this ticket.
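
A note on the query killer step in the list above: Gitiles returns file contents base64-encoded when `?format=TEXT` is requested, which is why both curl commands pipe through `base64 -d` before the SQL reaches `db-mysql`. The sketch below wraps that pattern. The function names `gitiles_text_url` and `load_events` are hypothetical; the URL layout is copied from the two commands in the checklist.

```shell
# Build the Gitiles text URL for a file on the master branch of operations/software.
gitiles_text_url() {
    printf 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/%s?format=TEXT\n' "$1"
}

# Fetch a SQL file, decode the base64 body, and feed it to a host.
# usage: load_events <path-in-repo> <host>
load_events() {
    curl -sS "$(gitiles_text_url "$1")" | base64 -d | sudo db-mysql "$2"
}

# Intended use (left commented out here):
# load_events dbtools/events_coredb_master.sql db1163
# load_events dbtools/events_coredb_slave.sql db1184
```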

Event Timeline

Change #1187389 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/1187389

Change #1187390 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/1187390

Ladsgroup triaged this task as Unbreak Now! priority.
Ladsgroup updated the task description.
Ladsgroup moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2025-09-11T10:59:02Z] <ladsgroup@cumin1003> DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T404326

Mentioned in SAL (#wikimedia-operations) [2025-09-11T10:59:42Z] <ladsgroup@cumin1003> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s1 T404326

Mentioned in SAL (#wikimedia-operations) [2025-09-11T11:00:00Z] <ladsgroup@cumin1003> dbctl commit (dc=all): 'Set db1163 with weight 0 T404326', diff saved to https://phabricator.wikimedia.org/P83236 and previous config saved to /var/cache/conftool/dbconfig/20250911-105959-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2025-09-11T11:00:37Z] <ladsgroup@cumin1003> dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T404326', diff saved to https://phabricator.wikimedia.org/P83237 and previous config saved to /var/cache/conftool/dbconfig/20250911-110036-ladsgroup.json

Change #1187389 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1163 to s1 master

https://gerrit.wikimedia.org/r/1187389

Mentioned in SAL (#wikimedia-operations) [2025-09-11T11:07:33Z] <Amir1> Starting s1 eqiad failover from db1184 to db1163 - T404326

Mentioned in SAL (#wikimedia-operations) [2025-09-11T11:08:22Z] <ladsgroup@cumin1003> dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T404326', diff saved to https://phabricator.wikimedia.org/P83238 and previous config saved to /var/cache/conftool/dbconfig/20250911-110821-ladsgroup.json

Change #1187390 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/1187390

Mentioned in SAL (#wikimedia-operations) [2025-09-11T11:15:46Z] <ladsgroup@cumin1003> dbctl commit (dc=all): 'Depool db1184 T404326', diff saved to https://phabricator.wikimedia.org/P83239 and previous config saved to /var/cache/conftool/dbconfig/20250911-111545-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.
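
For reference, the approximate length of the read-only window can be worked out from the two dbctl commits logged in the timeline above (read-only at 11:00:37Z, read-write at 11:08:22Z). This is a rough sketch only: it assumes the commit timestamps bound the window and that both fall on the same UTC day. The helper name `to_seconds` is hypothetical.

```shell
# Convert HH:MM:SS to seconds since midnight. Leading zeros are stripped so
# that values such as "08" are not rejected as invalid octal numbers.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    h=${1#0}; m=${2#0}; s=${3#0}
    echo $(( h * 3600 + m * 60 + s ))
}

ro_commit=11:00:37   # 'Set s1 eqiad as read-only for maintenance' commit
rw_commit=11:08:22   # 'Promote db1163 to s1 primary and set section read-write' commit

elapsed=$(( $(to_seconds "$rw_commit") - $(to_seconds "$ro_commit") ))
printf 'read-only for about %dm%02ds\n' $(( elapsed / 60 )) $(( elapsed % 60 ))
```

With the timestamps above this gives 465 seconds, or about 7 minutes 45 seconds.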