Switchover s5 master (db1230 -> db1210)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist

Checklist:

NEW primary: db1210
OLD primary: db1230

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1230.eqiad.wmnet h=db1210.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T399446" 'A:db-section-s5'
  • Set NEW primary with weight 0
sudo dbctl instance db1210 set-weight 0
sudo dbctl config commit -m "Set db1210 with weight 0 T399446"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1210 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1210 from API/vslow/dump T399446"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1230 db1210
  • Disable puppet on both nodes
sudo cumin 'db1230* or db1210*' 'disable-puppet "primary switchover T399446"'
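The prep commands above vary between switchovers only in the two hostnames, the section and the task ID. A minimal sketch, assuming a hypothetical helper named prep_commands (it is not part of any existing tooling), that prints those commands for review instead of running them. The interactive "dbctl instance ... edit" step is left out because it needs a human decision.

```shell
#!/bin/bash
# Sketch only: prints the failover-prep commands from the checklist for a
# given pair of hosts. Nothing is executed here; review the output and run
# each line by hand. prep_commands is a hypothetical name.
prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quotes and '*' stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

prep_commands db1230 db1210 s5 T399446
```

Printing rather than executing keeps a human in the loop between steps, which matters here because each dbctl commit should be checked before moving on.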

Failover:

  • Log the failover:
!log Starting s5 eqiad failover from db1230 to db1210 - T399446
  • Set section read-only:
sudo dbctl --scope eqiad section s5 ro "Maintenance until 06:15 UTC - T399446"
sudo dbctl --scope codfw section s5 ro "Maintenance until 06:15 UTC - T399446"
sudo dbctl config commit -m "Set s5 eqiad as read-only for maintenance - T399446"
  • Check s5 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1230 db1210
echo "===== db1230 (OLD)"; sudo db-mysql db1230 -e 'show slave status\G'
echo "===== db1210 (NEW)"; sudo db-mysql db1210 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s5 set-master db1210
sudo dbctl --scope eqiad section s5 rw
sudo dbctl --scope codfw section s5 rw
sudo dbctl config commit -m "Promote db1210 to s5 primary and set section read-write T399446"
  • Clean up heartbeat table(s).
sudo db-mysql db1210 heartbeat -e "delete from heartbeat where file like 'db1230%';"
  • Restart puppet on both hosts:
sudo cumin 'db1230* or db1210*' 'run-puppet-agent -e "primary switchover T399446"'
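The two "show slave status\G" calls in the switch step are read by eye. A minimal sketch of how that check could be scripted, assuming the old primary is expected to end up replicating from the new one. replication_running is a hypothetical helper; it only inspects the standard Slave_IO_Running and Slave_SQL_Running fields in text passed on stdin.

```shell
#!/bin/bash
# Sketch only: reads `show slave status\G` output on stdin and succeeds only
# if both the IO and SQL threads report "Yes". Empty input (no replication
# configured) fails the check. replication_running is a hypothetical name.
replication_running() {
    awk -F': *' '
        $1 ~ /^ *Slave_IO_Running$/  { io  = $2 }
        $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Demo on canned text, so the sketch runs without database access.
sample='            Slave_IO_Running: Yes
           Slave_SQL_Running: Yes'
if printf '%s\n' "$sample" | replication_running; then
    echo "replication threads running"
else
    echo "replication NOT running"
fi
```

Against a real host this would be used as, for example, sudo db-mysql db1230 -e 'show slave status\G' | replication_running. It does not check replication lag or which host the replica points at, so it supplements the manual review rather than replacing it.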

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db1210
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db1230
sudo authdns-update
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1230 set-candidate-master --section s5 true
sudo dbctl instance db1210 set-candidate-master --section s5 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db1210 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db1230 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's5';"
  • (If needed): Depool db1230 for maintenance.
sudo dbctl instance db1230 depool
sudo dbctl config commit -m "Depool db1230 T399446"
  • Change db1230 weight to mimic the previous weight of db1210 (main/api/vslow/dumps):
sudo dbctl instance db1230 edit
  • Update/resolve this ticket.
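The query-killer step pipes curl through base64 -d because Gitiles returns file contents base64-encoded when asked for ?format=TEXT. A minimal sketch of that pattern, assuming a hypothetical helper named events_url that builds the URL for either role; the demo round-trips a local string so it runs without network access.

```shell
#!/bin/bash
# Sketch only: builds the Gitiles URL used in the checklist for the events
# file of a given role ("master" or "slave"). events_url is a hypothetical name.
events_url() {
    local role="$1"
    printf 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_%s.sql?format=TEXT\n' "$role"
}

# Gitiles serves ?format=TEXT as base64, hence the decode in the checklist.
# Round trip on a local string to show the effect of `base64 -d`:
printf 'SELECT 1;\n' | base64 | base64 -d

# URLs for the new primary (master events) and old primary (slave events):
events_url master
events_url slave
```

With the helper, the checklist command would read curl -sS "$(events_url master)" | base64 -d | sudo db-mysql db1210, and likewise with slave for the old primary.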

Event Timeline

Change #1169049 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1210 to s5 master

https://gerrit.wikimedia.org/r/1169049

Change #1169050 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s5-master alias

https://gerrit.wikimedia.org/r/1169050

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-07-15T05:49:44Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s5 T399446

Mentioned in SAL (#wikimedia-operations) [2025-07-15T05:50:12Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1210 with weight 0 T399446', diff saved to https://phabricator.wikimedia.org/P79039 and previous config saved to /var/cache/conftool/dbconfig/20250715-055011-root.json

Change #1169049 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1210 to s5 master

https://gerrit.wikimedia.org/r/1169049

Mentioned in SAL (#wikimedia-operations) [2025-07-15T05:54:10Z] <marostegui> Starting s5 eqiad failover from db1230 to db1210 - T399446

Mentioned in SAL (#wikimedia-operations) [2025-07-15T06:01:15Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T399446', diff saved to https://phabricator.wikimedia.org/P79040 and previous config saved to /var/cache/conftool/dbconfig/20250715-060114-root.json

Mentioned in SAL (#wikimedia-operations) [2025-07-15T06:02:24Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1210 to s5 primary and set section read-write T399446', diff saved to https://phabricator.wikimedia.org/P79041 and previous config saved to /var/cache/conftool/dbconfig/20250715-060223-marostegui.json

Change #1169050 merged by Marostegui:

[operations/dns@master] wmnet: Update s5-master alias

https://gerrit.wikimedia.org/r/1169050

Mentioned in SAL (#wikimedia-operations) [2025-07-15T06:06:01Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1230 T399446', diff saved to https://phabricator.wikimedia.org/P79042 and previous config saved to /var/cache/conftool/dbconfig/20250715-060600-root.json

Marostegui updated the task description.

This was done, but because the primary wasn't patched for T397425 (Build 10.6.22 and 10.11.13 with mdev36934 patch), I got bitten by that bug and had to finish the switch manually.

Change #1169302 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1230: Disable notifications

https://gerrit.wikimedia.org/r/1169302

Change #1169302 merged by Marostegui:

[operations/puppet@production] db1230: Disable notifications

https://gerrit.wikimedia.org/r/1169302

Started cloning db1185.eqiad.wmnet to db1230.eqiad.wmnet - marostegui@cumin1002

Completed depool of db1185 - Depool db1185.eqiad.wmnet to then clone it to db1230.eqiad.wmnet - marostegui@cumin1002

Start pool of db1185 gradually with 4 steps - Pool db1185.eqiad.wmnet in after cloning - marostegui@cumin1002

Marostegui closed this task as Resolved. (Edited Jul 15 2025, 9:07 AM)

Host has been cloned, so I am following the rest of the steps at T398928

Completed pool of db1185 gradually with 4 steps - Pool db1185.eqiad.wmnet in after cloning - marostegui@cumin1002

Finished cloning db1185.eqiad.wmnet to db1230.eqiad.wmnet - marostegui@cumin1002