
Switchover s5 master (db2213 -> db2192)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist

Checklist:

NEW primary: db2192
OLD primary: db2213

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2213.codfw.wmnet h=db2192.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T385148" 'A:db-section-s5'
  • Set NEW primary with weight 0
sudo dbctl instance db2192 set-weight 0
sudo dbctl config commit -m "Set db2192 with weight 0 T385148"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2192 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2192 from API/vslow/dump T385148"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db2213 db2192
  • Disable puppet on both nodes
sudo cumin 'db2213* or db2192*' 'disable-puppet "primary switchover T385148"'
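The prep commands above vary between switchovers only in the host names, section, and task ID. A minimal dry-run sketch follows, assuming hypothetical variable names (`OLD`, `NEW`, `SECTION`, `TASK`) and a hypothetical function `print_prep_commands`, none of which are part of the WMF tooling. It only prints the commands for review and runs nothing. The interactive `dbctl instance ... edit` step is left out on purpose.

```shell
#!/bin/sh
# Hypothetical helper (not part of any WMF tool): prints, but does not run,
# the non-interactive failover-prep commands from this checklist.
# Defaults mirror the values used in this task.
OLD="${OLD:-db2213}"
NEW="${NEW:-db2192}"
SECTION="${SECTION:-s5}"
TASK="${TASK:-T385148}"

print_prep_commands() {
    # Unquoted heredoc delimiter: variables expand, quotes stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${SECTION} ${TASK}" 'A:db-section-${SECTION}'
sudo dbctl instance ${NEW} set-weight 0
sudo dbctl config commit -m "Set ${NEW} with weight 0 ${TASK}"
sudo db-switchover --timeout=25 --only-slave-move ${OLD} ${NEW}
sudo cumin '${OLD}* or ${NEW}*' 'disable-puppet "primary switchover ${TASK}"'
EOF
}

print_prep_commands
```

Printing rather than executing keeps a human in the loop: each line can be compared against the checklist before it is pasted into a shell.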

Failover:

  • Log the failover:
!log Starting s5 codfw failover from db2213 to db2192 - T385148
  • Set section read-only:
sudo dbctl --scope eqiad section s5 ro "Maintenance until 06:15 UTC - T385148"
sudo dbctl --scope codfw section s5 ro "Maintenance until 06:15 UTC - T385148"
sudo dbctl config commit -m "Set s5 codfw as read-only for maintenance - T385148"
  • Check s5 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db2213 db2192
echo "===== db2213 (OLD)"; sudo db-mysql db2213 -e 'show slave status\G'
echo "===== db2192 (NEW)"; sudo db-mysql db2192 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section s5 set-master db2192
sudo dbctl --scope eqiad section s5 rw
sudo dbctl --scope codfw section s5 rw
sudo dbctl config commit -m "Promote db2192 to s5 primary and set section read-write T385148"
  • Clean up heartbeat table(s).
sudo db-mysql db2192 heartbeat -e "delete from heartbeat where file like 'db2213%';"
  • Restart puppet on both hosts:
sudo cumin 'db2213* or db2192*' 'run-puppet-agent -e "primary switchover T385148"'
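The two `show slave status\G` commands in the "Switch primaries" step produce long output that has to be read by eye. A minimal sketch of a filter that summarizes it follows. The function name `check_replication` is hypothetical. It relies only on the standard `Master_Host`, `Slave_IO_Running`, and `Slave_SQL_Running` fields and makes no assumption about which host should replicate from which.

```shell
#!/bin/sh
# Hypothetical helper: reads 'show slave status\G' output on stdin,
# prints a one-line summary, and returns non-zero unless both
# replication threads report "Yes".
check_replication() {
    awk '
        $1 == "Master_Host:"       { host = $2 }
        $1 == "Slave_IO_Running:"  { io = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END {
            printf "master=%s io=%s sql=%s\n", host, io, sql
            # Exit status 0 only when both threads are running.
            exit !(io == "Yes" && sql == "Yes")
        }
    '
}

# Usage (assumes the db-mysql wrapper from the checklist above):
#   sudo db-mysql db2213 -e 'show slave status\G' | check_replication
```

If a host has no replication configured, the output is empty, so the summary shows blank fields and the function returns non-zero. Whether that is expected depends on the topology and should be judged by the operator.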

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2192
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2213
sudo dbctl instance db2213 set-candidate-master --section s5 true
sudo dbctl instance db2192 set-candidate-master --section s5 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2192 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2213 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's5';"
  • (If needed): Depool db2213 for maintenance.
sudo dbctl instance db2213 depool
sudo dbctl config commit -m "Depool db2213 T385148"
  • Change db2213 weight to mimic the previous weight of db2192:
sudo dbctl instance db2213 edit
  • Update/resolve this ticket.

Event Timeline

Change #1115328 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2192 to s5 master

https://gerrit.wikimedia.org/r/1115328

Change #1115329 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s5-master alias

https://gerrit.wikimedia.org/r/1115329

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2025-02-06T06:59:26Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2192 with weight 0 T385148', diff saved to https://phabricator.wikimedia.org/P73276 and previous config saved to /var/cache/conftool/dbconfig/20250206-065925-root.json

Mentioned in SAL (#wikimedia-operations) [2025-02-06T06:59:39Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s5 T385148

Change #1115328 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2192 to s5 master

https://gerrit.wikimedia.org/r/1115328

Mentioned in SAL (#wikimedia-operations) [2025-02-06T07:18:24Z] <marostegui> Starting s5 codfw failover from db2213 to db2192 - T385148

Mentioned in SAL (#wikimedia-operations) [2025-02-06T07:18:37Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s5 codfw as read-only for maintenance - T385148', diff saved to https://phabricator.wikimedia.org/P73277 and previous config saved to /var/cache/conftool/dbconfig/20250206-071836-root.json

Mentioned in SAL (#wikimedia-operations) [2025-02-06T07:19:03Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2192 to s5 primary and set section read-write T385148', diff saved to https://phabricator.wikimedia.org/P73278 and previous config saved to /var/cache/conftool/dbconfig/20250206-071902-root.json

Change #1115329 merged by Marostegui:

[operations/dns@master] wmnet: Update s5-master alias

https://gerrit.wikimedia.org/r/1115329

Mentioned in SAL (#wikimedia-operations) [2025-02-06T07:20:21Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2213 T385148', diff saved to https://phabricator.wikimedia.org/P73279 and previous config saved to /var/cache/conftool/dbconfig/20250206-072020-marostegui.json

Marostegui updated the task description.