
Switchover s2 master (db2207 -> db2204)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist

Checklist:

NEW primary: db2204
OLD primary: db2207

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2207.codfw.wmnet h=db2204.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s2 T369130" 'A:db-section-s2'
  • Set NEW primary with weight 0
sudo dbctl instance db2204 set-weight 0
sudo dbctl config commit -m "Set db2204 with weight 0 T369130"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2204 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2204 from API/vslow/dump T369130"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2207 db2204
  • Disable puppet on both nodes
sudo cumin 'db2207* or db2204*' 'disable-puppet "primary switchover T369130"'
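
The prep commands above repeat the same host names, section and task ID. A minimal sketch of printing them for review before running anything, assuming hypothetical variable names (OLD, NEW, SECTION, TASK) and a hypothetical helper prep_commands that are not part of the original checklist. It only echoes the non-interactive commands listed above and executes nothing:

```shell
#!/usr/bin/env bash
# Hypothetical variables; the values are copied from this task.
OLD=db2207
NEW=db2204
SECTION=s2
TASK=T369130

# Print (do not run) the non-interactive prep commands from the checklist.
# The interactive "dbctl instance ... edit" step is intentionally left out.
prep_commands() {
  printf '%s\n' \
    "sudo cookbook sre.hosts.downtime --hours 1 -r \"Primary switchover ${SECTION} ${TASK}\" 'A:db-section-${SECTION}'" \
    "sudo dbctl instance ${NEW} set-weight 0" \
    "sudo dbctl config commit -m \"Set ${NEW} with weight 0 ${TASK}\"" \
    "sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${OLD} ${NEW}" \
    "sudo cumin '${OLD}* or ${NEW}*' 'disable-puppet \"primary switchover ${TASK}\"'"
}

prep_commands
```

Printing rather than executing keeps each step a deliberate copy-and-paste, which matches how the checklist is meant to be followed.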

Failover:

  • Log the failover:
!log Starting s2 codfw failover from db2207 to db2204 - T369130
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2207 db2204
echo "===== db2207 (OLD)"; sudo db-mysql db2207 -e 'show slave status\G'
echo "===== db2204 (NEW)"; sudo db-mysql db2204 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s2 set-master db2204
sudo dbctl config commit -m "Promote db2204 to s2 primary T369130"
  • Re-enable and run puppet on both hosts:
sudo cumin 'db2207* or db2204*' 'run-puppet-agent -e "primary switchover T369130"'
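
The two 'show slave status\G' commands in the switch step are read by eye. A minimal sketch of checking that output mechanically, assuming a hypothetical helper replication_ok that is not part of the original checklist. It only tests that Slave_IO_Running and Slave_SQL_Running both report Yes; what counts as healthy for each host still depends on the topology. The sample input is hand-written for an offline check, not captured from a real host:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read 'show slave status\G' output on stdin and
# succeed only if both replication threads report "Yes".
replication_ok() {
  awk -F': ' '
    $1 ~ /^ *Slave_IO_Running$/  { io  = $2 }
    $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}

# Hand-written sample input; not real output from db2207 or db2204.
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: (sample)'

if printf '%s\n' "$sample" | replication_ok; then
  echo "both replication threads report Yes"
else
  echo "replication is NOT fully running"
fi
```

On a cumin host the same function could be fed from the checklist command, for example: sudo db-mysql db2207 -e 'show slave status\G' | replication_ok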

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2204 heartbeat -e "delete from heartbeat where file like 'db2207%';"
  • Change the events for the query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2204
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2207
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2207 set-candidate-master --section s2 true
sudo dbctl instance db2204 set-candidate-master --section s2 false
(dborch1001): sudo orchestrator-client -c untag -i db2204 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2207 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's2';"
  • (If needed): Depool db2207 for maintenance.
sudo dbctl instance db2207 depool
sudo dbctl config commit -m "Depool db2207 T369130"
  • Change the weight of db2207 to match the previous weight of db2204:
sudo dbctl instance db2207 edit
  • Update/resolve this ticket.
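
The query-killer step above pipes through base64 -d because Gitiles returns file contents base64-encoded when ?format=TEXT is requested. A minimal sketch, assuming a hypothetical helper events_url that is not part of the original checklist, of building those URLs and seeing the decode step work offline on a hand-encoded statement:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the Gitiles URL used in the checklist
# for a file under dbtools/ in operations/software.
events_url() {
  printf '%s\n' "https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/${1}?format=TEXT"
}

events_url events_coredb_master.sql

# Offline demonstration of the decode step on a hand-encoded one-line statement.
printf '%s' 'c2VsZWN0IDE7Cg==' | base64 -d
```

To review the SQL before applying it, the decoded output can be sent to a pager instead of db-mysql, for example: curl -sS "$(events_url events_coredb_slave.sql)" | base64 -d | less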

Event Timeline

Change #1051502 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2204 to s2 master

https://gerrit.wikimedia.org/r/1051502

Mentioned in SAL (#wikimedia-operations) [2024-07-03T05:06:40Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s2 T369130

Mentioned in SAL (#wikimedia-operations) [2024-07-03T05:06:48Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2204 with weight 0 T369130', diff saved to https://phabricator.wikimedia.org/P65693 and previous config saved to /var/cache/conftool/dbconfig/20240703-050647-root.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T05:07:04Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s2 T369130

Change #1051502 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2204 to s2 master

https://gerrit.wikimedia.org/r/1051502

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-07-03T05:20:05Z] <marostegui> Starting s2 codfw failover from db2207 to db2204 - T369130

Mentioned in SAL (#wikimedia-operations) [2024-07-03T05:20:30Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2204 to s2 primary T369130', diff saved to https://phabricator.wikimedia.org/P65694 and previous config saved to /var/cache/conftool/dbconfig/20240703-052029-root.json

Mentioned in SAL (#wikimedia-operations) [2024-07-03T05:21:18Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2207 T369130', diff saved to https://phabricator.wikimedia.org/P65696 and previous config saved to /var/cache/conftool/dbconfig/20240703-052118-root.json

Marostegui updated the task description.

Done