Switchover s7 master (db2218 -> db2220)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist

Checklist:

NEW primary: db2220
OLD primary: db2218

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2218.codfw.wmnet h=db2220.codfw.wmnet
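
The outcome of this comparison decides whether the switchover should go ahead, so it can help to turn the exit status into an explicit message. The following is only a sketch: `report_config_diff` is a hypothetical helper name, and it assumes the diff command exits 0 when both configurations match and non-zero otherwise, which is how the pt-config-diff documentation describes its exit status.

```shell
# Hypothetical helper (not part of the checklist tooling): run the given
# diff command and summarise its result.
# Assumption: the command exits 0 when the two configurations match and
# non-zero when they differ or when the command itself fails.
report_config_diff() {
    if "$@"; then
        echo "OK: no configuration differences"
    else
        status=$?
        echo "REVIEW: differences or error (exit status ${status})"
        return 1
    fi
}

# Usage on a cumin host, with the command copied from the checklist:
# report_config_diff sudo pt-config-diff --defaults-file /root/.my.cnf \
#     h=db2218.codfw.wmnet h=db2220.codfw.wmnet
```

Any reported difference still has to be reviewed by hand; the helper only makes the result harder to overlook.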

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s7 T371462" 'A:db-section-s7'
  • Set NEW primary with weight 0
sudo dbctl instance db2220 set-weight 0
sudo dbctl config commit -m "Set db2220 with weight 0 T371462"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2220 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2220 from API/vslow/dump T371462"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2218 db2220
  • Disable puppet on both nodes
sudo cumin 'db2218* or db2220*' 'disable-puppet "primary switchover T371462"'
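
The prep commands above repeat the same two host names, the section and the task ID. As an illustration, they could be generated from four parameters and reviewed before anything is run. This is a sketch, not the DBA tooling: `print_prep_commands` is a hypothetical name, it only prints text, and it leaves out the interactive `dbctl instance ... edit` step, which needs a human decision.

```shell
# Hypothetical dry-run helper: prints the non-interactive prep commands
# from the checklist for a given host pair. Nothing is executed.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc: variables expand, quotes are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Values taken from this task.
print_prep_commands db2218 db2220 s7 T371462
```

Printing instead of executing keeps the operator in control of each step while avoiding copy-and-paste mistakes in host names.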

Failover:

  • Log the failover:
!log Starting s7 codfw failover from db2218 to db2220 - T371462
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2218 db2220
echo "===== db2218 (OLD)"; sudo db-mysql db2218 -e 'show slave status\G'
echo "===== db2220 (NEW)"; sudo db-mysql db2220 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s7 set-master db2220
sudo dbctl config commit -m "Promote db2220 to s7 primary T371462"
  • Restart puppet on both hosts:
sudo cumin 'db2218* or db2220*' 'run-puppet-agent -e "primary switchover T371462"'
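
The two `show slave status\G` calls in the switch step produce long output in which only a few fields matter for a quick check. A minimal sketch for condensing that output follows; `summarize_slave_status` is a hypothetical helper, and it relies only on the `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields of MariaDB's `SHOW SLAVE STATUS`. Which values are correct depends on the topology; after this procedure one would normally expect db2218 to report db2220 as its master.

```shell
# Hypothetical helper: read `show slave status\G` output on stdin and
# print the replication source and the state of both replication threads.
summarize_slave_status() {
    awk -F': ' '
        # Field names are right-aligned, so allow leading spaces.
        # The trailing $ keeps Slave_SQL_Running_State from matching.
        $1 ~ /^ *Master_Host$/       { host = $2 }
        $1 ~ /^ *Slave_IO_Running$/  { io = $2 }
        $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
        END { printf "master=%s io=%s sql=%s\n", host, io, sql }
    '
}

# Usage, with the commands from the checklist:
# sudo db-mysql db2218 -e 'show slave status\G' | summarize_slave_status
# sudo db-mysql db2220 -e 'show slave status\G' | summarize_slave_status
```

Empty values in the summary mean the host returned no replication status at all, which is worth checking against the full output.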

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2220 heartbeat -e "delete from heartbeat where file like 'db2218%';"
  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2220
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2218
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2218 set-candidate-master --section s7 true
sudo dbctl instance db2220 set-candidate-master --section s7 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2220 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2218 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's7';"
  • (If needed): Depool db2218 for maintenance.
sudo dbctl instance db2218 depool
sudo dbctl config commit -m "Depool db2218 T371462"
  • Change db2218 weight to match the previous weight of db2220:
sudo dbctl instance db2218 edit
  • Update/resolve this ticket.
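
The query killer step pipes `curl` through `base64 -d` because Gitiles returns file contents base64-encoded when `?format=TEXT` is requested. If the download fails, the pipeline as written would feed empty input to `db-mysql` without any error. One possible safeguard, sketched here under the assumption that a temporary file is acceptable, is to decode to a file first and refuse to continue when it is empty. `decode_to_file` and the `/tmp` path are hypothetical.

```shell
# Hypothetical helper: decode base64 from stdin into the given file and
# fail when the decoded result is empty (for example after a failed download).
decode_to_file() {
    dest="$1"
    base64 -d > "$dest" || return 1
    if [ ! -s "$dest" ]; then
        echo "decoded file is empty: $dest" >&2
        return 1
    fi
}

# Usage, with the URL and target host from the checklist:
# curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' \
#     | decode_to_file /tmp/events_coredb_master.sql \
#     && sudo db-mysql db2220 < /tmp/events_coredb_master.sql
```

Decoding to a file also leaves a copy of the SQL that can be read before it is applied.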

Event Timeline

Change #1058571 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2220 to s7 master

https://gerrit.wikimedia.org/r/1058571

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-07-31T09:55:53Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462

Mentioned in SAL (#wikimedia-operations) [2024-07-31T09:56:10Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2220 with weight 0 T371462', diff saved to https://phabricator.wikimedia.org/P67148 and previous config saved to /var/cache/conftool/dbconfig/20240731-095609-root.json

Mentioned in SAL (#wikimedia-operations) [2024-07-31T09:56:17Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462

Mentioned in SAL (#wikimedia-operations) [2024-07-31T09:56:41Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Remove db2220 from API/vslow/dump T371462', diff saved to https://phabricator.wikimedia.org/P67149 and previous config saved to /var/cache/conftool/dbconfig/20240731-095640-root.json

Change #1058571 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2220 to s7 master

https://gerrit.wikimedia.org/r/1058571

Mentioned in SAL (#wikimedia-operations) [2024-07-31T10:14:38Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462

Mentioned in SAL (#wikimedia-operations) [2024-07-31T10:15:01Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T371462

Mentioned in SAL (#wikimedia-operations) [2024-07-31T10:33:23Z] <marostegui> Starting s7 codfw failover from db2218 to db2220 - T371462

Mentioned in SAL (#wikimedia-operations) [2024-07-31T10:35:13Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2220 to s7 primary T371462', diff saved to https://phabricator.wikimedia.org/P67150 and previous config saved to /var/cache/conftool/dbconfig/20240731-103513-root.json

Mentioned in SAL (#wikimedia-operations) [2024-07-31T10:37:04Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2218 T371462', diff saved to https://phabricator.wikimedia.org/P67151 and previous config saved to /var/cache/conftool/dbconfig/20240731-103704-marostegui.json

Marostegui updated the task description.

Done