
Switchover x1 codfw master (db2096 -> db2115)
Closed, Resolved (Public)

Description

When: Anytime, DC switch only.

Checklist:

NEW primary: db2115
OLD primary: db2096

sudo pt-config-diff --defaults-file /root/.my.cnf h=db2096.codfw.wmnet h=db2115.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "DC switchover x1 T316522" 'A:db-section-x1'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2115 set-weight 0
sudo dbctl config commit -m "Set db2115 with weight 0 T316522"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --replicating-master --read-only-master --timeout=25 --only-slave-move db2096 db2115
  • Disable puppet on both nodes
sudo cumin 'db2096* or db2115*' 'disable-puppet "primary switchover T316522"'
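
The prep commands above differ between switchovers only in the two hostnames, the section and the task ID. A minimal sketch of a dry-run helper that prints them for review before anything is run (print_prep_commands is a hypothetical name, not an existing tool; the commands are copied from this checklist, with --read-only-master given once):

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any existing tooling.
# Prints the failover-prep commands from this checklist for review.
# It only prints; it never executes any of them.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "DC switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --replicating-master --read-only-master --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

Usage: print_prep_commands db2096 db2115 x1 T316522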

Failover:

  • Log the failover:
!log Starting x1 codfw failover from db2096 to db2115 - T316522
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2096 db2115
echo "===== db2096 (OLD)"; sudo db-mysql db2096 -e 'show slave status\G'
echo "===== db2115 (NEW)"; sudo db-mysql db2115 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section x1 set-master db2115
sudo dbctl config commit -m "Promote db2115 to x1 codfw primary T316522"
  • Restart puppet on both hosts:
sudo cumin 'db2096* or db2115*' 'run-puppet-agent -e "primary switchover T316522"'
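
The two show slave status\G calls above are read by eye. A minimal sketch of one way to check that output automatically, assuming the standard Slave_IO_Running and Slave_SQL_Running fields of SHOW SLAVE STATUS (replication_healthy is a hypothetical helper, not an existing tool):

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of any existing tooling.
# Reads `show slave status\G` output on stdin and succeeds only if both
# replication threads report "Yes". Empty input (no replication
# configured on the host) is treated as a failure.
replication_healthy() {
    local status io sql
    status="$(cat)"
    # The trailing colon in each pattern keeps Slave_SQL_Running_State
    # from matching the Slave_SQL_Running check.
    io="$(printf '%s\n' "$status" | awk -F': ' '/^ *Slave_IO_Running:/ {print $2}')"
    sql="$(printf '%s\n' "$status" | awk -F': ' '/^ *Slave_SQL_Running:/ {print $2}')"
    [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]
}
```

Usage: sudo db-mysql db2096 -e 'show slave status\G' | replication_healthy && echo "db2096 replicating"

Whether a given host is expected to be replicating at this point depends on the topology, so a failure here is a prompt to look at the full output, not a verdict.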

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2115 heartbeat -e "delete from heartbeat where file like 'db2096%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2115
events_coredb_slave.sql on the new slave db2096
  • Update the candidate primary in dbctl and the orchestrator notes
sudo dbctl instance db2096 set-candidate-master --section x1 true
sudo dbctl instance db2115 set-candidate-master --section x1 false
(dborch1001): sudo orchestrator-client -c untag -i db2115 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2096 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 'x1';"
  • (If needed): Depool db2096 for maintenance.
sudo dbctl instance db2096 depool
sudo dbctl config commit -m "Depool db2096 T316522"
  • Change db2096 weight to match the previous weight of db2115:
sudo dbctl instance db2096 edit
  • Apply outstanding schema changes to db2096 (if any) -> To be tracked on their own tasks.
  • Update/resolve this ticket.

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui added a project: DBA.
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui renamed this task from Switchover x1 codfw master to Switchover x1 codfw master (db2096 -> db2115). Aug 30 2022, 6:41 AM

Change 827864 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2115 to x1 codfw master

https://gerrit.wikimedia.org/r/827864

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:30:29Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 10 hosts with reason: DC switchover x1 T316522

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:30:48Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 10 hosts with reason: DC switchover x1 T316522

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:31:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2115 with weight 0 T316522', diff saved to https://phabricator.wikimedia.org/P33658 and previous config saved to /var/cache/conftool/dbconfig/20220830-083103-root.json

Change 827864 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2115 to x1 codfw master

https://gerrit.wikimedia.org/r/827864

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:36:07Z] <marostegui> Starting x1 codfw failover from db2096 to db2115 - T316522

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:36:55Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2115 to x1 codfw primary T316522', diff saved to https://phabricator.wikimedia.org/P33659 and previous config saved to /var/cache/conftool/dbconfig/20220830-083654-root.json

Mentioned in SAL (#wikimedia-operations) [2022-08-30T08:38:45Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2096 T316522', diff saved to https://phabricator.wikimedia.org/P33660 and previous config saved to /var/cache/conftool/dbconfig/20220830-083845-root.json

Marostegui updated the task description.

All done