
Switchover s5 codfw master (db2123 -> db2113)
Closed, Resolved (Public)

Description

Checklist:

NEW primary: db2113
OLD primary: db2123

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2123.codfw.wmnet h=db2113.codfw.wmnet
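A minimal sketch of how the result of the comparison above could be made explicit. It assumes that pt-config-diff exits with status 0 when it finds no differences and non-zero otherwise, which is how its documentation describes it. The helper name check_config_diff is hypothetical and not part of any existing tooling.

```shell
# Hypothetical helper: run a config-diff command and summarise its exit status.
# Assumption: the wrapped command exits 0 when there are no differences.
check_config_diff() {
    # "$@" is the full command line to run, passed through unchanged.
    if "$@"; then
        echo "OK: no configuration differences"
    else
        echo "REVIEW: differences (or an error) reported"
        return 1
    fi
}

# Example invocation (commented out; needs access to both hosts):
# check_config_diff sudo pt-config-diff --defaults-file /root/.my.cnf \
#     h=db2123.codfw.wmnet h=db2113.codfw.wmnet
```

A non-zero status can also mean the tool failed to connect, so a REVIEW result still needs a look at the actual output.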

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T317735" 'A:db-section-s5'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2113 set-weight 0
sudo dbctl config commit -m "Set db2113 with weight 0 T317735"
  • Topology changes: move all replicas under NEW primary
sudo db-switchover --replicating-master --read-only-master --timeout=25 --only-slave-move db2123 db2113
  • Disable puppet on both nodes
sudo cumin 'db2123* or db2113*' 'disable-puppet "primary switchover T317735"'

Failover:

  • Log the failover:
!log Starting s5 codfw failover from db2123 to db2113 - T317735
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2123 db2113
echo "===== db2123 (OLD)"; sudo db-mysql db2123 -e 'show slave status\G'
echo "===== db2113 (NEW)"; sudo db-mysql db2113 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section s5 set-master db2113
sudo dbctl config commit -m "Promote db2113 to s5 codfw primary T317735"
  • Restart puppet on both hosts:
sudo cumin 'db2123* or db2113*' 'run-puppet-agent -e "primary switchover T317735"'
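The two 'show slave status\G' checks in the failover step are read by eye. As a rough sketch, the same output could be piped through a small filter that looks only at the Slave_IO_Running and Slave_SQL_Running fields. The function name replication_running is hypothetical, and whether both threads are expected to run on a given host depends on the topology at that point.

```shell
# Hypothetical helper: read 'show slave status\G' output on stdin and report
# whether both replication threads are running.
replication_running() {
    awk -F': *' '
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END {
            if (io == "Yes" && sql == "Yes") { print "replicating"; exit 0 }
            print "not replicating (IO=" io ", SQL=" sql ")"
            exit 1
        }'
}

# Example invocation (commented out; needs access to the host):
# sudo db-mysql db2123 -e 'show slave status\G' | replication_running
```

The patterns end with a colon so that the Slave_SQL_Running_State line is not mistaken for the thread status.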

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2113 heartbeat -e "delete from heartbeat where file like 'db2123%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2113
events_coredb_slave.sql on the new slave db2123
  • Update candidate primary in dbctl and orchestrator notes
sudo dbctl instance db2123 set-candidate-master --section s5 true
sudo dbctl instance db2113 set-candidate-master --section s5 false
(dborch1001): sudo orchestrator-client -c untag -i db2113 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2123 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's5';"
  • (If needed): Depool db2123 for maintenance.
sudo dbctl instance db2123 depool
sudo dbctl config commit -m "Depool db2123 T317735"
  • Change db2123's weight to match db2113's previous weight:
sudo dbctl instance db2123 edit
  • Update/resolve this ticket.
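
Before running the heartbeat delete in the clean-up list, it may help to preview how many rows the same filter matches. The sketch below only builds the SQL text; the function name heartbeat_preview_sql is hypothetical, and it assumes the same heartbeat table and file column that the delete statement uses.

```shell
# Hypothetical helper: print a SELECT that counts the rows the clean-up
# DELETE would remove for a given OLD primary hostname prefix.
heartbeat_preview_sql() {
    # Accept only lowercase letters and digits (e.g. db2123) so the value
    # can be embedded in the SQL string safely.
    case "$1" in
        ''|*[!a-z0-9]*) echo "invalid host prefix" >&2; return 1 ;;
    esac
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$1"
}

# Example invocation (commented out; needs access to the host):
# sudo db-mysql db2113 heartbeat -e "$(heartbeat_preview_sql db2123)"
```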

Event Timeline

Marostegui triaged this task as Medium priority. Sep 14 2022, 5:49 AM
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-09-14T05:51:22Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T317735

Mentioned in SAL (#wikimedia-operations) [2022-09-14T05:51:39Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T317735

Mentioned in SAL (#wikimedia-operations) [2022-09-14T05:51:56Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2113 with weight 0 T317735', diff saved to https://phabricator.wikimedia.org/P34687 and previous config saved to /var/cache/conftool/dbconfig/20220914-055156-marostegui.json

Change 832143 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2113 to s5 codfw master

https://gerrit.wikimedia.org/r/832143

Change 832143 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2113 to s5 codfw master

https://gerrit.wikimedia.org/r/832143

Mentioned in SAL (#wikimedia-operations) [2022-09-14T06:08:07Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2113 to s5 codfw primary T317735', diff saved to https://phabricator.wikimedia.org/P34689 and previous config saved to /var/cache/conftool/dbconfig/20220914-060807-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-09-14T06:09:14Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2123 T317735', diff saved to https://phabricator.wikimedia.org/P34690 and previous config saved to /var/cache/conftool/dbconfig/20220914-060913-root.json

Marostegui updated the task description.

Switchover done. I am going to upgrade the old master.