
Switchover s2 master (db2107 -> db2104)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s2.dblist

Checklist:

NEW primary: db2104
OLD primary: db2107

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2107.codfw.wmnet h=db2104.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s2 T327609" 'A:db-section-s2'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2104 set-weight 0
sudo dbctl config commit -m "Set db2104 with weight 0 T327609"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2107 db2104
  • Disable puppet on both nodes
sudo cumin 'db2107* or db2104*' 'disable-puppet "primary switchover T327609"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
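
The prep commands above differ between switchovers only in the host names, section and task ID, so they can be printed from one template for review before anything is run. This is a minimal sketch, not part of any existing tooling: the function name `print_prep_commands` and its argument order are assumptions, and the command lines themselves are copied from this checklist. It only prints text and executes nothing. The gerrit merge step is not a command and is left out.

```shell
# Hypothetical helper: prints the failover-prep commands for review.
# Arguments (illustrative): OLD primary, NEW primary, section, task ID.
# Nothing is executed; the output is meant to be read and pasted by hand.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Example: print_prep_commands db2107 db2104 s2 T327609
```

Keeping OLD and NEW as named arguments makes it harder to swap the two hosts by accident in the `db-switchover` line, where argument order matters.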

Failover:

  • Log the failover:
!log Starting s2 codfw failover from db2107 to db2104 - T327609
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2107 db2104
echo "===== db2107 (OLD)"; sudo db-mysql db2107 -e 'show slave status\G'
echo "===== db2104 (NEW)"; sudo db-mysql db2104 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s2 set-master db2104
sudo dbctl config commit -m "Promote db2104 to s2 primary T327609"
  • Restart puppet on both hosts:
sudo cumin 'db2107* or db2104*' 'run-puppet-agent -e "primary switchover T327609"'
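
The two `show slave status\G` calls above are read by eye. A small filter can turn that check into a pass/fail result. This is a sketch under assumptions: `replication_healthy` is a hypothetical name, and it only looks at the `Slave_IO_Running` and `Slave_SQL_Running` fields of the standard `\G` output. It does not check which source a host replicates from, so that still has to be read from the full output.

```shell
# Hypothetical helper: reads `show slave status\G` output on stdin and
# returns 0 only when both replication threads report "Yes".
# Empty input (no replication configured) counts as not healthy.
replication_healthy() {
    awk -F': ' '
        $1 ~ /^[[:space:]]*Slave_IO_Running$/  { io  = $2 }
        $1 ~ /^[[:space:]]*Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Example (assumes db-mysql behaves as in the checklist above):
#   sudo db-mysql db2107 -e 'show slave status\G' | replication_healthy \
#       && echo "db2107: replication threads running"
```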

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2104 heartbeat -e "delete from heartbeat where file like 'db2107%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2104
events_coredb_slave.sql on the new slave db2107
  • Update candidate primary in dbctl and orchestrator notes
sudo dbctl instance db2107 set-candidate-master --section s2 true
sudo dbctl instance db2104 set-candidate-master --section s2 false
(dborch1001): sudo orchestrator-client -c untag -i db2104 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2107 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's2';"
  • (If needed): Depool db2107 for maintenance.
sudo dbctl instance db2107 depool
sudo dbctl config commit -m "Depool db2107 T327609"
  • Change db2107 weight to mimic the previous weight of db2104:
sudo dbctl instance db2107 edit
  • Apply outstanding schema changes to db2107 (if any)
  • Update/resolve this ticket.
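
The heartbeat cleanup above matches rows with `like 'db2107%'`. If that statement is ever built from a shell variable, an empty value would produce `like '%'` and match every row. The sketch below guards against that. The function name `heartbeat_cleanup_sql` and the host-name rule (starts with `db` plus a digit, lowercase letters and digits only) are assumptions for illustration, not an existing convention check.

```shell
# Hypothetical helper: prints the heartbeat cleanup statement for the OLD primary.
# Rejects empty names and names with characters outside a-z0-9, so the
# pattern can never degrade to "like '%'" or carry quote characters.
heartbeat_cleanup_sql() {
    local old="$1"
    case "$old" in
        ''|*[!a-z0-9]*)
            echo "refusing to build cleanup SQL for host name: '${old}'" >&2
            return 1 ;;
        db[0-9]*)
            printf "delete from heartbeat where file like '%s%%';\n" "$old" ;;
        *)
            echo "refusing to build cleanup SQL for host name: '${old}'" >&2
            return 1 ;;
    esac
}

# Example: sudo db-mysql db2104 heartbeat -e "$(heartbeat_cleanup_sql db2107)"
```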

Event Timeline

Change 882253 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/882253

Ladsgroup changed the task status from Open to In Progress. Jan 23 2023, 3:36 AM
Ladsgroup claimed this task.
Ladsgroup triaged this task as Medium priority.
Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".

Change 882253 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db2104 to s2 master

https://gerrit.wikimedia.org/r/882253

Mentioned in SAL (#wikimedia-operations) [2023-01-23T03:52:30Z] <Amir1> Starting s2 codfw failover from db2107 to db2104 - T327609

Mentioned in SAL (#wikimedia-operations) [2023-01-23T03:54:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db2107 T327609', diff saved to https://phabricator.wikimedia.org/P43207 and previous config saved to /var/cache/conftool/dbconfig/20230123-035458-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.