Switchover s1 master (db2103 -> db2112)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s1.dblist

Checklist:

NEW primary: db2112
OLD primary: db2103

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2103.codfw.wmnet h=db2112.codfw.wmnet
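
The diff output is easy to skim past, so a small wrapper that turns the exit status into an explicit verdict may help. This is only a sketch: `config_diff_verdict` is a hypothetical helper name, and it assumes pt-config-diff follows its documented convention of exiting 0 when the configurations match and 1 when they differ.

```shell
# Hypothetical helper: run the given diff command and report a verdict on stderr.
# Assumption: exit status 0 = no differences, 1 = differences, anything else = error.
config_diff_verdict() {
    local rc=0
    "$@" || rc=$?
    case "$rc" in
        0) echo "no differences" >&2 ;;
        1) echo "differences found" >&2 ;;
        *) echo "diff tool error (exit $rc)" >&2 ;;
    esac
    return "$rc"
}
```

Usage would be to prefix the command above, for example: config_diff_verdict sudo pt-config-diff --defaults-file /root/.my.cnf h=db2103.codfw.wmnet h=db2112.codfw.wmnet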

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s1 T344666" 'A:db-section-s1'
  • Set the NEW primary's weight to 0 (and depool it from the API or vslow/dump groups if it is present there).
sudo dbctl instance db2112 set-weight 0
sudo dbctl config commit -m "Set db2112 with weight 0 T344666"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2103 db2112
  • Disable puppet on both nodes
sudo cumin 'db2103* or db2112*' 'disable-puppet "primary switchover T344666"'
  • Merge gerrit puppet change to promote NEW primary: https://gerrit.wikimedia.org/r/951093
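
The prep commands above differ between switchovers only in the host names, the section and the task ID. As a rough sketch, they can be templated so that a typo in one host name does not slip in. `print_prep_plan` is a hypothetical helper; it only prints the commands from this checklist with the values substituted and does not run anything.

```shell
# Hypothetical helper: print (not run) the failover prep commands from this
# checklist for a given old primary, new primary, section and task ID.
print_prep_plan() {
    if [ "$#" -ne 4 ]; then
        echo "usage: print_prep_plan OLD NEW SECTION TASK" >&2
        return 2
    fi
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables are expanded, quotes are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

For this task the call would be: print_prep_plan db2103 db2112 s1 T344666. The output is meant to be reviewed and pasted by hand, which keeps a human in the loop for each step.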

Failover:

  • Log the failover:
!log Starting s1 codfw failover from db2103 to db2112 - T344666
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2103 db2112
echo "===== db2103 (OLD)"; sudo db-mysql db2103 -e 'show slave status\G'
echo "===== db2112 (NEW)"; sudo db-mysql db2112 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s1 set-master db2112
sudo dbctl config commit -m "Promote db2112 to s1 primary T344666"
  • Restart puppet on both hosts:
sudo cumin 'db2103* or db2112*' 'run-puppet-agent -e "primary switchover T344666"'
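
The two `show slave status\G` dumps in the switch step are long, and usually only a few fields matter (for example, that db2103 now reports db2112 as its Master_Host). A minimal sketch of a filter for that output follows. `slave_status_field` is a hypothetical helper name, and it assumes the usual `\G` layout of one right-aligned `Field: value` pair per line.

```shell
# Hypothetical helper: read \G-formatted "show slave status" output on stdin
# and print the value of one field. Exits non-zero if the field is absent
# (for example when the host returns no replication status at all).
slave_status_field() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # drop the right-alignment padding
            prefix = key ":"
            if (index(line, prefix) == 1) {   # line starts with "Field:"
                value = substr(line, length(prefix) + 1)
                sub(/^[ \t]+/, "", value)
                print value
                found = 1
                exit
            }
        }
        END { exit(found ? 0 : 1) }
    '
}
```

Example: sudo db-mysql db2103 -e 'show slave status\G' | slave_status_field Master_Host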

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2112 heartbeat -e "delete from heartbeat where file like 'db2103%';"
  • Change events for query killer:
Apply events_coredb_master.sql on the new primary (db2112)
Apply events_coredb_slave.sql on the new replica (db2103)
  • Update the candidate primary in dbctl and orchestrator
sudo dbctl instance db2103 set-candidate-master --section s1 true
sudo dbctl instance db2112 set-candidate-master --section s1 false
(dborch1001): sudo orchestrator-client -c untag -i db2112 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2103 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's1';"
  • (If needed): Depool db2103 for maintenance.
sudo dbctl instance db2103 depool
sudo dbctl config commit -m "Depool db2103 T344666"
  • Change db2103's weight to match db2112's previous weight:
sudo dbctl instance db2103 edit
  • Apply outstanding schema changes to db2103 (if any)
  • Update/resolve this ticket.
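
The heartbeat cleanup is the one destructive statement in this list, so it may be worth generating it with a guard rather than typing it. The sketch below is an assumption-laden convenience, not part of the standard tooling: `heartbeat_cleanup_sql` is a hypothetical helper that refuses to build the statement if the old host name is empty, contains unexpected characters, or equals the new primary.

```shell
# Hypothetical helper: print the heartbeat cleanup statement for the OLD primary.
# Refuses to emit anything that could match the NEW primary's rows.
heartbeat_cleanup_sql() {
    local old="$1" new="$2"
    case "$old" in
        ''|*[!a-z0-9]*)
            echo "refusing: invalid old host name '$old'" >&2
            return 1
            ;;
    esac
    if [ "$old" = "$new" ]; then
        echo "refusing: old and new primary are the same host" >&2
        return 1
    fi
    # %% prints a literal % for the SQL LIKE pattern.
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}
```

Example: sudo db-mysql db2112 heartbeat -e "$(heartbeat_cleanup_sql db2103 db2112)"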

Event Timeline

Change 951093 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2112 to s1 master

https://gerrit.wikimedia.org/r/951093

Ladsgroup triaged this task as Medium priority.
Ladsgroup moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:26:55Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344666

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:27:20Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344666

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:28:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db2112 with weight 0 T344666', diff saved to https://phabricator.wikimedia.org/P50790 and previous config saved to /var/cache/conftool/dbconfig/20230822-062854-ladsgroup.json

Change 951093 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db2112 to s1 master

https://gerrit.wikimedia.org/r/951093

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:52:46Z] <Amir1> Starting s1 codfw failover from db2103 to db2112 - T344666

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:53:16Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db2112 to s1 primary T344666', diff saved to https://phabricator.wikimedia.org/P50799 and previous config saved to /var/cache/conftool/dbconfig/20230822-065316-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:55:18Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db2103 T344666', diff saved to https://phabricator.wikimedia.org/P50802 and previous config saved to /var/cache/conftool/dbconfig/20230822-065518-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".
Ladsgroup moved this task from In progress to Done on the DBA board.