
Switchover s5 master (db2113 -> db2123)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist

Checklist:

NEW primary: db2123
OLD primary: db2113

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2113.codfw.wmnet h=db2123.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T327611" 'A:db-section-s5'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2123 set-weight 0
sudo dbctl config commit -m "Set db2123 with weight 0 T327611"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2113 db2123
  • Disable puppet on both nodes
sudo cumin 'db2113* or db2123*' 'disable-puppet "primary switchover T327611"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
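
The prep steps above can be sketched as a dry-run helper that only prints the commands for a given host pair, so they can be reviewed before anything is executed. This is a minimal sketch: `prep_commands` is a hypothetical function name, not part of dbctl, cumin, or db-switchover, and the manual gerrit merge step is left out on purpose.

```shell
# Hypothetical dry-run helper: prints the prep commands, runs nothing.
# Arguments: OLD primary, NEW primary, section, task ID.
prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc: variables expand, quote characters are printed as-is.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Print the prep sequence for this task.
prep_commands db2113 db2123 s5 T327611
```

Printing instead of executing keeps the OLD/NEW argument order visible in one place, which is the detail most easily swapped by mistake.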

Failover:

  • Log the failover:
!log Starting s5 codfw failover from db2113 to db2123 - T327611
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2113 db2123
echo "===== db2113 (OLD)"; sudo db-mysql db2113 -e 'show slave status\G'
echo "===== db2123 (NEW)"; sudo db-mysql db2123 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s5 set-master db2123
sudo dbctl config commit -m "Promote db2123 to s5 primary T327611"
  • Restart puppet on both hosts:
sudo cumin 'db2113* or db2123*' 'run-puppet-agent -e "primary switchover T327611"'
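
The two `show slave status\G` checks above are read by eye. A small filter can make the same check scriptable. The sketch below assumes the expected state is that both replication threads report `Yes` on the host being checked, which may not hold for every topology. `replication_ok` is a hypothetical helper, and the sample text is illustrative, not output captured from db2113 or db2123.

```shell
# Hypothetical helper: reads `show slave status\G` text on stdin and
# succeeds only if both Slave_IO_Running and Slave_SQL_Running are "Yes".
# Empty input (no replication configured) is treated as a failure.
replication_ok() {
    awk -F': *' '
        $1 ~ /^ *Slave_IO_Running$/  { io = $2 }
        $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Illustrative input only; real output has many more fields.
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'
printf '%s\n' "$sample" | replication_ok && echo "replication threads running"
```

Against a live host this would be used as `sudo db-mysql db2123 -e 'show slave status\G' | replication_ok`. The field names are anchored at the end so that `Slave_SQL_Running_State` is not mistaken for `Slave_SQL_Running`.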

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2123 heartbeat -e "delete from heartbeat where file like 'db2113%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2123
events_coredb_slave.sql on the new slave db2113
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2113 set-candidate-master --section s5 true
sudo dbctl instance db2123 set-candidate-master --section s5 false
(dborch1001): sudo orchestrator-client -c untag -i db2123 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2113 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's5';"
  • (If needed): Depool db2113 for maintenance.
sudo dbctl instance db2113 depool
sudo dbctl config commit -m "Depool db2113 T327611"
  • Change the weight of db2113 to match the previous weight of db2123:
sudo dbctl instance db2113 edit
  • Apply outstanding schema changes to db2113 (if any)
  • Update/resolve this ticket.
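
The heartbeat cleanup step in the list above deletes rows by a host-name prefix. If that statement is ever built from a variable, an empty value would turn the pattern into `like '%'` and match every row. A minimal guard, assuming host names contain only lowercase letters and digits (`heartbeat_cleanup_sql` is a hypothetical helper, not an existing tool):

```shell
# Hypothetical helper: prints the heartbeat cleanup statement for the OLD
# primary, refusing empty names or names containing anything other than
# lowercase letters and digits (so no stray % or quote reaches the SQL).
heartbeat_cleanup_sql() {
    old="$1"
    case "$old" in
        ''|*[!a-z0-9]*)
            echo "refusing to build statement for host '$old'" >&2
            return 1
            ;;
    esac
    # %% prints a literal % for the LIKE prefix match.
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}

heartbeat_cleanup_sql db2113
```

The printed statement can then be passed to the same command as in the checklist, for example `sudo db-mysql db2123 heartbeat -e "$(heartbeat_cleanup_sql db2113)"`.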

Event Timeline

Change 882254 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2123 to s5 master

https://gerrit.wikimedia.org/r/882254

Ladsgroup changed the task status from Open to In Progress. (Jan 23 2023, 4:32 AM)
Ladsgroup claimed this task.
Ladsgroup triaged this task as Medium priority.
Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".
Ladsgroup moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-01-23T04:32:53Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T327611

Mentioned in SAL (#wikimedia-operations) [2023-01-23T04:33:10Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T327611

Mentioned in SAL (#wikimedia-operations) [2023-01-23T04:33:25Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db2123 with weight 0 T327611', diff saved to https://phabricator.wikimedia.org/P43208 and previous config saved to /var/cache/conftool/dbconfig/20230123-043324-ladsgroup.json

Change 882254 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db2123 to s5 master

https://gerrit.wikimedia.org/r/882254

Mentioned in SAL (#wikimedia-operations) [2023-01-23T04:57:08Z] <Amir1> Starting s5 codfw failover from db2113 to db2123 - T327611

Mentioned in SAL (#wikimedia-operations) [2023-01-23T04:57:41Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db2123 to s5 primary T327611', diff saved to https://phabricator.wikimedia.org/P43209 and previous config saved to /var/cache/conftool/dbconfig/20230123-045740-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-01-23T04:59:40Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db2113 T327611', diff saved to https://phabricator.wikimedia.org/P43210 and previous config saved to /var/cache/conftool/dbconfig/20230123-045939-ladsgroup.json

Ladsgroup updated the task description.