
Switchover s4 master (db2140 -> db2179)
Closed, Declined (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s4.dblist

Checklist:

NEW primary: db2179
OLD primary: db2140

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2140.codfw.wmnet h=db2179.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s4 T374088" 'A:db-section-s4'
  • Set NEW primary with weight 0
sudo dbctl instance db2179 set-weight 0
sudo dbctl config commit -m "Set db2179 with weight 0 T374088"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2179 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2179 from API/vslow/dump T374088"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2140 db2179
  • Disable puppet on both nodes
sudo cumin 'db2140* or db2179*' 'disable-puppet "primary switchover T374088"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
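
The prep commands above repeat the host names, section, and task ID several times, so a typo is easy to make. A minimal sketch, assuming a hypothetical helper named prep_commands (not part of dbctl, cumin, or db-switchover), that only prints the non-interactive prep commands for review and executes nothing. The interactive `dbctl instance ... edit` step and the gerrit merge are left out on purpose.

```sh
#!/bin/sh
# Dry-run sketch: prints the prep commands for review, runs none of them.
# prep_commands is a hypothetical helper name, not an existing tool.
# Arguments: OLD primary, NEW primary, section, task ID.
prep_commands() {
    old=$1 new=$2 section=$3 task=$4
    # Unquoted heredoc: variables expand, quotes and '*' stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Example with the values from this task.
prep_commands db2140 db2179 s4 T374088
```

The printed lines should match the checklist text; comparing them before running anything is the whole point of the sketch.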

Failover:

  • Log the failover:
!log Starting s4 codfw failover from db2140 to db2179 - T374088
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2140 db2179
echo "===== db2140 (OLD)"; sudo db-mysql db2140 -e 'show slave status\G'
echo "===== db2179 (NEW)"; sudo db-mysql db2179 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s4 set-master db2179
sudo dbctl config commit -m "Promote db2179 to s4 primary T374088"
  • Restart puppet on both hosts:
sudo cumin 'db2140* or db2179*' 'run-puppet-agent -e "primary switchover T374088"'
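
The two `show slave status\G` checks above produce long output, of which only a few fields matter for confirming the new topology. A minimal sketch, assuming a hypothetical filter named replication_summary that reads that output on stdin and prints the `Master_Host`, `Slave_IO_Running`, and `Slave_SQL_Running` fields on one line. The sample input is illustrative only, not captured from these hosts.

```sh
#!/bin/sh
# replication_summary is a hypothetical helper, not part of db-mysql.
# It reads `show slave status\G` text on stdin and prints three fields.
replication_summary() {
    awk -F': ' '
        /^ *Master_Host:/       { host = $2 }
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { printf "master=%s io=%s sql=%s\n", host, io, sql }
    '
}

# Illustrative input (made up): what OLD might report once it replicates from NEW.
replication_summary <<'EOF'
                  Master_Host: db2179.codfw.wmnet
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
EOF
# prints: master=db2179.codfw.wmnet io=Yes sql=Yes
```

In practice the filter would sit at the end of the existing command, for example `sudo db-mysql db2140 -e 'show slave status\G' | replication_summary`.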

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2179 heartbeat -e "delete from heartbeat where file like 'db2140%';"
  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2179
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2140
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2140 set-candidate-master --section s4 true
sudo dbctl instance db2179 set-candidate-master --section s4 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2179 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2140 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's4';"
  • (If needed): Depool db2140 for maintenance.
sudo dbctl instance db2140 depool
sudo dbctl config commit -m "Depool db2140 T374088"
  • Change db2140 weight to mimic the previous weight of db2179:
sudo dbctl instance db2140 edit
  • Apply outstanding schema changes to db2140 (if any)
  • Update/resolve this ticket.
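
On the query-killer step above: Gitiles serves file contents base64-encoded when `?format=TEXT` is requested, which is why the curl output goes through `base64 -d` before reaching `db-mysql`. A minimal round-trip sketch, using a made-up one-line SQL string as a stand-in for the real events file:

```sh
#!/bin/sh
# Stand-in for the Gitiles response: base64 of a tiny, made-up SQL file.
encoded=$(printf 'SELECT 1;\n' | base64)

# Decoding restores the original text; in the checklist this decoded
# stream is what gets piped into db-mysql.
printf '%s\n' "$encoded" | base64 -d
# prints: SELECT 1;
```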


Event Timeline

Change #1070873 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2179 to s4 master

https://gerrit.wikimedia.org/r/1070873

ABran-WMF changed the task status from Open to In Progress. (Sep 6 2024, 5:38 AM)
ABran-WMF claimed this task.
ABran-WMF triaged this task as Medium priority.
ABran-WMF moved this task from Triage to Ready on the DBA board.

When is the maintenance window? Can we do this and s5 on Monday? Fridays are scary for this kind of thing.

I was indeed aiming for Monday. These shouldn't need to be handled in the maintenance window, as they are on codfw, or am I mistaken?

Indeed, the window is not required. It is just that doing deploys and "risky" changes on Fridays can lead to pain during the weekend.

ABran-WMF moved this task from Ready to Done on the DBA board.

Change #1070873 abandoned by Ladsgroup:

[operations/puppet@production] mariadb: Promote db2179 to s4 master

https://gerrit.wikimedia.org/r/1070873