Switchover s3 master (db2205 -> db2209)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

Checklist:

NEW primary: db2209
OLD primary: db2205

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2205.codfw.wmnet h=db2209.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T371455" 'A:db-section-s3'
  • Set NEW primary with weight 0
sudo dbctl instance db2209 set-weight 0
sudo dbctl config commit -m "Set db2209 with weight 0 T371455"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2209 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2209 from API/vslow/dump T371455"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2205 db2209
  • Disable puppet on both nodes
sudo cumin 'db2205* or db2209*' 'disable-puppet "primary switchover T371455"'
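
The non-interactive prep commands above vary only in the two host names, the section, and the task ID. A minimal sketch, assuming a hypothetical helper named `print_prep_commands` (not part of dbctl, cumin, or any other tooling used here), that prints those commands for review instead of running them:

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not existing tooling): prints the
# non-interactive failover-prep commands from this checklist with the host
# names, section and task ID filled in. It only echoes text; nothing is
# executed against any host.
print_prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Values taken from this task.
print_prep_commands db2205 db2209 s3 T371455
```

The interactive `dbctl instance ... edit` step is left out on purpose, since it depends on what groups the host is currently in.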

Failover:

  • Log the failover:
!log Starting s3 codfw failover from db2205 to db2209 - T371455
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2205 db2209
echo "===== db2205 (OLD)"; sudo db-mysql db2205 -e 'show slave status\G'
echo "===== db2209 (NEW)"; sudo db-mysql db2209 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s3 set-master db2209
sudo dbctl config commit -m "Promote db2209 to s3 primary T371455"
  • Restart puppet on both hosts:
sudo cumin 'db2205* or db2209*' 'run-puppet-agent -e "primary switchover T371455"'
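
The two `show slave status\G` commands in the "Switch primaries" step print many fields, of which only a few matter for a quick check of who replicates from whom. A minimal sketch for reducing that output to the replication source and the two thread states; the function name `replication_summary` is hypothetical, the field names are the standard ones from MariaDB's `SHOW SLAVE STATUS`, and the sample input below is illustrative rather than captured from this switchover:

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not existing tooling): reads
# `show slave status\G` output on stdin and prints only the replication
# source and the state of the IO and SQL threads.
replication_summary() {
    awk -F': ' '
        /^ *Master_Host:/       { host = $2 }
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { printf "master=%s io=%s sql=%s\n", host, io, sql }
    '
}

# Illustrative input only; real output has many more fields.
replication_summary <<'EOF'
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db2209.codfw.wmnet
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
EOF
# prints: master=db2209.codfw.wmnet io=Yes sql=Yes
```

In practice it could be fed directly, for example `sudo db-mysql db2205 -e 'show slave status\G' | replication_summary`. After the switch one would expect db2205 to report db2209 as its `Master_Host`; the full output should still be read if anything looks off.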

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2209 heartbeat -e "delete from heartbeat where file like 'db2205%';"
  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql db2209
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql db2205
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2205 set-candidate-master --section s3 true
sudo dbctl instance db2209 set-candidate-master --section s3 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i db2209 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i db2205 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's3';"
  • (If needed): Depool db2205 for maintenance.
sudo dbctl instance db2205 depool
sudo dbctl config commit -m "Depool db2205 T371455"
  • Change db2205 weight to mimic the previous weight of db2209:
sudo dbctl instance db2205 edit
  • Update/resolve this ticket.
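
The two `curl` commands in the query killer step pipe through `base64 -d` because Gitiles returns file contents base64-encoded when `?format=TEXT` is requested. A minimal sketch, assuming one prefers to decode into a local file and read it before feeding it to `db-mysql`; the helper name `decode_gitiles` and the output path are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not existing tooling): decodes
# base64 text from stdin into the file given as $1, and fails if decoding
# fails or the result is empty (for example when the download failed).
decode_gitiles() {
    base64 -d > "$1" || return 1
    [ -s "$1" ]
}

# Possible usage, with the URL from the checklist held in $url:
#   curl -sS "$url" | decode_gitiles /tmp/events_coredb_master.sql \
#       && less /tmp/events_coredb_master.sql
#   sudo db-mysql db2209 < /tmp/events_coredb_master.sql
```

The one-line pipe in the checklist does the same decoding; the sketch only adds a stop between download and apply.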

Event Timeline

Change #1058294 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2209 to s3 master

https://gerrit.wikimedia.org/r/1058294

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

Let's wait until the backups are done so the backup source can catch up with the master.

Mentioned in SAL (#wikimedia-operations) [2024-07-31T07:16:31Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 24 hosts with reason: Primary switchover s3 T371455

Mentioned in SAL (#wikimedia-operations) [2024-07-31T07:16:45Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2209 with weight 0 T371455', diff saved to https://phabricator.wikimedia.org/P67125 and previous config saved to /var/cache/conftool/dbconfig/20240731-071645-root.json

Mentioned in SAL (#wikimedia-operations) [2024-07-31T07:16:50Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 24 hosts with reason: Primary switchover s3 T371455

Main 300, vslow moved to the other host.

Change #1058294 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2209 to s3 master

https://gerrit.wikimedia.org/r/1058294

Mentioned in SAL (#wikimedia-operations) [2024-07-31T08:16:33Z] <marostegui> Starting s3 codfw failover from db2205 to db2209 - T371455

Mentioned in SAL (#wikimedia-operations) [2024-07-31T08:18:02Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2205 T371455', diff saved to https://phabricator.wikimedia.org/P67132 and previous config saved to /var/cache/conftool/dbconfig/20240731-081801-root.json

Marostegui updated the task description.

This is done.