Switchover s8 master (db2161 -> db2165)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s8.dblist

Checklist:

NEW primary: db2165
OLD primary: db2161

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2161.codfw.wmnet h=db2165.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s8 T365339" 'A:db-section-s8'
  • Set NEW primary with weight 0
sudo dbctl instance db2165 set-weight 0
sudo dbctl config commit -m "Set db2165 with weight 0 T365339"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db2165 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db2165 from API/vslow/dump T365339"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2161 db2165
  • Disable puppet on both nodes
sudo cumin 'db2161* or db2165*' 'disable-puppet "primary switchover T365339"'
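
The non-interactive prep commands above differ between switchovers only in the host names, section and task ID. A minimal sketch, assuming a hypothetical helper named `print_prep_commands` (not part of any existing tooling), that prints those commands for review without running anything. The interactive `dbctl instance ... edit` step is left out on purpose.

```shell
# Hypothetical helper: prints (does not run) the prep commands from the
# checklist above for a given OLD primary, NEW primary, section and task ID.
print_prep_commands() {
  old="$1"; new="$2"; section="$3"; task="$4"
  # Unquoted heredoc: variables expand, quote characters stay literal.
  cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $section $task" 'A:db-section-$section'
sudo dbctl instance $new set-weight 0
sudo dbctl config commit -m "Set $new with weight 0 $task"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move $old $new
sudo cumin '$old* or $new*' 'disable-puppet "primary switchover $task"'
EOF
}
```

Usage: `print_prep_commands db2161 db2165 s8 T365339` prints the five commands with the values used in this task.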

Failover:

  • Log the failover:
!log Starting s8 codfw failover from db2161 to db2165 - T365339
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2161 db2165
echo "===== db2161 (OLD)"; sudo db-mysql db2161 -e 'show slave status\G'
echo "===== db2165 (NEW)"; sudo db-mysql db2165 -e 'show slave status\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s8 set-master db2165
sudo dbctl config commit -m "Promote db2165 to s8 primary T365339"
  • Restart puppet on both hosts:
sudo cumin 'db2161* or db2165*' 'run-puppet-agent -e "primary switchover T365339"'
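
The two `show slave status\G` checks above produce long output in which only a few fields matter for confirming the new topology. A minimal sketch, assuming a hypothetical helper named `replication_summary` that reads that output on stdin and reports the `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields. After the switch one would expect db2161 (OLD) to report db2165 as its master; what db2165 reports depends on whether it replicates from another datacenter.

```shell
# Hypothetical helper: condense `show slave status\G` output (read from stdin)
# into a single line. Prints "no replication configured" when the output
# contains no non-empty Master_Host field.
replication_summary() {
  awk -F': ' '
    { gsub(/^[ \t]+/, "", $1) }            # strip the left padding of the field name
    $1 == "Master_Host"       { host = $2 }
    $1 == "Slave_IO_Running"  { io = $2 }
    $1 == "Slave_SQL_Running" { sql = $2 }
    END {
      if (host == "") print "no replication configured"
      else printf "master=%s io=%s sql=%s\n", host, io, sql
    }'
}
```

Usage: `sudo db-mysql db2161 -e 'show slave status\G' | replication_summary`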

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2165 heartbeat -e "delete from heartbeat where file like 'db2161%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2165
events_coredb_slave.sql on the new slave db2161
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2161 set-candidate-master --section s8 true
sudo dbctl instance db2165 set-candidate-master --section s8 false
(dborch1001): sudo orchestrator-client -c untag -i db2165 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2161 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's8';"
  • (If needed): Depool db2161 for maintenance.
sudo dbctl instance db2161 depool
sudo dbctl config commit -m "Depool db2161 T365339"
  • Change db2161's weight to match the weight db2165 had before the switchover:
sudo dbctl instance db2161 edit
  • Apply outstanding schema changes to db2161 (if any)
  • Update/resolve this ticket.
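
The heartbeat clean-up above deletes rows with a `like 'db2161%'` pattern, so an empty or mistyped host name would widen the match. A minimal sketch, assuming a hypothetical helper named `heartbeat_cleanup_sql`, that builds the same statement but refuses host names that are empty or contain anything other than lowercase letters and digits.

```shell
# Hypothetical helper: print the heartbeat clean-up statement for the OLD
# primary, rejecting names that would make the LIKE pattern too broad.
heartbeat_cleanup_sql() {
  old="$1"
  case "$old" in
    ''|*[!a-z0-9]*)
      echo "refusing unsafe host name: '$old'" >&2
      return 1
      ;;
  esac
  # %% emits a literal % for the LIKE wildcard.
  printf "delete from heartbeat where file like '%s%%';\n" "$old"
}
```

Usage: `sudo db-mysql db2165 heartbeat -e "$(heartbeat_cleanup_sql db2161)"`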

Event Timeline

Change #1033389 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2165 to s8 master

https://gerrit.wikimedia.org/r/1033389

Mentioned in SAL (#wikimedia-operations) [2024-05-20T05:35:07Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s8 T365339

Mentioned in SAL (#wikimedia-operations) [2024-05-20T05:35:24Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db2165 with weight 0 T365339', diff saved to https://phabricator.wikimedia.org/P62670 and previous config saved to /var/cache/conftool/dbconfig/20240520-053523-root.json

Mentioned in SAL (#wikimedia-operations) [2024-05-20T05:35:36Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s8 T365339

Change #1033389 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2165 to s8 master

https://gerrit.wikimedia.org/r/1033389

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-05-20T05:57:48Z] <marostegui> Starting s8 codfw failover from db2161 to db2165 - T365339

Mentioned in SAL (#wikimedia-operations) [2024-05-20T05:58:14Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db2165 to s8 primary T365339', diff saved to https://phabricator.wikimedia.org/P62671 and previous config saved to /var/cache/conftool/dbconfig/20240520-055812-root.json

Mentioned in SAL (#wikimedia-operations) [2024-05-20T05:59:09Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db2161 T365339', diff saved to https://phabricator.wikimedia.org/P62672 and previous config saved to /var/cache/conftool/dbconfig/20240520-055908-root.json

The pending schema change on the old master will be tracked in T364299.