
Switchover s3 master (db2105 -> db2127)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

Checklist:

NEW primary: db2127
OLD primary: db2105

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2105.codfw.wmnet h=db2127.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T327999" 'A:db-section-s3'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2127 set-weight 0
sudo dbctl config commit -m "Set db2127 with weight 0 T327999"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2105 db2127
  • Disable puppet on both nodes
sudo cumin 'db2105* or db2127*' 'disable-puppet "primary switchover T327999"'
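
The prep steps above follow a fixed pattern in which only the host names, section, and task ID change. A minimal sketch of a dry-run helper that prints them for review before anything is executed. The function name `print_prep_steps` and its argument order are hypothetical; the commands are copied from this checklist and are not otherwise verified against the tools.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the prep commands from the checklist
# for a given switchover. It executes nothing.
# Arguments (assumed order): OLD primary, NEW primary, section, task ID.
print_prep_steps() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quotes are printed literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Print the commands for this task's switchover.
print_prep_steps db2105 db2127 s3 T327999
```

Printing rather than running keeps the operator in control of each step, which matches how the checklist is meant to be worked through by hand.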

Failover:

  • Log the failover:
!log Starting s3 codfw failover from db2105 to db2127 - T327999
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2105 db2127
echo "===== db2105 (OLD)"; sudo db-mysql db2105 -e 'show slave status\\G'
echo "===== db2127 (NEW)"; sudo db-mysql db2127 -e 'show slave status\\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s3 set-master db2127
sudo dbctl config commit -m "Promote db2127 to s3 primary T327999"
  • Restart puppet on both hosts:
sudo cumin 'db2105* or db2127*' 'run-puppet-agent -e "primary switchover T327999"'
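
The two `show slave status` commands above are a manual check of replication after the switch. A minimal sketch of how that check could be automated, assuming the vertical output format with the standard MariaDB fields `Slave_IO_Running` and `Slave_SQL_Running`. The function name `replication_running` is hypothetical and is not part of any tool used in this checklist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads vertical `show slave status` output on stdin and
# succeeds only if both replication threads report "Yes".
replication_running() {
    awk -F': ' '
        # The trailing colon keeps Slave_SQL_Running_State from matching.
        /^[ \t]*Slave_IO_Running:/  { io  = $2 }
        /^[ \t]*Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

The output of the checklist's own `db-mysql` status command could be piped into this function. Empty input counts as a failure, so a host with no replication configured is not reported as healthy.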

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2127 heartbeat -e "delete from heartbeat where file like 'db2105%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2127
events_coredb_slave.sql on the new slave db2105
  • Update the candidate primary in dbctl and the orchestrator notes
sudo dbctl instance db2105 set-candidate-master --section s3 true
sudo dbctl instance db2127 set-candidate-master --section s3 false
(dborch1001): sudo orchestrator-client -c untag -i db2127 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2105 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's3';"
  • (If needed): Depool db2105 for maintenance.
sudo dbctl instance db2105 depool
sudo dbctl config commit -m "Depool db2105 T327999"
  • Change db2105's weight to match the previous weight of db2127:
sudo dbctl instance db2105 edit
  • Update/resolve this ticket.
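
The heartbeat cleanup above embeds the old primary's name in a `LIKE` pattern, so a typo in the host name would silently delete the wrong rows or none at all. A minimal sketch of a helper that builds the statement and rejects unexpected names. The function name `heartbeat_cleanup_sql` and the assumed host-name shape (`db` followed by digits only) are illustrative and are not taken from any existing tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the heartbeat cleanup statement from the checklist
# for a given old primary. It only prints SQL and never connects to a database.
heartbeat_cleanup_sql() {
    local old="$1"
    # Assumed naming convention: "db" followed by digits only.
    case "$old" in
        db*[!0-9]*) echo "unexpected host name: ${old}" >&2; return 1 ;;
        db[0-9]*)   ;;
        *)          echo "unexpected host name: ${old}" >&2; return 1 ;;
    esac
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}

# Print the statement for this task's old primary.
heartbeat_cleanup_sql db2105
```

The printed statement can be reviewed and then passed to the `db-mysql` command shown in the checklist.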

Event Timeline

Change 883515 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2127 to s3 master

https://gerrit.wikimedia.org/r/883515

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:24:25Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 23 hosts with reason: Primary switchover s3 T327999

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:24:32Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2127 with weight 0 T327999', diff saved to https://phabricator.wikimedia.org/P43370 and previous config saved to /var/cache/conftool/dbconfig/20230126-082432-root.json

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:24:52Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 23 hosts with reason: Primary switchover s3 T327999

Change 883515 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2127 to s3 master

https://gerrit.wikimedia.org/r/883515

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:34:36Z] <marostegui> Starting s3 codfw failover from db2105 to db2127 - T327999

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:35:00Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2127 to s3 primary T327999', diff saved to https://phabricator.wikimedia.org/P43372 and previous config saved to /var/cache/conftool/dbconfig/20230126-083459-root.json

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:35:44Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2105 T327999', diff saved to https://phabricator.wikimedia.org/P43373 and previous config saved to /var/cache/conftool/dbconfig/20230126-083543-root.json

Marostegui updated the task description.

This is done.