
Switchover s7 master (db2121 -> db2118)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist

Checklist:

NEW primary: db2118
OLD primary: db2121

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2121.codfw.wmnet h=db2118.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s7 T328000" 'A:db-section-s7'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2118 set-weight 0
sudo dbctl config commit -m "Set db2118 with weight 0 T328000"
  • Topology changes: move all replicas under the NEW primary:
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move db2121 db2118
  • Disable puppet on both nodes
sudo cumin 'db2121* or db2118*' 'disable-puppet "primary switchover T328000"'
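
The prep commands above repeat the same four values: old primary, new primary, section, and task ID. As a minimal sketch, assuming a hypothetical helper named `render_prep` (not part of the DBA tooling), the following only prints the commands from this checklist with those values substituted, so they can be reviewed before being run by hand. It executes nothing.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not execute) the failover prep commands
# from the checklist above, with the four recurring values filled in.
# Arguments: OLD primary, NEW primary, section, task ID.
render_prep() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc: variables expand, quotes are emitted literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --replicating-master --read-only-master --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Values taken from this task.
render_prep db2121 db2118 s7 T328000
```

Printing rather than executing keeps the operator in control of each step, which matches how the checklist is meant to be followed.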

Failover:

  • Log the failover:
!log Starting s7 codfw failover from db2121 to db2118 - T328000
  • Switch primaries:
sudo db-switchover --replicating-master --read-only-master --skip-slave-move db2121 db2118
echo "===== db2121 (OLD)"; sudo db-mysql db2121 -e 'show slave status\\G'
echo "===== db2118 (NEW)"; sudo db-mysql db2118 -e 'show slave status\\G'
  • Promote NEW primary in dbctl
sudo dbctl --scope codfw section s7 set-master db2118
sudo dbctl config commit -m "Promote db2118 to s7 primary T328000"
  • Restart puppet on both hosts:
sudo cumin 'db2121* or db2118*' 'run-puppet-agent -e "primary switchover T328000"'
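
The two `show slave status` commands in the "Switch primaries" step print many fields. As a rough sketch, the filter below reads that output on stdin and succeeds only if both `Slave_IO_Running` and `Slave_SQL_Running` are `Yes`. The helper name `replication_ok` is hypothetical, and treating these two fields as the check is an assumption rather than part of the documented procedure.

```shell
#!/bin/sh
# Hypothetical helper: reads `show slave status` (vertical format) on stdin
# and exits 0 only when both replication threads report "Yes".
replication_ok() {
    awk '
        $1 == "Slave_IO_Running:"  { io  = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Sample input standing in for real command output.
printf '  Slave_IO_Running: Yes\n  Slave_SQL_Running: Yes\n' \
    | replication_ok && echo "replication threads running"
```

In practice the output of either `db-mysql` command from the checklist would be piped into `replication_ok` in place of the sample input. Empty output (no replication configured) makes the helper fail, so it is only meaningful on a host that is expected to replicate.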

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2118 heartbeat -e "delete from heartbeat where file like 'db2121%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2118
events_coredb_slave.sql on the new slave db2121
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2121 set-candidate-master --section s7 true
sudo dbctl instance db2118 set-candidate-master --section s7 false
(dborch1001): sudo orchestrator-client -c untag -i db2118 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2121 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's7';"
  • (If needed): Depool db2121 for maintenance.
sudo dbctl instance db2121 depool
sudo dbctl config commit -m "Depool db2121 T328000"
  • Change db2121 weight to mimic the previous weight of db2118:
sudo dbctl instance db2121 edit
  • Update/resolve this ticket.
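
For the heartbeat cleanup step in the list above, it can help to see how many rows match before deleting them. A minimal sketch, assuming a hypothetical helper named `heartbeat_count_sql`, builds a `select count(*)` with the same `where` clause as the delete in the checklist:

```shell
#!/bin/sh
# Hypothetical helper: prints a count query that uses the same WHERE clause
# as the heartbeat delete in the checklist. Argument: OLD primary hostname.
heartbeat_count_sql() {
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$1"
}

heartbeat_count_sql db2121
```

The printed statement could then be passed to the same client invocation the checklist uses, for example `sudo db-mysql db2118 heartbeat -e "$(heartbeat_count_sql db2121)"`.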

Event Timeline

Change 883516 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2118 to s7 master

https://gerrit.wikimedia.org/r/883516

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:41:01Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 30 hosts with reason: Primary switchover s7 T328000

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:41:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set db2118 with weight 0 T328000', diff saved to https://phabricator.wikimedia.org/P43376 and previous config saved to /var/cache/conftool/dbconfig/20230126-084112-root.json

Mentioned in SAL (#wikimedia-operations) [2023-01-26T08:41:22Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 30 hosts with reason: Primary switchover s7 T328000

Change 883516 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2118 to s7 master

https://gerrit.wikimedia.org/r/883516

Mentioned in SAL (#wikimedia-operations) [2023-01-26T09:02:00Z] <marostegui> Starting s7 codfw failover from db2121 to db2118 - T328000

Mentioned in SAL (#wikimedia-operations) [2023-01-26T09:02:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote db2118 to s7 primary T328000', diff saved to https://phabricator.wikimedia.org/P43380 and previous config saved to /var/cache/conftool/dbconfig/20230126-090212-root.json

Mentioned in SAL (#wikimedia-operations) [2023-01-26T09:03:03Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2121 T328000', diff saved to https://phabricator.wikimedia.org/P43382 and previous config saved to /var/cache/conftool/dbconfig/20230126-090302-root.json

Marostegui updated the task description.

Done