Switchover s6 master (db1173 -> db1231)
Closed, Declined, Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist

Checklist:

NEW primary: db1231
OLD primary: db1173

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1173.eqiad.wmnet h=db1231.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s6 T364067" 'A:db-section-s6'
  • Set NEW primary with weight 0
sudo dbctl instance db1231 set-weight 0
sudo dbctl config commit -m "Set db1231 with weight 0 T364067"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1231 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1231 from API/vslow/dump T364067"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1173 db1231
  • Disable puppet on both nodes
sudo cumin 'db1173* or db1231*' 'disable-puppet "primary switchover T364067"'
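The prep commands above hard-code the host names, section and task ID. A minimal sketch of how they could be templated for reuse is shown below. The function name prep_commands and its argument order are hypothetical and not part of any Wikimedia tooling. It only prints the commands already listed in this checklist and never executes them. The interactive "dbctl instance ... edit" step is left out on purpose.

```shell
#!/usr/bin/env bash
# Sketch only: print (never run) the failover-prep commands from the checklist.
# prep_commands is a hypothetical helper name, not an existing tool.
# Arguments: OLD primary, NEW primary, section, task ID.
prep_commands() {
    if [ "$#" -ne 4 ]; then
        echo "usage: prep_commands OLD NEW SECTION TASK" >&2
        return 1
    fi
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc delimiter: variables expand, quotes are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

Calling prep_commands db1173 db1231 s6 T364067 prints the five non-interactive prep commands for this task, so they can be reviewed before being pasted into a terminal.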

Failover:

  • Log the failover:
!log Starting s6 eqiad failover from db1173 to db1231 - T364067
  • Set section read-only:
sudo dbctl --scope eqiad section s6 ro "Maintenance until 06:15 UTC - T364067"
sudo dbctl --scope codfw section s6 ro "Maintenance until 06:15 UTC - T364067"
sudo dbctl config commit -m "Set s6 eqiad as read-only for maintenance - T364067"
  • Check s6 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1173 db1231
echo "===== db1173 (OLD)"; sudo db-mysql db1173 -e 'show slave status\G'
echo "===== db1231 (NEW)"; sudo db-mysql db1231 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s6 set-master db1231
sudo dbctl --scope eqiad section s6 rw
sudo dbctl --scope codfw section s6 rw
sudo dbctl config commit -m "Promote db1231 to s6 primary and set section read-write T364067"
  • Restart puppet on both hosts:
sudo cumin 'db1173* or db1231*' 'run-puppet-agent -e "primary switchover T364067"'
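The two "show slave status\G" commands in the switch step produce long output that has to be read by eye. The sketch below shows one way to pull a single field out of that output. The function name slave_status_field is hypothetical. It assumes the usual "Name: value" layout of \G output and field names such as Master_Host and Slave_SQL_Running as MariaDB prints them.

```shell
#!/usr/bin/env bash
# Sketch only: extract one field from 'show slave status\G' output on stdin.
# slave_status_field is a hypothetical helper name, not an existing tool.
slave_status_field() {
    local field="$1"
    awk -v f="$field" '
        {
            sep = index($0, ": ")            # first "Name: value" separator
            if (sep == 0) next               # row banners and blank lines
            key = substr($0, 1, sep - 1)
            gsub(/^[ \t]+/, "", key)         # \G output right-aligns the names
            if (key == f) { print substr($0, sep + 2); exit }
        }'
}

# Possible usage (assumes db-mysql as used in the checklist above):
#   sudo db-mysql db1173 -e 'show slave status\G' | slave_status_field Master_Host
```

After a successful switch the OLD primary would be expected to report the NEW primary as its Master_Host, and the NEW primary would be expected to print nothing if it no longer replicates from anywhere. Both expectations are assumptions about the usual outcome, not something this task confirms.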

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1231 heartbeat -e "delete from heartbeat where file like 'db1173%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1231
events_coredb_slave.sql on the new slave db1173
sudo dbctl instance db1173 set-candidate-master --section s6 true
sudo dbctl instance db1231 set-candidate-master --section s6 false
(dborch1001): sudo orchestrator-client -c untag -i db1231 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1173 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's6';"
  • (If needed): Depool db1173 for maintenance.
sudo dbctl instance db1173 depool
sudo dbctl config commit -m "Depool db1173 T364067"
  • Change db1173 weight to mimic the previous weight of db1231:
sudo dbctl instance db1173 edit
  • Apply outstanding schema changes to db1173 (if any)
  • Update/resolve this ticket.
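The heartbeat clean-up statement above embeds the OLD primary's name in a LIKE pattern, so a mistyped host name would silently match the wrong rows or none. The sketch below builds that statement from a host name and refuses anything that does not look like a plain host name. The function name heartbeat_cleanup_sql and the lowercase/digits/hyphen rule are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: build the heartbeat clean-up statement for a given OLD primary.
# heartbeat_cleanup_sql is a hypothetical helper name, not an existing tool.
heartbeat_cleanup_sql() {
    local old="$1"
    case "$old" in
        # Assumption: short host names contain only a-z, 0-9 and '-'.
        ''|*[!a-z0-9-]*)
            echo "refusing suspicious host name: '$old'" >&2
            return 1
            ;;
    esac
    # %% prints a literal % for the LIKE pattern.
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}

# Possible usage (assumes db-mysql as used in the checklist above):
#   sudo db-mysql db1231 heartbeat -e "$(heartbeat_cleanup_sql db1173)"
```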

Event Timeline

Change #1025916 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1231 to s6 master

https://gerrit.wikimedia.org/r/1025916

Change #1025917 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/1025917

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.
Marostegui added a parent task: Restricted Task. May 3 2024, 6:44 AM

Mentioned in SAL (#wikimedia-operations) [2024-05-09T04:51:39Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364067

Mentioned in SAL (#wikimedia-operations) [2024-05-09T04:52:04Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364067

Mentioned in SAL (#wikimedia-operations) [2024-05-09T04:52:16Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1231 with weight 0 T364067', diff saved to https://phabricator.wikimedia.org/P62162 and previous config saved to /var/cache/conftool/dbconfig/20240509-045216-marostegui.json

Change #1025916 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1231 to s6 master

https://gerrit.wikimedia.org/r/1025916

I am going to revert all these steps: the new master stalled while the slaves were being moved, and I don't feel confident about it. I will reclone it and schedule this for some other time.

Marostegui closed this task as Declined. May 9 2024, 5:30 AM

I have undone all the steps and left db1231 (new) depooled so it can be recloned. Closing this as invalid; I will create a new task once ready.

Change #1025917 abandoned by Ladsgroup:

[operations/dns@master] wmnet: Update s6-master alias

Reason:

https://gerrit.wikimedia.org/r/1025917