
Switchover s7 master (db1236 -> db1181)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s7.dblist

Checklist:

NEW primary: db1181
OLD primary: db1236

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1236.eqiad.wmnet h=db1181.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s7 T367857" 'A:db-section-s7'
  • Set NEW primary with weight 0
sudo dbctl instance db1181 set-weight 0
sudo dbctl config commit -m "Set db1181 with weight 0 T367857"
  • Depool NEW from any specific group (API, vslow, dump) if present.
sudo dbctl instance db1181 edit
# If some changes were made:
sudo dbctl config commit -m "Remove db1181 from API/vslow/dump T367857"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1236 db1181
  • Disable puppet on both nodes
sudo cumin 'db1236* or db1181*' 'disable-puppet "primary switchover T367857"'
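
The prep commands above repeat the host names, section and task ID on every line. A minimal sketch of parameterizing them follows. The function name `prep_plan` and its argument order are hypothetical, not part of any Wikimedia tooling. It only prints the commands for review and runs nothing. The interactive `dbctl instance ... edit` step is left out on purpose.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the non-interactive failover-prep
# commands from the checklist for a given host pair. Nothing is executed.
prep_plan() {
    old="$1"; new="$2"; section="$3"; task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Print the plan for this task so it can be compared with the checklist.
prep_plan db1236 db1181 s7 T367857
```

Printing rather than executing keeps the operator in control of each step, which matches how the checklist is meant to be worked through by hand.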

Failover:

  • Log the failover:
!log Starting s7 eqiad failover from db1236 to db1181 - T367857
  • Set section read-only:
sudo dbctl --scope eqiad section s7 ro "Maintenance until 06:15 UTC - T367857"
sudo dbctl --scope codfw section s7 ro "Maintenance until 06:15 UTC - T367857"
sudo dbctl config commit -m "Set s7 eqiad as read-only for maintenance - T367857"
  • Check s7 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1236 db1181
echo "===== db1236 (OLD)"; sudo db-mysql db1236 -e 'show slave status\G'
echo "===== db1181 (NEW)"; sudo db-mysql db1181 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s7 set-master db1181
sudo dbctl --scope eqiad section s7 rw
sudo dbctl --scope codfw section s7 rw
sudo dbctl config commit -m "Promote db1181 to s7 primary and set section read-write T367857"
  • Restart puppet on both hosts:
sudo cumin 'db1236* or db1181*' 'run-puppet-agent -e "primary switchover T367857"'
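
The two `show slave status\G` commands above are read by eye. After the switch, the old primary is expected to replicate from the new one. A minimal sketch of checking that mechanically follows. The function name `replicates_from` is hypothetical, and the sketch assumes `db-mysql` prints the standard `\G` layout with `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields.

```shell
#!/bin/sh
# Hypothetical helper: reads `show slave status\G` output on stdin and
# succeeds only if Master_Host is the expected host (short name or FQDN)
# and both replication threads report "Yes".
replicates_from() {
    expected="$1"
    status="$(cat)"
    printf '%s\n' "$status" | grep -Eq "^[[:space:]]*Master_Host: ${expected}"'(\.|$)' &&
    printf '%s\n' "$status" | grep -Eq '^[[:space:]]*Slave_IO_Running: Yes[[:space:]]*$' &&
    printf '%s\n' "$status" | grep -Eq '^[[:space:]]*Slave_SQL_Running: Yes[[:space:]]*$'
}

# Intended use on a cumin host (not run here):
#   sudo db-mysql db1236 -e 'show slave status\G' | replicates_from db1181 \
#       && echo "db1236 replicates from db1181"
```

Empty input, which is what a host with no replication configured returns, makes the check fail.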

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1181 heartbeat -e "delete from heartbeat where file like 'db1236%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1181
events_coredb_slave.sql on the new slave db1236
  • Update the candidate primary flags in dbctl and orchestrator:
sudo dbctl instance db1236 set-candidate-master --section s7 true
sudo dbctl instance db1181 set-candidate-master --section s7 false
(dborch1001): sudo orchestrator-client -c untag -i db1181 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1236 --tag name=candidate
  • Check the primary recorded in zarcillo for s7:
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's7';"
  • (If needed): Depool db1236 for maintenance.
sudo dbctl instance db1236 depool
sudo dbctl config commit -m "Depool db1236 T367857"
  • Change db1236 weight to mimic the previous weight of db1181:
sudo dbctl instance db1236 edit
  • Update/resolve this ticket.
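
The heartbeat cleanup above deletes rows directly. One way to make it safer is to count the matching rows first. A minimal sketch follows, using a hypothetical `heartbeat_sql` helper that only builds the statement text. The `count` variant is an addition for illustration. The `delete` variant reproduces the statement from the checklist.

```shell
#!/bin/sh
# Hypothetical helper: builds heartbeat cleanup SQL for a given old primary.
# It prints the statement only; running it is left to the operator.
heartbeat_sql() {
    action="$1"; old="$2"
    case "$action" in
        count)  printf "select count(*) from heartbeat where file like '%s%%';\n" "$old" ;;
        delete) printf "delete from heartbeat where file like '%s%%';\n" "$old" ;;
        *)      echo "usage: heartbeat_sql count|delete HOST" >&2; return 1 ;;
    esac
}

# Intended use on a cumin host (not run here):
#   sudo db-mysql db1181 heartbeat -e "$(heartbeat_sql count db1236)"
#   sudo db-mysql db1181 heartbeat -e "$(heartbeat_sql delete db1236)"
```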

Event Timeline

Change #1047022 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1181 to s7 master

https://gerrit.wikimedia.org/r/1047022

Change #1047023 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s7-master alias

https://gerrit.wikimedia.org/r/1047023

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:04:19Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Primary switchover s7 T367857

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:04:28Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1181 with weight 0 T367857', diff saved to https://phabricator.wikimedia.org/P65217 and previous config saved to /var/cache/conftool/dbconfig/20240620-050428-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:04:43Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Primary switchover s7 T367857

Change #1047022 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db1181 to s7 master

https://gerrit.wikimedia.org/r/1047022

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:22:17Z] <marostegui> Starting s7 eqiad failover from db1236 to db1181 - T367857

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:22:30Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T367857', diff saved to https://phabricator.wikimedia.org/P65218 and previous config saved to /var/cache/conftool/dbconfig/20240620-052230-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:22:54Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1181 to s7 primary and set section read-write T367857', diff saved to https://phabricator.wikimedia.org/P65219 and previous config saved to /var/cache/conftool/dbconfig/20240620-052253-marostegui.json

Change #1047023 merged by Marostegui:

[operations/dns@master] wmnet: Update s7-master alias

https://gerrit.wikimedia.org/r/1047023

Mentioned in SAL (#wikimedia-operations) [2024-06-20T05:24:00Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1236 T367857', diff saved to https://phabricator.wikimedia.org/P65220 and previous config saved to /var/cache/conftool/dbconfig/20240620-052359-root.json

Marostegui updated the task description.

This is done