
Switchover s1 master (db1163 -> db1184)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s1.dblist

Checklist:

NEW primary: db1184
OLD primary: db1163

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1163.eqiad.wmnet h=db1184.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s1 T344621" 'A:db-section-s1'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1184 set-weight 0
sudo dbctl config commit -m "Set db1184 with weight 0 T344621"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1163 db1184
  • Disable puppet on both nodes
sudo cumin 'db1163* or db1184*' 'disable-puppet "primary switchover T344621"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
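
The prep commands above differ between switchovers only in the host names, the section and the task ID. A minimal dry-run sketch, assuming those four values are passed as arguments, could render them for review before anything is executed. `render_prep` is a hypothetical helper, not part of dbctl, cumin or db-switchover; it only prints the command lines exactly as they appear in this checklist.

```shell
#!/bin/sh
# Dry-run sketch: prints the failover-prep commands instead of running them.
# render_prep is a hypothetical helper name, not an existing tool.
# Arguments: OLD primary, NEW primary, section, task ID.
render_prep() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc delimiter: variables expand, quotes stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

render_prep db1163 db1184 s1 T344621
```

Printing rather than executing keeps the sketch safe to run anywhere. The rendered lines can be compared against the checklist before being pasted on a cumin host.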

Failover:

  • Log the failover:
!log Starting s1 eqiad failover from db1163 to db1184 - T344621
  • Set section read-only:
sudo dbctl --scope eqiad section s1 ro "Maintenance until 06:15 UTC - T344621"
sudo dbctl config commit -m "Set s1 eqiad as read-only for maintenance - T344621"
  • Check s1 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1163 db1184
echo "===== db1163 (OLD)"; sudo db-mysql db1163 -e 'show slave status\G'
echo "===== db1184 (NEW)"; sudo db-mysql db1184 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s1 set-master db1184
sudo dbctl --scope eqiad section s1 rw
sudo dbctl config commit -m "Promote db1184 to s1 primary and set section read-write T344621"
  • Restart puppet on both hosts:
sudo cumin 'db1163* or db1184*' 'run-puppet-agent -e "primary switchover T344621"'
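
The two `show slave status\G` commands in the switch step are read by eye. As a rough sketch, the same check could be scripted by parsing that output for the standard `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields. `replica_ok` is a hypothetical helper, and the expected host name is passed in because the exact form reported in `Master_Host` (short name or FQDN) is an assumption here.

```shell
#!/bin/sh
# Hypothetical helper: reads `show slave status\G` output on stdin and
# succeeds only if Master_Host equals $1 and both replication threads
# report Yes.
replica_ok() {
    expected_master="$1"
    awk -v want="$expected_master" '
        {
            # Each \G line looks like "   Field: value".
            line = $0
            sub(/^[ \t]+/, "", line)
            i = index(line, ": ")
            if (i == 0) next
            key = substr(line, 1, i - 1)
            val = substr(line, i + 2)
            if (key == "Master_Host")       host = val
            if (key == "Slave_IO_Running")  io   = val
            if (key == "Slave_SQL_Running") sql  = val
        }
        END { exit !(host == want && io == "Yes" && sql == "Yes") }
    '
}
```

A possible use after the switch, assuming the FQDN form: `sudo db-mysql db1163 -e 'show slave status\G' | replica_ok db1184.eqiad.wmnet && echo "db1163 replicates from db1184"`. Empty input makes the helper fail, so it does not apply to the NEW primary if that host no longer reports any replication status.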

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1184 heartbeat -e "delete from heartbeat where file like 'db1163%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1184
events_coredb_slave.sql on the new slave db1163
  • Update DNS: FIXME
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1163 set-candidate-master --section s1 true
sudo dbctl instance db1184 set-candidate-master --section s1 false
(dborch1001): sudo orchestrator-client -c untag -i db1184 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1163 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's1';"
  • (If needed): Depool db1163 for maintenance.
sudo dbctl instance db1163 depool
sudo dbctl config commit -m "Depool db1163 T344621"
  • Change db1163 weight to mimic the previous weight of db1184:
sudo dbctl instance db1163 edit
  • Apply outstanding schema changes to db1163 (if any)
  • Update/resolve this ticket.
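
For the heartbeat clean-up step above, it can help to preview how many rows the `LIKE` pattern matches before deleting them. The sketch below only prints SQL. `heartbeat_cleanup_sql` is a hypothetical helper, and the `select count(*)` preview is an addition that is not part of the original checklist; the `delete` statement is the one from the checklist.

```shell
#!/bin/sh
# Hypothetical helper: prints a read-only preview query followed by the
# DELETE from the checklist, both built from the same LIKE pattern.
# Argument: OLD primary host name (used as the pattern prefix).
heartbeat_cleanup_sql() {
    old="$1"
    # %% in the printf format emits a literal % (the SQL wildcard).
    printf "select count(*) from heartbeat where file like '%s%%';\n" "$old"
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}

heartbeat_cleanup_sql db1163
```

The first printed line could be passed to `sudo db-mysql db1184 heartbeat -e "..."` as a check, and the second line run only once the count looks plausible.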

Event Timeline

Change 951088 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1184 to s1 master

https://gerrit.wikimedia.org/r/951088

Change 951089 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/951089

Marostegui triaged this task as Medium priority. Aug 21 2023, 3:58 PM
Marostegui added a subscriber: Ladsgroup.

I woke up early enough. I'll do it!

Mentioned in SAL (#wikimedia-operations) [2023-08-22T05:13:02Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344621

Mentioned in SAL (#wikimedia-operations) [2023-08-22T05:13:27Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T344621

Mentioned in SAL (#wikimedia-operations) [2023-08-22T05:13:47Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1184 with weight 0 T344621', diff saved to https://phabricator.wikimedia.org/P50760 and previous config saved to /var/cache/conftool/dbconfig/20230822-051347-ladsgroup.json

Change 951088 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1184 to s1 master

https://gerrit.wikimedia.org/r/951088

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:00:44Z] <Amir1> Starting s1 eqiad failover from db1163 to db1184 - T344621

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:01:04Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T344621', diff saved to https://phabricator.wikimedia.org/P50778 and previous config saved to /var/cache/conftool/dbconfig/20230822-060104-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:01:32Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1184 to s1 primary and set section read-write T344621', diff saved to https://phabricator.wikimedia.org/P50779 and previous config saved to /var/cache/conftool/dbconfig/20230822-060131-ladsgroup.json

Change 951089 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s1-master alias

https://gerrit.wikimedia.org/r/951089

Mentioned in SAL (#wikimedia-operations) [2023-08-22T06:07:11Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1163 T344621', diff saved to https://phabricator.wikimedia.org/P50782 and previous config saved to /var/cache/conftool/dbconfig/20230822-060710-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".
Ladsgroup moved this task from In progress to Done on the DBA board.