
Switchover s6 master (db1173 -> db1131)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist

Checklist:

NEW primary: db1131
OLD primary: db1173

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1173.eqiad.wmnet h=db1131.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s6 T326134" 'A:db-section-s6'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1131 set-weight 0
sudo dbctl config commit -m "Set db1131 with weight 0 T326134"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1173 db1131
  • Disable puppet on both nodes
sudo cumin 'db1173* or db1131*' 'disable-puppet "primary switchover T326134"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
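
The prep commands above differ between switchovers only in the host names, section and task ID. A minimal sketch of a dry-run helper that prints them for review before anything is run, assuming bash; print_prep_commands and its argument order are hypothetical names for this sketch, not part of any existing tooling, and the helper never executes the commands it prints:

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the failover-prep commands from the checklist
# for a given old/new primary pair. Nothing here is executed.
# print_prep_commands is a hypothetical helper, not an existing tool.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quotes are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Reproduces the five prep commands listed in this task.
print_prep_commands db1173 db1131 s6 T326134
```

Printing rather than executing keeps the operator in control of each step; the output can be compared line by line against the checklist.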

Failover:

  • Log the failover:
!log Starting s6 eqiad failover from db1173 to db1131 - T326134
  • Set section read-only:
sudo dbctl --scope eqiad section s6 ro "Maintenance until 06:15 UTC - T326134"
sudo dbctl config commit -m "Set s6 eqiad as read-only for maintenance - T326134"
  • Check s6 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1173 db1131
echo "===== db1173 (OLD)"; sudo db-mysql db1173 -e 'show slave status\G'
echo "===== db1131 (NEW)"; sudo db-mysql db1131 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s6 set-master db1131
sudo dbctl --scope eqiad section s6 rw
sudo dbctl config commit -m "Promote db1131 to s6 primary and set section read-write T326134"
  • Restart puppet on both hosts:
sudo cumin 'db1173* or db1131*' 'run-puppet-agent -e "primary switchover T326134"'
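
The two 'show slave status\G' commands in the switch step are read by eye. A minimal sketch of a filter that checks the two thread-state fields in that output, assuming the standard MariaDB field names Slave_IO_Running and Slave_SQL_Running; replication_healthy is a hypothetical helper name, and the sample input in the last line is made up for illustration, not captured from db1173:

```shell
#!/usr/bin/env bash
# replication_healthy reads `show slave status\G` output on stdin and
# exits 0 only if both replication threads report "Yes".
# Hypothetical helper for illustration; not part of db-switchover.
replication_healthy() {
    awk -F': ' '
        /^[[:space:]]*Slave_IO_Running:/  { io = $2 }
        /^[[:space:]]*Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Intended use on the old primary after the switch (not run here):
#   sudo db-mysql db1173 -e 'show slave status\G' | replication_healthy

# Illustrative sample input only.
printf '%s\n' '  Slave_IO_Running: Yes' '  Slave_SQL_Running: Yes' \
    | replication_healthy && echo "replication threads running"
```

Empty input (no replication configured) is treated as unhealthy, so the filter only makes sense for a host that is expected to be replicating.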

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1131 heartbeat -e "delete from heartbeat where file like 'db1173%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1131
events_coredb_slave.sql on the new slave db1173
  • Update DNS: FIXME
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1173 set-candidate-master --section s6 true
sudo dbctl instance db1131 set-candidate-master --section s6 false
(dborch1001): sudo orchestrator-client -c untag -i db1131 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1173 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's6';"
  • (If needed): Depool db1173 for maintenance.
sudo dbctl instance db1173 depool
sudo dbctl config commit -m "Depool db1173 T326134"
  • Change db1173's weight to match the previous weight of db1131:
sudo dbctl instance db1173 edit
  • Apply outstanding schema changes to db1173 (if any)
  • Update/resolve this ticket.

Event Timeline

Change 874828 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1131 to s6 master

https://gerrit.wikimedia.org/r/874828

Change 874829 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/874829

Ladsgroup triaged this task as Medium priority.
Ladsgroup moved this task from Triage to Ready on the DBA board.
Ladsgroup added a project: User-notice.
Ladsgroup added a subscriber: Ladsgroup.

Scheduled for next week's Thursday (12th Jan)

I'm doing it now because we missed the previous window.

Mentioned in SAL (#wikimedia-operations) [2023-01-17T06:06:38Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T326134

Mentioned in SAL (#wikimedia-operations) [2023-01-17T06:06:57Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T326134

Mentioned in SAL (#wikimedia-operations) [2023-01-17T06:07:11Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1131 with weight 0 T326134', diff saved to https://phabricator.wikimedia.org/P43160 and previous config saved to /var/cache/conftool/dbconfig/20230117-060710-ladsgroup.json

Change 874828 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1131 to s6 master

https://gerrit.wikimedia.org/r/874828

Mentioned in SAL (#wikimedia-operations) [2023-01-17T07:00:19Z] <Amir1> Starting s6 eqiad failover from db1173 to db1131 - T326134

Mentioned in SAL (#wikimedia-operations) [2023-01-17T07:01:23Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T326134', diff saved to https://phabricator.wikimedia.org/P43162 and previous config saved to /var/cache/conftool/dbconfig/20230117-070035-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-01-17T07:02:29Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1131 to s6 primary and set section read-write T326134', diff saved to https://phabricator.wikimedia.org/P43163 and previous config saved to /var/cache/conftool/dbconfig/20230117-070102-ladsgroup.json
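
The two dbctl commits above bracket the read-only period as seen in SAL. A rough sketch of the arithmetic, assuming GNU date (for -d) and taking the SAL timestamps as approximate commit times; seconds_between is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# seconds_between prints the number of seconds from the first ISO 8601
# timestamp to the second. Requires GNU date for the -d option.
seconds_between() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# Read-only commit logged at 07:01:23Z, read-write commit at 07:02:29Z.
seconds_between 2023-01-17T07:01:23Z 2023-01-17T07:02:29Z   # prints 66
```

This measures only the gap between the two SAL log entries, so it is an estimate of the read-only window and not an exact measurement.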

Change 874829 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/874829

Mentioned in SAL (#wikimedia-operations) [2023-01-17T07:05:32Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1173 T326134', diff saved to https://phabricator.wikimedia.org/P43164 and previous config saved to /var/cache/conftool/dbconfig/20230117-070532-ladsgroup.json

Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".
Ladsgroup moved this task from Ready to Done on the DBA board.