
Switchover s6 master (db1131 -> db1173)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s6.dblist

Checklist:

NEW primary: db1173
OLD primary: db1131

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1131.eqiad.wmnet h=db1173.eqiad.wmnet
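The result of the configuration check can be read from the command's exit status as well as from its output. A minimal sketch, assuming (as the Percona Toolkit documentation describes) that `pt-config-diff` exits with 0 when there are no differences and 1 when there are; `check_config_diff` is a hypothetical wrapper, not part of the Wikimedia tooling:

```shell
#!/bin/bash
# Hypothetical wrapper: runs the command given as arguments and reports
# its exit status in words.
# Assumption: 0 = no differences, 1 = differences found, anything else = error.
check_config_diff() {
    "$@"
    local rc=$?
    case "$rc" in
        0) echo "no configuration differences" ;;
        1) echo "configuration differences found, review before switching" ;;
        *) echo "command failed with exit status $rc" ;;
    esac
    return "$rc"
}

# Real usage would wrap the checklist command:
#   check_config_diff sudo pt-config-diff --defaults-file /root/.my.cnf \
#       h=db1131.eqiad.wmnet h=db1173.eqiad.wmnet
# Stand-in so the sketch runs anywhere:
check_config_diff true
```

The wrapper returns the original exit status, so it can still gate later steps in a script.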

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s6 T320879" 'A:db-section-s6'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1173 set-weight 0
sudo dbctl config commit -m "Set db1173 with weight 0 T320879"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1131 db1173
  • Disable puppet on both nodes
sudo cumin 'db1131* or db1173*' 'disable-puppet "primary switchover T320879"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
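The prep commands above share the same four values: old primary, new primary, section and task ID. A minimal sketch of a helper that prints them for review instead of running them; `print_prep_commands` is a hypothetical name, not part of the Wikimedia tooling, and the Gerrit merge step is left out because it is done by hand:

```shell
#!/bin/bash
# Hypothetical helper: prints the failover prep commands from the checklist
# for a given switchover. It only prints; nothing is executed.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

print_prep_commands db1131 db1173 s6 T320879
```

Printing rather than executing keeps a human in the loop for each step, which matches how the checklist is meant to be followed.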

Failover:

  • Log the failover:
!log Starting s6 eqiad failover from db1131 to db1173 - T320879
  • Set section read-only:
sudo dbctl --scope eqiad section s6 ro "Maintenance until 06:15 UTC - T320879"
sudo dbctl config commit -m "Set s6 eqiad as read-only for maintenance - T320879"
  • Check s6 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1131 db1173
echo "===== db1131 (OLD)"; sudo db-mysql db1131 -e 'show slave status\G'
echo "===== db1173 (NEW)"; sudo db-mysql db1173 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s6 set-master db1173
sudo dbctl --scope eqiad section s6 rw
sudo dbctl config commit -m "Promote db1173 to s6 primary and set section read-write T320879"
  • Restart puppet on both hosts:
sudo cumin 'db1131* or db1173*' 'run-puppet-agent -e "primary switchover T320879"'
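The two `show slave status\G` calls in the switch step exist to confirm the replication direction after the switch. A minimal sketch of checking that output automatically, assuming that after a successful switchover the OLD primary reports both replication threads as `Yes` and the NEW primary returns no replication status at all; `replication_threads_running` is a hypothetical helper, not part of the Wikimedia tooling:

```shell
#!/bin/bash
# Hypothetical helper: reads 'show slave status\G' output on stdin and
# succeeds only if both the IO and SQL replication threads report "Yes".
# Empty input (no replication configured) counts as failure.
replication_threads_running() {
    awk -F': ' '
        /^ *Slave_IO_Running:/  { io = $2 }
        /^ *Slave_SQL_Running:/ { sql = $2 }
        END { if (io == "Yes" && sql == "Yes") exit 0; exit 1 }
    '
}

# Sample input standing in for:
#   sudo db-mysql db1131 -e 'show slave status\G'
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'

if printf '%s\n' "$sample" | replication_threads_running; then
    echo "replication threads running"
else
    echo "replication NOT running"
fi
```

Under the stated assumption, the check should succeed for db1131 (now a replica) and fail for db1173 (now the primary).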

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1173 heartbeat -e "delete from heartbeat where file like 'db1131%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1173
events_coredb_slave.sql on the new slave db1131
  • Update DNS: FIXME
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1131 set-candidate-master --section s6 true
sudo dbctl instance db1173 set-candidate-master --section s6 false
(dborch1001): sudo orchestrator-client -c untag -i db1173 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1131 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's6';"
  • (If needed): Depool db1131 for maintenance.
sudo dbctl instance db1131 depool
sudo dbctl config commit -m "Depool db1131 T320879"
  • Change db1131 weight to mimic the previous weight of db1173:
sudo dbctl instance db1131 edit
  • Apply outstanding schema changes to db1131 (if any)
  • Update/resolve this ticket.
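The heartbeat cleanup relies on a `LIKE` pattern built from the old primary's host name, so an empty or mistyped name would widen the delete (an empty name gives the pattern `'%'`, which matches every row). A minimal sketch of a guard around that statement, assuming host names of the form `db` followed by digits; `heartbeat_cleanup_sql` is a hypothetical helper, not part of the Wikimedia tooling:

```shell
#!/bin/bash
# Hypothetical helper: builds the heartbeat cleanup statement for the old
# primary, refusing any host name that is not "db" followed by digits.
heartbeat_cleanup_sql() {
    local old="$1"
    if ! printf '%s' "$old" | grep -Eq '^db[0-9]+$'; then
        echo "refusing to build query: unexpected host name '$old'" >&2
        return 1
    fi
    printf "delete from heartbeat where file like '%s%%';\n" "$old"
}

# Prints the statement; it could then be passed to:
#   sudo db-mysql db1173 heartbeat -e "<statement>"
heartbeat_cleanup_sql db1131
```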

Event Timeline

Change 842400 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1173 to s6 master

https://gerrit.wikimedia.org/r/842400

Change 842401 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/842401

Mentioned in SAL (#wikimedia-operations) [2022-10-15T22:44:06Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T320879

Mentioned in SAL (#wikimedia-operations) [2022-10-15T22:44:24Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T320879

Mentioned in SAL (#wikimedia-operations) [2022-10-15T22:44:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1173 with weight 0 T320879', diff saved to https://phabricator.wikimedia.org/P35492 and previous config saved to /var/cache/conftool/dbconfig/20221015-224455-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-10-15T22:54:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T320879', diff saved to https://phabricator.wikimedia.org/P35493 and previous config saved to /var/cache/conftool/dbconfig/20221015-225454-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-10-15T22:58:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1173 to s6 primary and set section read-write T320879', diff saved to https://phabricator.wikimedia.org/P35494 and previous config saved to /var/cache/conftool/dbconfig/20221015-225858-ladsgroup.json

Ladsgroup triaged this task as Unbreak Now! priority.

Mentioned in SAL (#wikimedia-operations) [2022-10-15T23:10:12Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T320879

Mentioned in SAL (#wikimedia-operations) [2022-10-15T23:10:30Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T320879

Change 842400 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1173 to s6 master

https://gerrit.wikimedia.org/r/842400

Mentioned in SAL (#wikimedia-operations) [2022-10-15T23:22:49Z] <Amir1> Starting s6 eqiad failover from db1131 to db1173 - T320879

Mentioned in SAL (#wikimedia-operations) [2022-10-15T23:23:20Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T320879', diff saved to https://phabricator.wikimedia.org/P35495 and previous config saved to /var/cache/conftool/dbconfig/20221015-232320-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-10-15T23:23:52Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1173 to s6 primary and set section read-write T320879', diff saved to https://phabricator.wikimedia.org/P35496 and previous config saved to /var/cache/conftool/dbconfig/20221015-232351-ladsgroup.json

Change 842401 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s6-master alias

https://gerrit.wikimedia.org/r/842401

Mentioned in SAL (#wikimedia-operations) [2022-10-15T23:27:16Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1131 T320879', diff saved to https://phabricator.wikimedia.org/P35497 and previous config saved to /var/cache/conftool/dbconfig/20221015-232716-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.
Ladsgroup moved this task from Triage to Done on the DBA board.

Noting that this was an emergency switchover caused by the master becoming fully unreachable: https://www.wikimediastatus.net/incidents/hnm5c223c26v

Quiddity subscribed.

Thanks @Ladsgroup - I've moved the tag to T320990 and linked to that task in the entry at https://meta.wikimedia.org/wiki/Tech/News/2022/43 (edit/revert freely!)