
Switchover s3 master (db1123 -> db1157)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s3.dblist

Checklist:

NEW primary: db1157
OLD primary: db1123

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1123.eqiad.wmnet h=db1157.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s3 T323546" 'A:db-section-s3'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db1157 set-weight 0
sudo dbctl config commit -m "Set db1157 with weight 0 T323546"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1123 db1157
  • Disable puppet on both nodes
sudo cumin 'db1123* or db1157*' 'disable-puppet "primary switchover T323546"'
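
The prep steps above can be sketched as one parametrized script, so the same sequence can be reviewed before it is run. This is only a sketch: the variable names (OLD, NEW, SECTION, TASK, DRY_RUN) and the run/failover_prep helpers are hypothetical and not part of dbctl, cumin or db-switchover. The commands themselves are copied from the checklist. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch only: replays the "Failover prep" commands from this checklist.
# OLD, NEW, SECTION, TASK, DRY_RUN, run and failover_prep are illustrative
# names invented for this example, not part of any existing tool.
OLD=${OLD:-db1123}
NEW=${NEW:-db1157}
SECTION=${SECTION:-s3}
TASK=${TASK:-T323546}
DRY_RUN=${DRY_RUN:-1}   # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
# The printed form is for review only; it does not preserve shell quoting.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Run the prep steps in checklist order, stopping at the first failure.
failover_prep() {
  run sudo cookbook sre.hosts.downtime --hours 1 \
      -r "Primary switchover $SECTION $TASK" "A:db-section-$SECTION" &&
  run sudo dbctl instance "$NEW" set-weight 0 &&
  run sudo dbctl config commit -m "Set $NEW with weight 0 $TASK" &&
  run sudo db-switchover --timeout=25 --only-slave-move "$OLD" "$NEW" &&
  run sudo cumin "$OLD* or $NEW*" "disable-puppet \"primary switchover $TASK\""
}
```

Calling failover_prep with the defaults prints the five commands; setting DRY_RUN=0 would execute them. Chaining with && is a deliberate choice here so that a failed step (for example the replica move) stops the sequence before puppet is disabled.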

Failover:

  • Log the failover:
!log Starting s3 eqiad failover from db1123 to db1157 - T323546
  • Set section read-only:
sudo dbctl --scope eqiad section s3 ro "Maintenance until 06:15 UTC - T323546"
sudo dbctl config commit -m "Set s3 eqiad as read-only for maintenance - T323546"
  • Check s3 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1123 db1157
echo "===== db1123 (OLD)"; sudo db-mysql db1123 -e 'show slave status\G'
echo "===== db1157 (NEW)"; sudo db-mysql db1157 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s3 set-master db1157
sudo dbctl --scope eqiad section s3 rw
sudo dbctl config commit -m "Promote db1157 to s3 primary and set section read-write T323546"
  • Restart puppet on both hosts:
sudo cumin 'db1123* or db1157*' 'run-puppet-agent -e "primary switchover T323546"'
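
The two `show slave status\G` calls above are read by eye. A minimal sketch of automating that check for the old primary follows. The function name check_replica is hypothetical, and it assumes Master_Host is reported as the fully qualified name of the new primary, which may not hold in every setup. It only looks at the standard Master_Host, Slave_IO_Running and Slave_SQL_Running fields.

```shell
#!/bin/sh
# Sketch only: check_replica is a hypothetical helper, not an existing tool.
# It reads `show slave status\G` output on stdin and verifies that the host
# replicates from the expected primary with both replication threads running.
check_replica() {
  expected_master=$1
  awk -v want="$expected_master" '
    $1 == "Master_Host:"       { host = $2 }
    $1 == "Slave_IO_Running:"  { io = $2 }
    $1 == "Slave_SQL_Running:" { sql = $2 }
    END {
      if (host == want && io == "Yes" && sql == "Yes") {
        print "OK"
        exit 0
      }
      printf "NOT OK: Master_Host=%s IO=%s SQL=%s\n", host, io, sql
      exit 1
    }'
}
```

A possible use, assuming the FQDN form of the hostname: `sudo db-mysql db1123 -e 'show slave status\G' | check_replica db1157.eqiad.wmnet`. The function exits non-zero when any of the three fields does not match, so it can gate the later dbctl steps in a script.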

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1157 heartbeat -e "delete from heartbeat where file like 'db1123%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1157
events_coredb_slave.sql on the new slave db1123
  • Update candidate master in dbctl and orchestrator:
sudo dbctl instance db1123 set-candidate-master --section s3 true
sudo dbctl instance db1157 set-candidate-master --section s3 false
(dborch1001): sudo orchestrator-client -c untag -i db1157 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1123 --tag name=candidate
  • Check zarcillo was updated:
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's3';"
  • (If needed): Depool db1123 for maintenance.
sudo dbctl instance db1123 depool
sudo dbctl config commit -m "Depool db1123 T323546"
  • Change db1123 weight to mimic the previous weight of db1157:
sudo dbctl instance db1123 edit
  • Apply outstanding schema changes to db1123 (if any)
  • Update/resolve this ticket.
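
For the heartbeat clean-up step, one cautious approach is to count the matching rows before deleting them. The sketch below only builds the SQL strings. heartbeat_sql is a hypothetical helper name, and the predicate is the one used in the checklist's delete statement.

```shell
#!/bin/sh
# Sketch only: heartbeat_sql is a hypothetical helper, not an existing tool.
# It prints either a preview query or the delete statement, both using the
# same predicate as the checklist (file like '<old primary>%').
heartbeat_sql() {
  action=$1   # "preview" or "delete"
  old=$2      # short hostname of the OLD primary, e.g. db1123
  case $action in
    preview) printf "select count(*) from heartbeat where file like '%s%%';\n" "$old" ;;
    delete)  printf "delete from heartbeat where file like '%s%%';\n" "$old" ;;
    *) echo "usage: heartbeat_sql preview|delete HOST" >&2; return 2 ;;
  esac
}
```

For example, `sudo db-mysql db1157 heartbeat -e "$(heartbeat_sql preview db1123)"` would show how many rows the delete is going to touch, and the same call with `delete` reproduces the checklist command.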

Event Timeline

Change 858380 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1157 to s3 master

https://gerrit.wikimedia.org/r/858380

Change 858381 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/858381

Marostegui moved this task from In progress to Ready on the DBA board.
Ladsgroup subscribed.

Scheduled for Tuesday Nov 29th.

Ladsgroup moved this task from Ready to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2022-11-29T05:45:48Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 23 hosts with reason: Primary switchover s3 T323546

Mentioned in SAL (#wikimedia-operations) [2022-11-29T05:46:15Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 23 hosts with reason: Primary switchover s3 T323546

Mentioned in SAL (#wikimedia-operations) [2022-11-29T05:47:17Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1157 with weight 0 T323546', diff saved to https://phabricator.wikimedia.org/P41577 and previous config saved to /var/cache/conftool/dbconfig/20221129-054717-ladsgroup.json

Change 858380 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1157 to s3 master

https://gerrit.wikimedia.org/r/858380

Mentioned in SAL (#wikimedia-operations) [2022-11-29T07:00:07Z] <Amir1> Starting s3 eqiad failover from db1123 to db1157 - T323546

Mentioned in SAL (#wikimedia-operations) [2022-11-29T07:00:33Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T323546', diff saved to https://phabricator.wikimedia.org/P41591 and previous config saved to /var/cache/conftool/dbconfig/20221129-070032-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-11-29T07:01:03Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1157 to s3 primary and set section read-write T323546', diff saved to https://phabricator.wikimedia.org/P41592 and previous config saved to /var/cache/conftool/dbconfig/20221129-070102-ladsgroup.json

Change 858381 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s3-master alias

https://gerrit.wikimedia.org/r/858381

Mentioned in SAL (#wikimedia-operations) [2022-11-29T07:06:38Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1123 T323546', diff saved to https://phabricator.wikimedia.org/P41594 and previous config saved to /var/cache/conftool/dbconfig/20221129-070637-ladsgroup.json

Ladsgroup triaged this task as Medium priority.
Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.