
Switchover s5 master (db1130 -> db1100)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s5.dblist

Checklist:

NEW primary: db1100
OLD primary: db1130

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1130.eqiad.wmnet h=db1100.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s5 T326133" 'A:db-section-s5'
  • Set the NEW primary's weight to 0 (and depool it from the API or vslow/dump groups if it is present there):
sudo dbctl instance db1100 set-weight 0
sudo dbctl config commit -m "Set db1100 with weight 0 T326133"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1130 db1100
  • Disable puppet on both nodes
sudo cumin 'db1130* or db1100*' 'disable-puppet "primary switchover T326133"'
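
The prep commands above depend on only four values: the old primary, the new primary, the section, and the task ID. A minimal sketch of templating them follows, assuming a hypothetical helper named print_prep_commands (not part of any existing tooling). It only prints the commands from this checklist so they can be reviewed and run by hand; it executes nothing.

```shell
# Hypothetical helper: prints the failover prep commands from this checklist.
# It does not run them. Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    prep_old="$1"
    prep_new="$2"
    prep_section="$3"
    prep_task="$4"
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${prep_section} ${prep_task}" 'A:db-section-${prep_section}'
sudo dbctl instance ${prep_new} set-weight 0
sudo dbctl config commit -m "Set ${prep_new} with weight 0 ${prep_task}"
sudo db-switchover --timeout=25 --only-slave-move ${prep_old} ${prep_new}
sudo cumin '${prep_old}* or ${prep_new}*' 'disable-puppet "primary switchover ${prep_task}"'
EOF
}

# Reproduces the five prep commands listed above for this task.
print_prep_commands db1130 db1100 s5 T326133
```

Printing rather than executing keeps the operator in the loop at each step, which matches how the checklist is meant to be followed.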

Failover:

  • Log the failover:
!log Starting s5 eqiad failover from db1130 to db1100 - T326133
  • Set section read-only:
sudo dbctl --scope eqiad section s5 ro "Maintenance until 06:15 UTC - T326133"
sudo dbctl config commit -m "Set s5 eqiad as read-only for maintenance - T326133"
  • Check s5 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1130 db1100
echo "===== db1130 (OLD)"; sudo db-mysql db1130 -e 'show slave status\G'
echo "===== db1100 (NEW)"; sudo db-mysql db1100 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s5 set-master db1100
sudo dbctl --scope eqiad section s5 rw
sudo dbctl config commit -m "Promote db1100 to s5 primary and set section read-write T326133"
  • Restart puppet on both hosts:
sudo cumin 'db1130* or db1100*' 'run-puppet-agent -e "primary switchover T326133"'
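
The two 'show slave status\G' commands in the switch step produce long output, and the field that matters most for a quick sanity check is Master_Host. A minimal sketch of extracting it follows, assuming a hypothetical filter named master_host_of. The sample text is illustrative and trimmed to two fields; it is not output captured from these hosts.

```shell
# Hypothetical filter: reads 'show slave status\G' output on stdin and prints
# the Master_Host value. Prints nothing if the field is absent (for example,
# when the host has no replication configured and the output is empty).
master_host_of() {
    awk -F': ' '$1 ~ /^ *Master_Host$/ { print $2; exit }'
}

# Illustrative sample only, not real output from db1130.
sample='*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1100.eqiad.wmnet'

printf '%s\n' "$sample" | master_host_of
```

In practice the filter would be fed from the command in the checklist, e.g. sudo db-mysql db1130 -e 'show slave status\G' | master_host_of. On the old primary one would expect it to name the new primary. What the new primary reports depends on the cross-datacenter replication setup, which this task does not describe.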

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1100 heartbeat -e "delete from heartbeat where file like 'db1130%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1100
events_coredb_slave.sql on the new slave db1130
sudo dbctl instance db1130 set-candidate-master --section s5 true
sudo dbctl instance db1100 set-candidate-master --section s5 false
(dborch1001): sudo orchestrator-client -c untag -i db1100 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1130 --tag name=candidate
sudo db-mysql db1115 zarcillo -e "select * from masters where section = 's5';"
  • (If needed): Depool db1130 for maintenance.
sudo dbctl instance db1130 depool
sudo dbctl config commit -m "Depool db1130 T326133"
  • Change db1130 weight to mimic the previous weight of db1100:
sudo dbctl instance db1130 edit
  • Apply outstanding schema changes to db1130 (if any)
  • Update/resolve this ticket.
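
The heartbeat cleanup statement in the list above embeds the old primary's host name in a LIKE pattern, so an empty or mistyped name would widen the match (an empty name yields '%', which matches every row). A minimal sketch of a guard follows, assuming a hypothetical helper named heartbeat_cleanup_sql that only builds the statement and refuses names containing anything other than lowercase letters, digits, and hyphens.

```shell
# Hypothetical helper: prints the heartbeat cleanup statement for one host.
# Rejects empty names and names with unexpected characters, since the name
# becomes part of a LIKE pattern.
heartbeat_cleanup_sql() {
    case "$1" in
        ''|*[!a-z0-9-]*)
            echo "invalid host name: '$1'" >&2
            return 1
            ;;
    esac
    printf "delete from heartbeat where file like '%s%%';\n" "$1"
}

# Prints the same statement the checklist uses for the old primary.
heartbeat_cleanup_sql db1130
```

The printed statement can then be passed to the command from the checklist, e.g. sudo db-mysql db1100 heartbeat -e "$(heartbeat_cleanup_sql db1130)".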

Event Timeline

Change 874826 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1100 to s5 master

https://gerrit.wikimedia.org/r/874826

Change 874827 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s5-master alias

https://gerrit.wikimedia.org/r/874827

Ladsgroup triaged this task as Medium priority.
Ladsgroup moved this task from Triage to Ready on the DBA board.
Ladsgroup subscribed.

Scheduled for next week's Tuesday (10th Jan)

Mentioned in SAL (#wikimedia-operations) [2023-01-10T06:22:28Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 25 hosts with reason: Primary switchover s5 T326133

Mentioned in SAL (#wikimedia-operations) [2023-01-10T06:22:56Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 25 hosts with reason: Primary switchover s5 T326133

Mentioned in SAL (#wikimedia-operations) [2023-01-10T06:23:09Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1100 with weight 0 T326133', diff saved to https://phabricator.wikimedia.org/P42938 and previous config saved to /var/cache/conftool/dbconfig/20230110-062309-ladsgroup.json

Change 874826 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1100 to s5 master

https://gerrit.wikimedia.org/r/874826

Mentioned in SAL (#wikimedia-operations) [2023-01-10T07:01:42Z] <Amir1> Starting s5 eqiad failover from db1130 to db1100 - T326133

Mentioned in SAL (#wikimedia-operations) [2023-01-10T07:01:52Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s5 eqiad as read-only for maintenance - T326133', diff saved to https://phabricator.wikimedia.org/P42939 and previous config saved to /var/cache/conftool/dbconfig/20230110-070152-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-01-10T07:02:23Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1100 to s5 primary and set section read-write T326133', diff saved to https://phabricator.wikimedia.org/P42940 and previous config saved to /var/cache/conftool/dbconfig/20230110-070223-ladsgroup.json

Change 874827 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s5-master alias

https://gerrit.wikimedia.org/r/874827

Mentioned in SAL (#wikimedia-operations) [2023-01-10T07:06:28Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1130 T326133', diff saved to https://phabricator.wikimedia.org/P42941 and previous config saved to /var/cache/conftool/dbconfig/20230110-070628-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review.
Ladsgroup updated the task description.