
Switchover s4 master (db1160 -> db1138)
Closed, Resolved, Public

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s4.dblist

Checklist:

NEW primary: db1138
OLD primary: db1160

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db1160.eqiad.wmnet h=db1138.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s4 T344881" 'A:db-section-s4'
  • Set the NEW primary's weight to 0 (and depool it from the API or vslow/dump groups if it is present in them).
sudo dbctl instance db1138 set-weight 0
sudo dbctl config commit -m "Set db1138 with weight 0 T344881"
  • Topology changes: move all replicas under the NEW primary
sudo db-switchover --timeout=25 --only-slave-move db1160 db1138
  • Disable puppet on both nodes
sudo cumin 'db1160* or db1138*' 'disable-puppet "primary switchover T344881"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
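
The prep commands above repeat the same two hostnames, section and task ID on every line, so a typo in any one of them is easy to miss. As a rough sketch only (the function name print_prep_commands is hypothetical and not part of any existing tooling), the same lines can be generated from four arguments and reviewed before being pasted. It only prints text and runs nothing:

```shell
#!/bin/sh
# Dry-run sketch: print the failover-prep commands from this checklist
# for a given host pair. Nothing is executed.
# print_prep_commands is a hypothetical helper, not an existing tool.
set -eu

# Arguments: OLD primary, NEW primary, section, task ID
print_prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc: variables expand, quote characters stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

print_prep_commands db1160 db1138 s4 T344881
```

With the arguments shown, the printed lines match the prep commands in this checklist one for one.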

Failover:

  • Log the failover:
!log Starting s4 eqiad failover from db1160 to db1138 - T344881
  • Set section read-only:
sudo dbctl --scope eqiad section s4 ro "Maintenance until 06:15 UTC - T344881"
sudo dbctl config commit -m "Set s4 eqiad as read-only for maintenance - T344881"
  • Check s4 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db1160 db1138
echo "===== db1160 (OLD)"; sudo db-mysql db1160 -e 'show slave status\G'
echo "===== db1138 (NEW)"; sudo db-mysql db1138 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section s4 set-master db1138
sudo dbctl --scope eqiad section s4 rw
sudo dbctl config commit -m "Promote db1138 to s4 primary and set section read-write T344881"
  • Restart puppet on both hosts:
sudo cumin 'db1160* or db1138*' 'run-puppet-agent -e "primary switchover T344881"'
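
The two show slave status\G calls in the switch step are meant to be read by eye. A minimal sketch of the same check in script form follows, assuming the intended end state is that db1160 replicates from db1138 with both replication threads running. The function name replica_ok is hypothetical; the field names (Master_Host, Slave_IO_Running, Slave_SQL_Running) are the ones MariaDB prints in that output.

```shell
#!/bin/sh
# Sketch: read 'show slave status\G' output on stdin and succeed only if
# the host replicates from the expected primary with both threads running.
# replica_ok is a hypothetical helper, not an existing tool.
set -eu

replica_ok() {
    awk -v host="$1" '
        $1 == "Master_Host:"       { mh  = $2 }
        $1 == "Slave_IO_Running:"  { io  = $2 }
        $1 == "Slave_SQL_Running:" { sql = $2 }
        # Master_Host must start with the expected hostname.
        END { exit !(index(mh, host) == 1 && io == "Yes" && sql == "Yes") }
    '
}

# Possible usage on the OLD primary after the switch (not run here):
#   sudo db-mysql db1160 -e 'show slave status\G' | replica_ok db1138 && echo "replication OK"
```

Empty input, as produced by a host with no replication configured, makes the function fail, so it should only be pointed at the host that is expected to be a replica.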

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db1138 heartbeat -e "delete from heartbeat where file like 'db1160%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db1138
events_coredb_slave.sql on the new slave db1160
  • Update DNS: FIXME
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db1160 set-candidate-master --section s4 true
sudo dbctl instance db1138 set-candidate-master --section s4 false
(dborch1001): sudo orchestrator-client -c untag -i db1138 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db1160 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's4';"
  • (If needed): Depool db1160 for maintenance.
sudo dbctl instance db1160 depool
sudo dbctl config commit -m "Depool db1160 T344881"
  • Change db1160 weight to mimic the previous weight of db1138:
sudo dbctl instance db1160 edit
  • Apply outstanding schema changes to db1160 (if any)
  • Update/resolve this ticket.

Event Timeline

Change 951870 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db1138 to s4 master

https://gerrit.wikimedia.org/r/951870

Change 951871 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s4-master alias

https://gerrit.wikimedia.org/r/951871

Mentioned in SAL (#wikimedia-operations) [2023-08-24T05:19:17Z] <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344881

Mentioned in SAL (#wikimedia-operations) [2023-08-24T05:19:42Z] <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T344881

Mentioned in SAL (#wikimedia-operations) [2023-08-24T05:19:52Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set db1138 with weight 0 T344881', diff saved to https://phabricator.wikimedia.org/P51154 and previous config saved to /var/cache/conftool/dbconfig/20230824-051951-ladsgroup.json

Change 951870 merged by Ladsgroup:

[operations/puppet@production] mariadb: Promote db1138 to s4 master

https://gerrit.wikimedia.org/r/951870

Mentioned in SAL (#wikimedia-operations) [2023-08-24T06:01:46Z] <Amir1> Starting s4 eqiad failover from db1160 to db1138 - T344881

Mentioned in SAL (#wikimedia-operations) [2023-08-24T06:01:58Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T344881', diff saved to https://phabricator.wikimedia.org/P51173 and previous config saved to /var/cache/conftool/dbconfig/20230824-060157-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2023-08-24T06:02:46Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Promote db1138 to s4 primary and set section read-write T344881', diff saved to https://phabricator.wikimedia.org/P51174 and previous config saved to /var/cache/conftool/dbconfig/20230824-060245-ladsgroup.json
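
The two dbctl commits logged above (read-only at 06:01:58Z, read-write at 06:02:46Z) give a rough figure for how long s4 was read-only. A small sketch of that arithmetic, assuming GNU date is available for parsing ISO 8601 timestamps; the function name seconds_between is hypothetical. These are SAL log times, so the result is an approximation of the actual window.

```shell
#!/bin/sh
# Sketch: seconds between two ISO 8601 UTC timestamps.
# Assumes GNU date (the -d option); seconds_between is a hypothetical helper.
set -eu

seconds_between() {
    start="$(date -u -d "$1" +%s)"
    end="$(date -u -d "$2" +%s)"
    echo $(( end - start ))
}

# Timestamps copied from the two SAL entries above.
seconds_between 2023-08-24T06:01:58Z 2023-08-24T06:02:46Z   # prints 48
```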

Change 951871 merged by Ladsgroup:

[operations/dns@master] wmnet: Update s4-master alias

https://gerrit.wikimedia.org/r/951871

Mentioned in SAL (#wikimedia-operations) [2023-08-24T06:06:48Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depool db1160 T344881', diff saved to https://phabricator.wikimedia.org/P51175 and previous config saved to /var/cache/conftool/dbconfig/20230824-060647-ladsgroup.json

Ladsgroup claimed this task.
Ladsgroup triaged this task as Medium priority.
Ladsgroup updated the task description.
Ladsgroup changed the edit policy from "Custom Policy" to "All Users".
Ladsgroup moved this task from Triage to Done on the DBA board.