Switchover s4 master (db2179 -> db2140)
Closed, Resolved (Public)

Description

When: During a pre-defined DBA maintenance window

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

  • Team calendar invite

Affected wikis: https://noc.wikimedia.org/conf/highlight.php?file=dblists/s4.dblist

Checklist:

NEW primary: db2140
OLD primary: db2179

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=db2179.codfw.wmnet h=db2140.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover s4 T349820" 'A:db-section-s4'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance db2140 set-weight 0
sudo dbctl config commit -m "Set db2140 with weight 0 T349820"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move db2179 db2140
  • Disable puppet on both nodes
sudo cumin 'db2179* or db2140*' 'disable-puppet "primary switchover T349820"'
  • Merge gerrit puppet change to promote NEW primary: FIXME
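
The prep commands above differ between switchovers only in the two hostnames, the section, and the task ID. As a minimal dry-run sketch, the function below prints those commands for review instead of running them. The name `print_prep_commands` is hypothetical and not part of any existing tooling; the commands themselves are copied from the checklist.

```shell
#!/bin/bash
# Hypothetical helper: prints (does not run) the failover-prep commands
# from the checklist for a given OLD/NEW primary pair.
# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quote characters stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

print_prep_commands db2179 db2140 s4 T349820
```

Printing rather than executing keeps the operator in control of each step, which matches how the checklist is meant to be followed by hand.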

Failover:

  • Log the failover:
!log Starting s4 codfw failover from db2179 to db2140 - T349820
  • Set section read-only:
sudo dbctl --scope codfw section s4 ro "Maintenance until 06:15 UTC - T349820"
sudo dbctl config commit -m "Set s4 codfw as read-only for maintenance - T349820"
  • Check s4 is indeed read-only
  • Switch primaries:
sudo db-switchover --skip-slave-move db2179 db2140
echo "===== db2179 (OLD)"; sudo db-mysql db2179 -e 'show slave status\G'
echo "===== db2140 (NEW)"; sudo db-mysql db2140 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section s4 set-master db2140
sudo dbctl --scope codfw section s4 rw
sudo dbctl config commit -m "Promote db2140 to s4 primary and set section read-write T349820"
  • Restart puppet on both hosts:
sudo cumin 'db2179* or db2140*' 'run-puppet-agent -e "primary switchover T349820"'
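
The two `show slave status\G` commands above produce long output, and the main thing to confirm is that the OLD primary now replicates from the NEW one. The sketch below pulls a single field out of that output. The name `slave_status_field` is hypothetical, and the sample text is illustrative, not captured from a real host; whether `Master_Host` shows a fully qualified name is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: reads 'show slave status\G' output on stdin and prints
# the value of one field (e.g. Master_Host, Slave_IO_Running).
slave_status_field() {
    awk -v key="$1" -F': ' '
        { name = $1; gsub(/^[ \t]+/, "", name) }   # strip the left padding
        name == key { print $2; exit }
    '
}

# Illustrative sample only, not real output from db2179.
sample='*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db2140.codfw.wmnet
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'

printf '%s\n' "$sample" | slave_status_field Master_Host
```

On a real host the same function could be fed from the checklist command, for example `sudo db-mysql db2179 -e 'show slave status\G' | slave_status_field Master_Host`.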

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql db2140 heartbeat -e "delete from heartbeat where file like 'db2179%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary db2140
events_coredb_slave.sql on the new slave db2179
  • Update DNS: FIXME
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance db2179 set-candidate-master --section s4 true
sudo dbctl instance db2140 set-candidate-master --section s4 false
(dborch1001): sudo orchestrator-client -c untag -i db2140 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i db2179 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 's4';"
  • (If needed): Depool db2179 for maintenance.
sudo dbctl instance db2179 depool
sudo dbctl config commit -m "Depool db2179 T349820"
  • Change db2179 weight to mimic the previous weight of db2140:
sudo dbctl instance db2179 edit
  • Apply outstanding schema changes to db2179 (if any) to follow on T343198
  • Update/resolve this ticket.
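
Before running the heartbeat `delete` from the clean-up list, it may help to preview which rows it would match. The sketch below builds a read-only query with the same `where` clause. The name `heartbeat_preview_sql` is hypothetical; only the `file` column named in the checklist is assumed to exist.

```shell
#!/bin/bash
# Hypothetical helper: prints a read-only preview query for the heartbeat rows
# that the cleanup DELETE in the checklist would match for the given OLD primary.
heartbeat_preview_sql() {
    local old="$1"
    # %% in the printf format emits a literal % (the SQL LIKE wildcard).
    printf "select * from heartbeat where file like '%s%%';\n" "$old"
}

heartbeat_preview_sql db2179
```

The generated statement could then be passed to the checklist's own client call, for example `sudo db-mysql db2140 heartbeat -e "$(heartbeat_preview_sql db2179)"`.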

Event Timeline

Change 968968 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote db2140 to s4 master

https://gerrit.wikimedia.org/r/968968

Change 968969 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update s4-master alias

https://gerrit.wikimedia.org/r/968969

ABran-WMF triaged this task as Medium priority.
ABran-WMF added a subscriber: Marostegui.
ABran-WMF subscribed.

What's the date/timeline for this? -- n.b. That info is always needed. -- I wonder if it could be added to Maintenance_bot's default Task Descriptions?

And is it just testcommonswiki that is changing? If so, it probably doesn't need a Tech News entry. (But might still deserve local notice(s)? @Trizek-WMF might know our precedent for this part?)

Based on previous experiences, these switchovers are only announced in Tech News, as they are regularly scheduled (hence the use of Maintenance Bots), and the read-only time is a few seconds. See T303605: Stop announcing and scheduling primary database switchovers for the context.

(s4 is commons and testcommons.)

Yeah, we don't add user-notice anymore.

This way we will avoid this mistake in the future!

Mentioned in SAL (#wikimedia-operations) [2023-10-31T06:33:29Z] <arnaudb@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 34 hosts with reason: Primary switchover s4 T349820

Mentioned in SAL (#wikimedia-operations) [2023-10-31T06:33:56Z] <arnaudb@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: Primary switchover s4 T349820

Mentioned in SAL (#wikimedia-operations) [2023-10-31T06:36:48Z] <arnaudb@cumin1001> dbctl commit (dc=all): 'Set db2140 with weight 0 T349820', diff saved to https://phabricator.wikimedia.org/P53068 and previous config saved to /var/cache/conftool/dbconfig/20231031-063647-arnaudb.json

ABran-WMF updated the task description.

Change 968968 merged by Arnaudb:

[operations/puppet@production] mariadb: Promote db2140 to s4 master

https://gerrit.wikimedia.org/r/968968

Mentioned in SAL (#wikimedia-operations) [2023-10-31T07:02:47Z] <arnaudb> Starting s4 codfw failover from db2179 to db2140 - T349820

Mentioned in SAL (#wikimedia-operations) [2023-10-31T07:04:06Z] <arnaudb@cumin1001> dbctl commit (dc=all): 'Set s4 codfw as read-only for maintenance - T349820', diff saved to https://phabricator.wikimedia.org/P53070 and previous config saved to /var/cache/conftool/dbconfig/20231031-070405-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2023-10-31T07:05:49Z] <arnaudb@cumin1001> dbctl commit (dc=all): 'Promote db2140 to s4 primary and set section read-write T349820', diff saved to https://phabricator.wikimedia.org/P53071 and previous config saved to /var/cache/conftool/dbconfig/20231031-070549-arnaudb.json

Change 968969 merged by Arnaudb:

[operations/dns@master] wmnet: Update s4-master alias

https://gerrit.wikimedia.org/r/968969

ABran-WMF updated the task description.