
Switchover es4 codfw master es2021 -> es2020
Closed, Resolved (Public)

Description

When: Anytime, writes will be disabled

Checklist:

NEW primary: es2020
OLD primary: es2021

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=es2021.codfw.wmnet h=es2020.codfw.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover es4 T356372" 'A:db-section-es4'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance es2020 set-weight 0
sudo dbctl config commit -m "Set es2020 with weight 0 T356372"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move es2021 es2020
  • Disable puppet on both nodes
sudo cumin 'es2020* or es2021*' 'disable-puppet "primary switchover T356372"'
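
The prep commands above differ only in the host names, the section and the task ID, so they can be generated from those four values before anything is run. The following is a minimal dry-run sketch: `print_prep_commands` is a hypothetical helper name (not part of dbctl, cumin or db-switchover), and it only prints the commands from this checklist without executing them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the "Failover prep" commands from the
# checklist for review. It executes nothing and calls no production tooling.
print_prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    if [ -z "$old" ] || [ -z "$new" ] || [ -z "$section" ] || [ -z "$task" ]; then
        echo "usage: print_prep_commands OLD NEW SECTION TASK" >&2
        return 1
    fi
    # Guard against pasting the same host twice.
    if [ "$old" = "$new" ]; then
        echo "OLD and NEW primary must differ" >&2
        return 1
    fi
    # Unquoted heredoc: variables expand, quote characters are kept literally.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover $section $task" 'A:db-section-$section'
sudo dbctl instance $new set-weight 0
sudo dbctl config commit -m "Set $new with weight 0 $task"
sudo db-switchover --timeout=25 --only-slave-move $old $new
sudo cumin '$new* or $old*' 'disable-puppet "primary switchover $task"'
EOF
}

# Values taken from this task.
print_prep_commands es2021 es2020 es4 T356372
```

Printing rather than executing keeps the operator in control of each step; the output can be compared line by line with the checklist before it is pasted into a shell.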

Failover:

  • Log the failover:
!log Starting es4 codfw failover from es2021 to es2020 - T356372
  • Switch primaries:
sudo db-switchover --skip-slave-move es2021 es2020
echo "===== es2021 (OLD)"; sudo db-mysql es2021 -e 'show slave status\G'
echo "===== es2020 (NEW)"; sudo db-mysql es2020 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section es4 set-master es2020
sudo dbctl config commit -m "Promote es2020 to es4 primary T356372"
  • Restart puppet on both hosts:
sudo cumin 'es2021* or es2020*' 'run-puppet-agent -e "primary switchover T356372"'
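
The two `show slave status\G` calls in the switch step are read by eye. As a rough sketch, the check on the demoted host could be scripted by parsing that output. `replication_ok` is a hypothetical helper, and the sketch assumes the OLD primary is expected to be replicating (both threads reporting `Yes`) once the switch has completed.

```shell
#!/bin/sh
# Hypothetical helper: reads `show slave status\G` text on stdin and returns 0
# only if both Slave_IO_Running and Slave_SQL_Running are "Yes".
# Empty input (no replication configured) is treated as a failure.
replication_ok() {
    awk -F': ' '
        { sub(/^[ \t]+/, "", $1) }            # strip the \G left padding
        $1 == "Slave_IO_Running"  { io  = $2 }
        $1 == "Slave_SQL_Running" { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Demo on canned text; in practice the input would come from something like
#   sudo db-mysql es2021 -e 'show slave status\G' | replication_ok
sample='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes'
if printf '%s\n' "$sample" | replication_ok; then
    echo "replication threads running"
else
    echo "replication NOT healthy"
fi
```

Matching on the exact field name avoids confusing `Slave_SQL_Running` with the separate `Slave_SQL_Running_State` field that appears in the same output.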

Clean up tasks:

  • Clean up heartbeat table(s).
sudo db-mysql es2020 heartbeat -e "delete from heartbeat where file like 'es2021%';"
  • Change events for query killer:
events_coredb_master.sql on the new primary es2020
events_coredb_slave.sql on the new slave es2021
sudo dbctl instance es2021 set-candidate-master --section es4 true
sudo dbctl instance es2020 set-candidate-master --section es4 false
(dborch1001): sudo orchestrator-client -c untag -i es2020 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i es2021 --tag name=candidate
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'es4';"
  • (If needed): Depool es2021 for maintenance.
sudo dbctl instance es2021 depool
sudo dbctl config commit -m "Depool es2021 T356372"
  • Change es2021 weight to match the previous weight of es2020:
sudo dbctl instance es2021 edit
  • Update/resolve this ticket.
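
The heartbeat cleanup above deletes by host-name prefix, so a mistyped or empty host name would match the wrong rows. A minimal sketch of building that statement with a sanity check first is shown below; `heartbeat_cleanup_sql` is a hypothetical helper, and the "lowercase letters and digits only" rule is an assumption about host naming, not a documented constraint.

```shell
#!/bin/sh
# Hypothetical helper: prints the heartbeat cleanup statement for the OLD
# primary, refusing empty names or names with unexpected characters.
heartbeat_cleanup_sql() {
    case "$1" in
        ''|*[!a-z0-9]*)
            echo "unexpected host name: '$1'" >&2
            return 1
            ;;
    esac
    # %% emits a literal % for the SQL LIKE prefix match.
    printf "delete from heartbeat where file like '%s%%';\n" "$1"
}

# Print the statement for review; it could then be passed to
#   sudo db-mysql es2020 heartbeat -e "..."
heartbeat_cleanup_sql es2021
```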

Event Timeline

Marostegui renamed this task from Switchover es5 codfw master es2023 -> es2024 to Switchover es4 codfw master es2021 -> es2020. Feb 1 2024, 6:45 AM
Marostegui changed the task status from Open to Stalled.
Marostegui triaged this task as Medium priority.
Marostegui updated the task description.

This needs to be done after T355862: Migrate servers in codfw rack A3 from asw-a3-codfw to lsw1-a3-codfw (Thu Feb 8 16:00 UTC) is complete, and before T355870: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw (Feb 27 16:00 UTC).

Marostegui changed the task status from Stalled to Open. Feb 19 2024, 6:03 AM
Marostegui moved this task from Blocked to In progress on the DBA board.

This will happen tomorrow

Change 1004668 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] db-production.php: Disable writes on es4

https://gerrit.wikimedia.org/r/1004668

Change 1004669 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote es2020 to es4 master

https://gerrit.wikimedia.org/r/1004669

Change 1004670 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Promote es2020 to es4 master

https://gerrit.wikimedia.org/r/1004670

Change 1004668 merged by jenkins-bot:

[operations/mediawiki-config@master] db-production.php: Disable writes on es4

https://gerrit.wikimedia.org/r/1004668

Mentioned in SAL (#wikimedia-operations) [2024-02-20T05:50:27Z] <marostegui@deploy2002> Started scap: Backport for [[gerrit:1004668|db-production.php: Disable writes on es4 (T356372)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-20T05:52:02Z] <marostegui@deploy2002> marostegui: Backport for [[gerrit:1004668|db-production.php: Disable writes on es4 (T356372)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:00:04Z] <marostegui@deploy2002> Finished scap: Backport for [[gerrit:1004668|db-production.php: Disable writes on es4 (T356372)]] (duration: 09m 36s)

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:03:16Z] <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356372

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:03:33Z] <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es4 T356372

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:04:05Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set es2020 with weight 0 T356372', diff saved to https://phabricator.wikimedia.org/P57193 and previous config saved to /var/cache/conftool/dbconfig/20240220-060404-marostegui.json

Change 1004669 merged by Marostegui:

[operations/puppet@production] mariadb: Promote es2020 to es4 master

https://gerrit.wikimedia.org/r/1004669

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:08:25Z] <marostegui> Starting es4 codfw failover from es2021 to es2020 - T356372

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:08:53Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote es2020 to es4 primary T356372', diff saved to https://phabricator.wikimedia.org/P57194 and previous config saved to /var/cache/conftool/dbconfig/20240220-060852-marostegui.json

Change 1004670 merged by Marostegui:

[operations/dns@master] wmnet: Promote es2020 to es4 master

https://gerrit.wikimedia.org/r/1004670

Mentioned in SAL (#wikimedia-operations) [2024-02-20T06:10:25Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool es2021 T356372', diff saved to https://phabricator.wikimedia.org/P57195 and previous config saved to /var/cache/conftool/dbconfig/20240220-061025-marostegui.json

Marostegui updated the task description.

This was done