
Switchover es7 master (es2038 -> es2039)
Closed, Resolved, Public

Description

When: Anytime, writes will be disabled

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

Checklist:

NEW primary: es2039
OLD primary: es2038

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=es2038.codfw.wmnet h=es2039.codfw.wmnet

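  • Disable writes on es7 by merging and deploying the MediaWiki config patch (https://gerrit.wikimedia.org/r/1122609):

(in deployment.eqiad.org):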

scap backport 1122609
  • Check es7 is indeed read-only
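
A minimal sketch for reading the result of the configuration comparison step above without eyeballing the output. It assumes pt-config-diff's documented exit status (0 when there are no differences, 1 when there are); the helper name `report_config_diff` is hypothetical and not part of any WMF tooling.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a comparison command and describe its exit status.
# Assumption: 0 = no differences, 1 = differences found (pt-config-diff's
# documented behaviour); any other status is treated here as a failed run.
report_config_diff() {
  local rc=0
  "$@" || rc=$?
  case "$rc" in
    0) echo "no differences" ;;
    1) echo "differences found: review before switching" ;;
    *) echo "comparison failed"; return 2 ;;
  esac
}

# Usage (command taken from the checklist):
#   report_config_diff sudo pt-config-diff --defaults-file /root/.my.cnf \
#     h=es2038.codfw.wmnet h=es2039.codfw.wmnet
```

The wrapper only interprets the exit status; the diff itself is still printed by pt-config-diff and should be reviewed when differences are reported.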

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover es7 T387224" 'A:db-section-es7'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance es2039 set-weight 0
sudo dbctl config commit -m "Set es2039 with weight 0 T387224"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=25 --only-slave-move es2038 es2039
  • Disable puppet on both nodes
sudo cumin 'es2038* or es2039*' 'disable-puppet "primary switchover T387224"'

Failover:

  • Log the failover:
!log Starting es7 codfw failover from es2038 to es2039 - T387224
  • Switch primaries:
sudo db-switchover --skip-slave-move es2038 es2039
echo "===== es2038 (OLD)"; sudo db-mysql es2038 -e 'show slave status\G'
echo "===== es2039 (NEW)"; sudo db-mysql es2039 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope codfw section es7 set-master es2039
sudo dbctl config commit -m "Promote es2039 to es7 primary T387224"
  • Clean up heartbeat table(s).
sudo db-mysql es2039 heartbeat -e "delete from heartbeat where file like 'es2038%';"
  • Restart puppet on both hosts:
sudo cumin 'es2038* or es2039*' 'run-puppet-agent -e "primary switchover T387224"'
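
The two `show slave status\G` checks in the "Switch primaries" step produce a lot of output. As a rough sketch, the fields that matter for a quick post-switch sanity check can be filtered out as below. The field names are the standard MariaDB ones; the helper name `slave_status_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `show slave status\G` output on stdin and print
# only the replication source, thread states and lag as KEY=VALUE lines.
slave_status_summary() {
  awk -F': ' '
    /^ *(Master_Host|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master):/ {
      gsub(/^ +/, "", $1)          # strip the left padding added by \G output
      printf "%s=%s\n", $1, $2
    }
  '
}

# Usage (commands taken from the checklist):
#   sudo db-mysql es2038 -e 'show slave status\G' | slave_status_summary
#   sudo db-mysql es2039 -e 'show slave status\G' | slave_status_summary
```

After a successful switch, es2038 would be expected to report es2039 as its `Master_Host` with both threads running. What es2039 should show depends on the section's cross-datacenter topology, so no expected value is assumed here.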

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es2039
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql es2038
sudo dbctl instance es2038 set-candidate-master --section es7 true
sudo dbctl instance es2039 set-candidate-master --section es7 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i es2039 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i es2038 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'es7';"
  • (If needed): Depool es2038 for maintenance.
sudo dbctl instance es2038 depool
sudo dbctl config commit -m "Depool es2038 T387224"
  • Change es2038 weight to mimic the previous weight of es2039:
sudo dbctl instance es2038 edit
  • Enable writes on es7 by merging and deploying the revert of the MediaWiki config patch:

(in deployment.eqiad.org)

scap backport 922377

FIXME: That number should be the id of gerrit patch of the revert.

  • Update/resolve this ticket.
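
The query killer step above pipes a Gitiles `?format=TEXT` download (which is base64-encoded) straight into the database. As a hedged sketch, the payload can be buffered and validated first, so that an error page or an empty response never reaches `db-mysql`. The helper name `decode_or_fail` is hypothetical, and GNU `base64 -d` behaviour (non-zero exit on invalid input) is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decode base64 from stdin into a temp file and only
# emit it on stdout if decoding succeeded and the result is non-empty.
decode_or_fail() {
  local tmp
  tmp="$(mktemp)" || return 1
  if base64 -d > "$tmp" 2>/dev/null && [ -s "$tmp" ]; then
    cat "$tmp"
    rm -f "$tmp"
    return 0
  fi
  rm -f "$tmp"
  echo "refusing to continue: payload empty or not valid base64" >&2
  return 1
}

# Usage (URL and target host taken from the checklist; -f makes curl fail on
# HTTP errors instead of passing the error page down the pipe):
#   curl -sSf 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' \
#     | decode_or_fail | sudo db-mysql es2039
```

Buffering to a temporary file means nothing is sent downstream until the whole payload has decoded cleanly, at the cost of one extra file write.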

Event Timeline

Change #1122607 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote es2039 to es7 master

https://gerrit.wikimedia.org/r/1122607

Change #1122608 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/dns@master] wmnet: Update es7-master alias

https://gerrit.wikimedia.org/r/1122608

Change #1122609 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] db-production.php: Disable writes on es7

https://gerrit.wikimedia.org/r/1122609

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.
Marostegui added a parent task: Restricted Task.

Mentioned in SAL (#wikimedia-operations) [2025-02-25T15:51:39Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T387224

Change #1122609 merged by jenkins-bot:

[operations/mediawiki-config@master] db-production.php: Disable writes on es7

https://gerrit.wikimedia.org/r/1122609

Mentioned in SAL (#wikimedia-operations) [2025-02-25T15:52:29Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set es2039 with weight 0 T387224', diff saved to https://phabricator.wikimedia.org/P73582 and previous config saved to /var/cache/conftool/dbconfig/20250225-155229-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-02-25T15:53:04Z] <marostegui@deploy2002> Started scap sync-world: Backport for [[gerrit:1122609|db-production.php: Disable writes on es7 (T387224)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-25T15:57:43Z] <marostegui@deploy2002> marostegui: Backport for [[gerrit:1122609|db-production.php: Disable writes on es7 (T387224)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-25T16:04:13Z] <marostegui@deploy2002> Finished scap sync-world: Backport for [[gerrit:1122609|db-production.php: Disable writes on es7 (T387224)]] (duration: 11m 09s)

Change #1122607 merged by Marostegui:

[operations/puppet@production] mariadb: Promote es2039 to es7 master

https://gerrit.wikimedia.org/r/1122607

Mentioned in SAL (#wikimedia-operations) [2025-02-25T16:06:30Z] <marostegui> Starting es7 codfw failover from es2038 to es2039 - T387224

Mentioned in SAL (#wikimedia-operations) [2025-02-25T16:06:59Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote es2039 to es7 primary T387224', diff saved to https://phabricator.wikimedia.org/P73583 and previous config saved to /var/cache/conftool/dbconfig/20250225-160659-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-02-25T16:12:28Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es7 T387224

Mentioned in SAL (#wikimedia-operations) [2025-02-25T16:18:23Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set es2038 with weight 0 T387224', diff saved to https://phabricator.wikimedia.org/P73584 and previous config saved to /var/cache/conftool/dbconfig/20250225-161823-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2025-02-25T16:20:02Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote es2038 to es7 primary T387224', diff saved to https://phabricator.wikimedia.org/P73586 and previous config saved to /var/cache/conftool/dbconfig/20250225-162001-marostegui.json

I had to revert this change in the middle of the switch because es2039 became unresponsive.

Everything has been reverted and writes are enabled again on es7.
The real goal of this task was to upgrade the kernel of es2038, which was done anyway. So I guess it was a "success".

I'm leaving all the non-executed steps unticked, as they were never run here, although they were done as part of the revert.