
Switchover es6 master (es1038 -> es1037)
Closed, Resolved, Public

Description

When: Anytime - not in use

Prerequisites: https://wikitech.wikimedia.org/wiki/MariaDB/Primary_switchover

Checklist:

NEW primary: es1037
OLD primary: es1038

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=es1038.eqiad.wmnet h=es1037.eqiad.wmnet

Failover prep:

  • Silence alerts on all hosts:
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover es6 T387273" 'A:db-section-es6'
  • Set NEW primary with weight 0 (and depool it from API or vslow/dump groups if it is present).
sudo dbctl instance es1037 set-weight 0
sudo dbctl config commit -m "Set es1037 with weight 0 T387273"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --read-only-master --replicating-master --timeout=25 --only-slave-move es1038 es1037
  • Disable puppet on both nodes
sudo cumin 'es1038* or es1037*' 'disable-puppet "primary switchover T387273"'
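
The prep steps above differ between switchovers only in the host names, section and task ID. A minimal dry-run sketch, assuming a hypothetical helper `print_prep_commands` (not part of any existing tooling), prints the same commands with those values substituted so they can be reviewed before being run by hand. It executes nothing against the hosts.

```shell
#!/bin/sh
# Dry-run sketch: print the failover-prep commands for a given switchover.
# The function name and its arguments are hypothetical; the printed commands
# are copied from the checklist above with the variable parts substituted.
print_prep_commands() {
    old="$1"; new="$2"; section="$3"; task="$4"
    # Unquoted heredoc: variables expand, quotes and '*' stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 1 -r "Primary switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 0
sudo dbctl config commit -m "Set ${new} with weight 0 ${task}"
sudo db-switchover --read-only-master --replicating-master --timeout=25 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Arguments: OLD primary, NEW primary, section, task ID.
print_prep_commands es1038 es1037 es6 T387273
```

Printing rather than executing keeps a human in the loop for each step, which matches how the checklist is meant to be followed.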

Failover:

  • Log the failover:
!log Starting es6 eqiad failover from es1038 to es1037 - T387273
  • Switch primaries:
sudo db-switchover --read-only-master --replicating-master --skip-slave-move es1038 es1037
echo "===== es1038 (OLD)"; sudo db-mysql es1038 -e 'show slave status\G'
echo "===== es1037 (NEW)"; sudo db-mysql es1037 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section es6 set-master es1037
sudo dbctl --scope eqiad section es6 rw
sudo dbctl config commit -m "Promote es1037 to es6 primary and set section read-write T387273"
  • Clean up heartbeat table(s).
sudo db-mysql es1037 heartbeat -e "delete from heartbeat where file like 'es1038%';"
  • Restart puppet on both hosts:
sudo cumin 'es1038* or es1037*' 'run-puppet-agent -e "primary switchover T387273"'
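
The two `show slave status\G` commands in the switch step are read by eye. As a rough sketch, a hypothetical helper `replication_running` (an assumption, not an existing tool) could check the standard `Slave_IO_Running` and `Slave_SQL_Running` fields in that output. Which host is expected to be replicating after the switch depends on the topology, so treat the result as a hint only.

```shell
#!/bin/sh
# Hypothetical helper: read `show slave status\G` output on stdin and succeed
# only if both replication threads report "Yes". Empty input (no replication
# configured) counts as "not running".
replication_running() {
    awk -F': ' '
        /^[[:space:]]*Slave_IO_Running:/  { io = $2 }
        /^[[:space:]]*Slave_SQL_Running:/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Offline demonstration on a hand-written sample, not real host output.
printf '%s\n' \
    '             Slave_IO_Running: Yes' \
    '            Slave_SQL_Running: Yes' |
    replication_running && echo "replication threads running"
```

On a cumin host this might be used as `sudo db-mysql es1038 -e 'show slave status\G' | replication_running`, assuming the output keeps the usual `Field: value` layout.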

Clean up tasks:

  • Change events for query killer:
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_master.sql?format=TEXT' | base64 -d | sudo db-mysql es1037
curl -sS 'https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/refs/heads/master/dbtools/events_coredb_slave.sql?format=TEXT' | base64 -d | sudo db-mysql es1038
  • Update candidate primary dbctl and orchestrator notes
sudo dbctl instance es1038 set-candidate-master --section es6 true
sudo dbctl instance es1037 set-candidate-master --section es6 false
sudo cumin 'dborch*' 'orchestrator-client -c untag -i es1037 --tag name=candidate'
sudo cumin 'dborch*' 'orchestrator-client -c tag -i es1038 --tag name=candidate'
sudo db-mysql db1215 zarcillo -e "select * from masters where section = 'es6';"
  • (If needed): Depool es1038 for maintenance.
sudo dbctl instance es1038 depool
sudo dbctl config commit -m "Depool es1038 T387273"
  • Change es1038 weight to mimic the previous weight of es1037:
sudo dbctl instance es1038 edit
  • Update/resolve this ticket.
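
The query killer step above pipes a base64-encoded Gitiles response straight into the database. A cautious variant, sketched here with a hypothetical helper `decode_to_file` (an assumption, not an existing tool), decodes into a file first so that a failed or empty download is caught before anything is applied.

```shell
#!/bin/sh
# Hypothetical helper (not part of the checklist): decode a base64 stream from
# stdin into a file, and fail if decoding errors out or the result is empty.
decode_to_file() {
    out="$1"
    base64 -d > "$out" || return 1
    [ -s "$out" ] || { echo "decoded file is empty: $out" >&2; return 1; }
}

# Offline demonstration with a stand-in payload instead of a Gerrit download.
tmp="$(mktemp)"
printf 'SELECT 1;\n' | base64 | decode_to_file "$tmp" && cat "$tmp"
rm -f "$tmp"
```

In practice the `curl -sS '...?format=TEXT'` output from the checklist would be piped into `decode_to_file`, and the resulting file reviewed before being fed to `sudo db-mysql es1037`.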


Event Timeline

Change #1122894 had a related patch set uploaded (by Gerrit maintenance bot; author: Gerrit maintenance bot):

[operations/puppet@production] mariadb: Promote es1037 to es6 master

https://gerrit.wikimedia.org/r/1122894

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2025-02-26T12:06:49Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Set es1037 with weight 0 T387273', diff saved to https://phabricator.wikimedia.org/P73676 and previous config saved to /var/cache/conftool/dbconfig/20250226-120649-root.json

Mentioned in SAL (#wikimedia-operations) [2025-02-26T12:06:53Z] <marostegui@cumin1002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Primary switchover es6 T387273

Change #1122894 merged by Marostegui:

[operations/puppet@production] mariadb: Promote es1037 to es6 master

https://gerrit.wikimedia.org/r/1122894

Mentioned in SAL (#wikimedia-operations) [2025-02-26T12:07:44Z] <marostegui> Starting es6 eqiad failover from es1038 to es1037 - T387273

Mentioned in SAL (#wikimedia-operations) [2025-02-26T12:08:06Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Promote es1037 to es6 primary and set section read-write T387273', diff saved to https://phabricator.wikimedia.org/P73677 and previous config saved to /var/cache/conftool/dbconfig/20250226-120806-root.json

Mentioned in SAL (#wikimedia-operations) [2025-02-26T12:08:49Z] <marostegui@cumin1002> dbctl commit (dc=all): 'Depool es1038 T387273', diff saved to https://phabricator.wikimedia.org/P73678 and previous config saved to /var/cache/conftool/dbconfig/20250226-120848-root.json

Marostegui updated the task description.

This is done