switchover es4 master es1020 -> es1021
Closed, ResolvedPublic

Description

When: Tue 23rd 06:00 AM UTC

NEW primary: es1021
OLD primary: es1020

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=es1020.eqiad.wmnet h=es1021.eqiad.wmnet

Failover prep:

  • Downtime all hosts in the section for 2 hours:
sudo cookbook sre.hosts.downtime --hours 2 -r "Switchover es4 T315540" 'A:db-section-es4'
  • Set NEW primary with weight 10
sudo dbctl instance es1021 set-weight 10
sudo dbctl config commit -m "Set es1021 with weight 10 T315540"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move es1020 es1021
  • Disable puppet on both nodes
sudo cumin 'es1020* or es1021*' 'disable-puppet "primary switchover T315540"'
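The prep steps above can be collected into a dry-run script. This is only a sketch: the `run` and `prep` helpers and the `DRY_RUN` variable are hypothetical names, not part of any existing tooling. The commands themselves are copied from the checklist. By default it only prints what it would execute.

```shell
#!/bin/sh
# Sketch only: wraps the "Failover prep" commands from the checklist.
# run, prep and DRY_RUN are hypothetical names introduced for illustration.
set -eu

OLD=es1020
NEW=es1021
SECTION=es4
TASK=T315540

# Print the command when DRY_RUN is 1 (the default), execute it otherwise.
# Note: the printed line does not preserve the original quoting.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# The prep commands, in the same order as the checklist.
prep() {
    run sudo cookbook sre.hosts.downtime --hours 2 -r "Switchover ${SECTION} ${TASK}" "A:db-section-${SECTION}"
    run sudo dbctl instance "$NEW" set-weight 10
    run sudo dbctl config commit -m "Set ${NEW} with weight 10 ${TASK}"
    run sudo db-switchover --timeout=15 --only-slave-move "$OLD" "$NEW"
    run sudo cumin "${OLD}* or ${NEW}*" "disable-puppet \"primary switchover ${TASK}\""
}

prep
```

Running it as-is prints the five commands. Setting `DRY_RUN=0` would execute them, and `set -e` stops the script at the first command that fails.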

Failover:

  • Log the failover:
!log Starting es4 eqiad failover from es1020 to es1021 - T315540
  • Switch primaries:
sudo db-switchover --skip-slave-move es1020 es1021
echo "===== es1020 (OLD)"; sudo db-mysql es1020 -e 'show slave status\G'
echo "===== es1021 (NEW)"; sudo db-mysql es1021 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section es4 set-master es1021
sudo dbctl config commit -m "Promote es1021 to es4 primary T315540"
  • Restart puppet on both hosts:
sudo cumin 'es1020* or es1021*' 'run-puppet-agent -e "primary switchover T315540"'
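The two `show slave status\G` checks above are read by eye. A minimal sketch of automating that check follows, assuming the standard MariaDB field names `Slave_IO_Running` and `Slave_SQL_Running`. The `replication_ok` helper is hypothetical and is not part of db-mysql or db-switchover.

```shell
#!/bin/sh
# Sketch: read `show slave status\G` output on stdin and succeed only if
# both replication threads report "Yes".
# replication_ok is a hypothetical helper introduced for illustration.
replication_ok() {
    awk -F': ' '
        $1 ~ /Slave_IO_Running$/  { io = $2 }
        $1 ~ /Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}

# Intended usage (needs access to the hosts, so it is left commented out):
#   sudo db-mysql es1020 -e 'show slave status\G' | replication_ok \
#       && echo "es1020 (OLD) is replicating"
```

Empty input makes the helper return non-zero. That is presumably what the NEW primary (es1021) produces after the switch, since it should no longer replicate from anything, so a failure there is expected and not an error.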

Clean up tasks:

  • Clean up heartbeat table(s).
  • Change events for query killer:
events_coredb_master.sql on the new primary es1021
events_coredb_slave.sql on the new slave es1020
  • Update the candidate primary in dbctl and orchestrator:
sudo dbctl instance es1020 set-candidate-master --section es4 true
sudo dbctl instance es1021 set-candidate-master --section es4 false
(dborch1001): sudo orchestrator-client -c untag -i es1021 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i es1020 --tag name=candidate
  • Depool the OLD primary for reboot:
sudo dbctl instance es1020 depool
sudo dbctl config commit -m "Depool es1020 for reboot T310485"

Event Timeline

Marostegui updated the task description.
Marostegui updated the task description.
Marostegui moved this task from Triage to In progress on the DBA board.

Not tagging User-notice, as this will have no impact on reads or writes: I will disable writes on es4 (they will go to es5 instead), and reads will be unaffected.

Change 825234 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] db-production: Set es4 as RO

https://gerrit.wikimedia.org/r/825234

Change 825234 merged by jenkins-bot:

[operations/mediawiki-config@master] db-production: Set es4 as RO

https://gerrit.wikimedia.org/r/825234

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:11:17Z] <marostegui@deploy1002> Synchronized wmf-config/db-production.php: Disable writes on es4 T315540 (duration: 03m 35s)

Change 825245 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote es1021 to es4 master

https://gerrit.wikimedia.org/r/825245

Change 825247 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update es-master CNAME

https://gerrit.wikimedia.org/r/825247

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:17:55Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es4 T315540

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:18:01Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es4 T315540

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:18:18Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set es1021 with weight 10 T315540', diff saved to https://phabricator.wikimedia.org/P32692 and previous config saved to /var/cache/conftool/dbconfig/20220822-081817-root.json

Change 825245 merged by Marostegui:

[operations/puppet@production] mariadb: Promote es1021 to es4 master

https://gerrit.wikimedia.org/r/825245

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:21:20Z] <marostegui> Starting es4 eqiad failover from es1020 to es1021 - T315540

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:22:09Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote es1021 to es4 primary T315540', diff saved to https://phabricator.wikimedia.org/P32694 and previous config saved to /var/cache/conftool/dbconfig/20220822-082208-root.json

Change 825247 merged by Marostegui:

[operations/dns@master] wmnet: Update es-master CNAME

https://gerrit.wikimedia.org/r/825247

Mentioned in SAL (#wikimedia-operations) [2022-08-22T08:32:23Z] <marostegui@deploy1002> Synchronized wmf-config/db-production.php: Enable writes on es4 T315540 (duration: 03m 17s)

Marostegui updated the task description.

All done
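For reference, the length of the write-disabled window on es4 can be estimated from the two SAL entries for db-production.php (08:11:17Z "Disable writes" and 08:32:23Z "Enable writes"). The sketch below assumes both timestamps fall on the same day and mark comparable points of each sync. The `to_seconds` helper is a hypothetical name.

```shell
#!/bin/sh
# Sketch: estimate the es4 write-disabled window from the SAL timestamps.
# to_seconds is a hypothetical helper; it expects zero-padded HH:MM:SS.
to_seconds() {
    h=${1%%:*}
    rest=${1#*:}
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values like "08" are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

ro_start=08:11:17   # SAL: "Disable writes on es4 T315540"
ro_end=08:32:23     # SAL: "Enable writes on es4 T315540"

window=$(( $(to_seconds "$ro_end") - $(to_seconds "$ro_start") ))
echo "es4 write-disabled window: $((window / 60))m $((window % 60))s"
# prints "es4 write-disabled window: 21m 6s"
```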