switchover es5 master es1023 -> es1024
Closed, Resolved (Public)

Description

When: Thu 25th 06:00 AM UTC

NEW primary: es1024
OLD primary: es1023

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=es1023.eqiad.wmnet h=es1024.eqiad.wmnet

Failover prep:

  • Downtime all es5 hosts for 2 hours:
sudo cookbook sre.hosts.downtime --hours 2 -r "Switchover es5 T315542" 'A:db-section-es5'
  • Set NEW primary with weight 10
sudo dbctl instance es1024 set-weight 10
sudo dbctl config commit -m "Set es1024 with weight 10 T315542"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move es1023 es1024
  • Disable puppet on both nodes
sudo cumin 'es1023* or es1024*' 'disable-puppet "primary switchover T315542"'
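
The prep steps above differ between switchovers only in the host names, section and task ID. A minimal sketch that prints them (it does not run anything) for a given pair of hosts; the function name print_prep_commands and its argument order are hypothetical, and the commands themselves are copied from the steps above:

```shell
# Print, without executing, the failover prep commands for one switchover.
# Arguments (illustrative, not part of any tool): OLD NEW SECTION TASK
print_prep_commands() {
    local old="$1" new="$2" section="$3" task="$4"
    # Unquoted heredoc: variables expand, quotes and '*' stay literal.
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 2 -r "Switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 10
sudo dbctl config commit -m "Set ${new} with weight 10 ${task}"
sudo db-switchover --timeout=15 --only-slave-move ${old} ${new}
sudo cumin '${old}* or ${new}*' 'disable-puppet "primary switchover ${task}"'
EOF
}
```

Usage: print_prep_commands es1023 es1024 es5 T315542, then review the output before pasting each line.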

Failover:

  • Log the failover:
!log Starting es5 eqiad failover from es1023 to es1024 - T315542
  • Switch primaries:
sudo db-switchover --skip-slave-move es1023 es1024
echo "===== es1023 (OLD)"; sudo db-mysql es1023 -e 'show slave status\G'
echo "===== es1024 (NEW)"; sudo db-mysql es1024 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section es5 set-master es1024
sudo dbctl config commit -m "Promote es1024 to es5 primary T315542"
  • Re-enable puppet on both hosts and run the agent:
sudo cumin 'es1023* or es1024*' 'run-puppet-agent -e "primary switchover T315542"'
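
The two show slave status\G calls above are read by eye. A rough sketch of checking the same output mechanically: Slave_IO_Running and Slave_SQL_Running are standard fields of SHOW SLAVE STATUS, while the function name replication_healthy is hypothetical. It assumes the OLD primary should be replicating from the NEW one after the switch; empty output (no replication configured) counts as not healthy.

```shell
# Read `show slave status\G` output on stdin; succeed only if both
# replication threads report "Yes". Empty input fails the check.
replication_healthy() {
    awk -F': ' '
        # Anchored match so Slave_SQL_Running_State is not picked up.
        $1 ~ /^ *Slave_IO_Running$/  { io  = $2 }
        $1 ~ /^ *Slave_SQL_Running$/ { sql = $2 }
        END { exit !(io == "Yes" && sql == "Yes") }
    '
}
```

Usage: sudo db-mysql es1023 -e 'show slave status\G' | replication_healthy && echo "es1023 is replicating"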

Clean up tasks:

  • Clean up heartbeat table(s).
  • Change events for query killer:
events_coredb_master.sql on the new primary es1024
events_coredb_slave.sql on the new replica es1023
  • Swap the candidate primary in dbctl and orchestrator:
sudo dbctl instance es1023 set-candidate-master --section es5 true
sudo dbctl instance es1024 set-candidate-master --section es5 false
(dborch1001): sudo orchestrator-client -c untag -i es1024 --tag name=candidate
(dborch1001): sudo orchestrator-client -c tag -i es1023 --tag name=candidate
  • Depool the old primary for reboot:
sudo dbctl instance es1023 depool
sudo dbctl config commit -m "Depool es1023 for reboot T315542"
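
The candidate flag is tracked in two places (dbctl and orchestrator), so both have to be changed together. A small sketch that prints the four candidate commands from the clean up steps above for a given host pair, so the two stay in sync; the function name print_candidate_swap is hypothetical, and nothing is executed:

```shell
# Print, without executing, the commands that move the candidate flag
# from the NEW primary to the OLD one. Arguments: OLD NEW SECTION
print_candidate_swap() {
    local old="$1" new="$2" section="$3"
    cat <<EOF
sudo dbctl instance ${old} set-candidate-master --section ${section} true
sudo dbctl instance ${new} set-candidate-master --section ${section} false
sudo orchestrator-client -c untag -i ${new} --tag name=candidate
sudo orchestrator-client -c tag -i ${old} --tag name=candidate
EOF
}
```

Usage: print_candidate_swap es1023 es1024 es5. The two orchestrator-client lines are meant to be run on dborch1001, as in the steps above.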

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.
Marostegui updated the task description.

Not tagging User-notice, as this will have no user-facing impact on reads or writes. I will disable writes on es5 (they will go to es4 instead), and reads will be unaffected.

Change 825328 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] db-production.php: Disable writes on es5

https://gerrit.wikimedia.org/r/825328

Change 825330 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Switchover es5 master

https://gerrit.wikimedia.org/r/825330

Change 825331 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Update es5-master

https://gerrit.wikimedia.org/r/825331

Mentioned in SAL (#wikimedia-operations) [2022-08-22T11:47:23Z] <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es5 T315542

Mentioned in SAL (#wikimedia-operations) [2022-08-22T11:47:39Z] <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es5 T315542

Change 825328 merged by jenkins-bot:

[operations/mediawiki-config@master] db-production.php: Disable writes on es5

https://gerrit.wikimedia.org/r/825328

Mentioned in SAL (#wikimedia-operations) [2022-08-22T11:51:13Z] <marostegui@deploy1002> Synchronized wmf-config/db-production.php: Disable writes on es5 T315542 (duration: 03m 08s)

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:01:41Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set es1024 with weight 10 T315542', diff saved to https://phabricator.wikimedia.org/P32726 and previous config saved to /var/cache/conftool/dbconfig/20220822-120141-root.json

Change 825330 merged by Marostegui:

[operations/puppet@production] mariadb: Switchover es5 master

https://gerrit.wikimedia.org/r/825330

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:05:33Z] <marostegui> Starting es5 eqiad failover from es1023 to es1024 - T315542

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:06:11Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote es1024 to es5 primary T315542', diff saved to https://phabricator.wikimedia.org/P32727 and previous config saved to /var/cache/conftool/dbconfig/20220822-120611-root.json

Change 825331 merged by Marostegui:

[operations/dns@master] wmnet: Update es5-master

https://gerrit.wikimedia.org/r/825331

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:13:23Z] <marostegui@deploy1002> Synchronized wmf-config/db-production.php: Enable writes on es5 T315542 (duration: 03m 18s)

Mentioned in SAL (#wikimedia-operations) [2022-08-22T12:14:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1023 for reboot T315542', diff saved to https://phabricator.wikimedia.org/P32728 and previous config saved to /var/cache/conftool/dbconfig/20220822-121401-root.json

Marostegui updated the task description.

All done