Switchover es4 master es1021 -> es1020
Closed, ResolvedPublic

Description

When: 2 February 2022 at 09:00 UTC

NEW primary: es1020
OLD primary: es1021

  • Check configuration differences between new and old primary:
sudo pt-config-diff --defaults-file /root/.my.cnf h=es1020.eqiad.wmnet h=es1021.eqiad.wmnet

Failover prep:

  • Set downtime on all hosts in the section:
sudo cookbook sre.hosts.downtime --hours 2 -r "Switchover es4 T300127" 'A:db-section-es4'
  • Set NEW primary with weight 10
sudo dbctl instance es1020 set-weight 10
sudo dbctl config commit -m "Set es1020 with weight 10 T300127"
  • Topology changes, move all replicas under NEW primary
sudo db-switchover --timeout=15 --only-slave-move es1021 es1020
  • Disable puppet on both nodes
sudo cumin 'es1020* or es1021*' 'disable-puppet "primary switchover T300127"'
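The prep steps above differ between switchovers only in the section, the two hosts, and the task ID. A minimal dry-run sketch, assuming a hypothetical helper named print_prep_commands (not part of any WMF tooling), that prints the same commands for review without executing anything:

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not existing tooling): prints the
# failover-prep commands from the checklist for a given section and host pair.
# It only echoes the commands; nothing is executed against production.
set -eu

print_prep_commands() {
    section="$1"   # e.g. es4
    old="$2"       # OLD primary, e.g. es1021
    new="$3"       # NEW primary, e.g. es1020
    task="$4"      # task ID, e.g. T300127
    cat <<EOF
sudo cookbook sre.hosts.downtime --hours 2 -r "Switchover ${section} ${task}" 'A:db-section-${section}'
sudo dbctl instance ${new} set-weight 10
sudo dbctl config commit -m "Set ${new} with weight 10 ${task}"
sudo db-switchover --timeout=15 --only-slave-move ${old} ${new}
sudo cumin '${new}* or ${old}*' 'disable-puppet "primary switchover ${task}"'
EOF
}

# Print the prep commands for this task.
print_prep_commands es4 es1021 es1020 T300127
```

Printing rather than running keeps the operator in control of each step, and the output can be compared against the checklist before anything is pasted into a shell.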

Failover:

  • Log the failover:
!log Starting es4 eqiad failover from es1021 to es1020 - T300127
  • Switch primaries:
sudo db-switchover --skip-slave-move es1021 es1020
echo "===== es1021 (OLD)"; sudo db-mysql es1021 -e 'show slave status\G'
echo "===== es1020 (NEW)"; sudo db-mysql es1020 -e 'show slave status\G'
  • Promote NEW primary in dbctl, and remove read-only
sudo dbctl --scope eqiad section es4 set-master es1020
sudo dbctl config commit -m "Promote es1020 to es4 primary T300127"
  • Restart puppet on both hosts:
sudo cumin 'es1020* or es1021*' 'run-puppet-agent -e "primary switchover T300127"'
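The two `show slave status\G` checks in the failover steps are read by eye. A minimal sketch of how that output could be checked mechanically, assuming the standard `Master_Host`, `Slave_IO_Running` and `Slave_SQL_Running` fields and assuming the replica reports its primary by FQDN (for example es1020.eqiad.wmnet). The function names slave_status_field and replicates_from are hypothetical:

```shell
#!/bin/sh
# Hypothetical helpers (assumptions, not existing tooling) for reading the
# vertical (\G) output of `show slave status` from standard input.

# Print the value of a single field, e.g. Master_Host.
slave_status_field() {
    awk -v key="$1" -F': ' '
        { gsub(/^[ \t]+/, "", $1) }      # drop the right-alignment padding
        $1 == key { print $2; exit }     # exact match on the field name
    '
}

# Succeed only if the status on stdin shows replication from the expected
# primary ($1) with both the IO and SQL threads running.
replicates_from() {
    status="$(cat)"
    host="$(printf '%s\n' "$status" | slave_status_field Master_Host)"
    io="$(printf '%s\n' "$status" | slave_status_field Slave_IO_Running)"
    sql="$(printf '%s\n' "$status" | slave_status_field Slave_SQL_Running)"
    [ "$host" = "$1" ] && [ "$io" = "Yes" ] && [ "$sql" = "Yes" ]
}

# Intended use after the switch (not run here):
#   sudo db-mysql es1021 -e 'show slave status\G' | replicates_from es1020.eqiad.wmnet
```

An empty status, which is what a host with no replication configured returns, makes replicates_from fail, so the same check distinguishes a replica from a host that is not replicating at all.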

Clean up tasks:

  • Clean up heartbeat table(s).
  • Change events for query killer:
events_coredb_master.sql on the new primary es1020
events_coredb_slave.sql on the new slave es1021
sudo dbctl instance es1021 set-candidate-master --section es4 true
sudo dbctl instance es1020 set-candidate-master --section es4 false
sudo dbctl instance es1021 depool
sudo dbctl config commit -m "Depool es1021 until it's reimaged to buster T300127"

Event Timeline

Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui updated the task description.

Change 758715 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/mediawiki-config@master] db-production.php: Disable writes on es4

https://gerrit.wikimedia.org/r/758715

Change 758716 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote es1020 to es4 master

https://gerrit.wikimedia.org/r/758716

Change 758717 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/dns@master] wmnet: Promote es1020 to es4 master

https://gerrit.wikimedia.org/r/758717

Change 758715 merged by jenkins-bot:

[operations/mediawiki-config@master] db-production.php: Disable writes on es4

https://gerrit.wikimedia.org/r/758715

Mentioned in SAL (#wikimedia-operations) [2022-02-02T07:30:51Z] <marostegui@deploy1002> Synchronized wmf-config/ProductionServices.php: Disable writes on es4 T300127 (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2022-02-02T07:36:12Z] <marostegui@deploy1002> Synchronized wmf-config/db-production.php: Disable writes on es4 T300127 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2022-02-02T07:38:53Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Switchover es4 T300127

Mentioned in SAL (#wikimedia-operations) [2022-02-02T07:38:58Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Switchover es4 T300127

Mentioned in SAL (#wikimedia-operations) [2022-02-02T07:39:18Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Set es1020 with weight 10 T300127', diff saved to https://phabricator.wikimedia.org/P19890 and previous config saved to /var/cache/conftool/dbconfig/20220202-073918-root.json

Change 758716 merged by Marostegui:

[operations/puppet@production] mariadb: Promote es1020 to es4 master

https://gerrit.wikimedia.org/r/758716

All pre-failover steps are done

Mentioned in SAL (#wikimedia-operations) [2022-02-02T08:48:50Z] <root@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Switchover es4 T300127

Mentioned in SAL (#wikimedia-operations) [2022-02-02T08:48:55Z] <root@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Switchover es4 T300127

Mentioned in SAL (#wikimedia-operations) [2022-02-02T09:00:05Z] <marostegui> Starting es4 eqiad failover from es1021 to es1020 - T300127

Mentioned in SAL (#wikimedia-operations) [2022-02-02T09:01:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Promote es1020 to es4 primary and set section read-write T300127', diff saved to https://phabricator.wikimedia.org/P19899 and previous config saved to /var/cache/conftool/dbconfig/20220202-090121-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2022-02-02T09:07:08Z] <marostegui@deploy1002> Synchronized wmf-config/db-production.php: Enable writes on es4 T300127 (duration: 00m 50s)

Change 758717 merged by Marostegui:

[operations/dns@master] wmnet: Promote es1020 to es4 master

https://gerrit.wikimedia.org/r/758717

Mentioned in SAL (#wikimedia-operations) [2022-02-02T09:13:55Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1021 T300127', diff saved to https://phabricator.wikimedia.org/P19901 and previous config saved to /var/cache/conftool/dbconfig/20220202-091355-marostegui.json

Marostegui updated the task description.

All done