
Switchover es4 master from es1020 to es1021
Closed, Resolved, Public

Description

This will allow reimaging of es1020 to buster.

Following documentation from MariaDB#External_store_section_failover_checklist

We'll move writes to es5 while we do the switchover.

Steps:

  • Check out wmfmariadbpy on cumin1001: cd ~; git clone https://gerrit.wikimedia.org/r/operations/software/wmfmariadbpy
  • Check out operations/software on cumin1001: cd ~; git clone https://gerrit.wikimedia.org/r/operations/software.git
  • Check current topology: sudo PYTHONPATH=~/wmfmariadbpy ~/wmfmariadbpy/wmfmariadbpy/replication_tree.py es1020
  • Compare old and new master: pt-config-diff h=es1020.eqiad.wmnet,F=/root/.my.cnf h=es1021.eqiad.wmnet,F=/root/.my.cnf
  • Downtime alerts for all es4 hosts
  • Set es1021 (new master) to weight 50:
dbctl instance es1021 set-weight 50
dbctl config commit -m "Set es1021 to weight 50 T257847"
  • Move all slaves below es1021: sudo PYTHONPATH=~/wmfmariadbpy ~/wmfmariadbpy/wmfmariadbpy/switchover.py --timeout=15 --only-slave-move es1020.eqiad.wmnet es1021.eqiad.wmnet
  • Confirm the topology change: sudo PYTHONPATH=~/wmfmariadbpy ~/wmfmariadbpy/wmfmariadbpy/replication_tree.py es1020
  • Disable puppet on es1020 and es1021: cumin 'es102[0-1].eqiad.wmnet' "puppet agent --disable 'switchover to es1021'"
  • Merge puppet CR to change es4 master: https://gerrit.wikimedia.org/r/c/operations/puppet/+/612551
  • Start the failover: !log Starting es4 failover from es1020 to es1021 T257847
  • Merge mediawiki-config CR to disable es4 writes: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/612559
  • Deploy above MW change from deploy1001: cd /srv/mediawiki-staging/; git status; git fetch; git rebase; scap sync-file wmf-config/db-eqiad.php "Disable writes to es4 T257847"
  • Check that es4 is indeed read-only (only heartbeat update statements in mysqlbinlog)
  • Do the switchover:
sudo PYTHONPATH=~/wmfmariadbpy ~/wmfmariadbpy/wmfmariadbpy/switchover.py --skip-slave-move es1020.eqiad.wmnet es1021.eqiad.wmnet
echo "=====> es1020"
sudo -i mysql.py -h es1020 -e "show slave status\G"
echo "=====> es1021"
sudo -i mysql.py -h es1021 -e "show slave status\G"
  • Confirm the topology change: sudo PYTHONPATH=~/wmfmariadbpy ~/wmfmariadbpy/wmfmariadbpy/replication_tree.py es1021
  • Promote es1021 to master in etcd, leave es1020 (old master) with weight 0:
dbctl --scope eqiad section es4 set-master es1021
dbctl config commit -m "Promote es1021 to es4 master T257847"
  • Re-start puppet on both nodes: cumin 'es102[0-1].eqiad.wmnet' "run-puppet-agent -e 'switchover to es1021'"
  • Re-enable es4 on MW: REVERT https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/612559
  • Deploy above MW change from deploy1001: cd /srv/mediawiki-staging/; git status; git fetch; git rebase; scap sync-file wmf-config/db-eqiad.php "Re-enable writes to es4 T257847"
  • Change events for query killer:
sudo -i mysql.py -h es1020 < ~/software/dbtools/events_coredb_slave.sql
sudo -i mysql.py -h es1021 < ~/software/dbtools/events_coredb_master.sql
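The two "show slave status\G" checks in the switchover step can also be verified mechanically. The following is a minimal sketch, not part of wmfmariadbpy: the helper names parse_slave_status and replicates_from are hypothetical, and it assumes the usual vertical output layout of one "Field: value" pair per line.

```python
def parse_slave_status(text):
    # Turn the vertical output of "show slave status" into a dict.
    # Assumes one "Field: value" pair per line (hypothetical helper).
    fields = {}
    for line in text.splitlines():
        if line.lstrip().startswith("*"):
            # Row separator such as "****** 1. row ******"
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def replicates_from(text, expected_master):
    # True only if both replication threads run and point at expected_master.
    status = parse_slave_status(text)
    return (
        status.get("Master_Host") == expected_master
        and status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
    )
```

After the switchover, the output captured from es1020 would be expected to pass replicates_from(output, "es1021.eqiad.wmnet"), while empty output (which is what a host that is not replicating returns) yields False.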

Date & time: 2020-07-21 (Tuesday) at 07:00 AM UTC

Event Timeline

Kormat updated the task description.

Check that es4 is indeed read-only (How?)

Once you've deployed the RO patch, run show master status; on the master to find its current binlog, inspect that file with mysqlbinlog, and check that the only activity is on the heartbeat table.
This should coincide with a drop in the writes/activity graph for the current master.
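As a rough illustration of that check, the sketch below scans text already decoded by mysqlbinlog and lists any write statement that does not touch the heartbeat table. It rests on assumptions not stated in this task: that writes appear as lines beginning with INSERT, UPDATE, DELETE or REPLACE (optionally prefixed with "### ", as in verbose row-event output), and that the heartbeat table can be recognised by the substring "heartbeat". The function name is hypothetical.

```python
import re

# Assumption: DML shows up at the start of a line, optionally after "### ".
WRITE_RE = re.compile(r"^(?:###\s+)?(INSERT|UPDATE|DELETE|REPLACE)\b", re.IGNORECASE)


def non_heartbeat_writes(binlog_text, heartbeat_marker="heartbeat"):
    # Return write statements that do not mention the heartbeat table.
    offending = []
    for line in binlog_text.splitlines():
        stripped = line.strip()
        if WRITE_RE.match(stripped) and heartbeat_marker not in stripped.lower():
            offending.append(stripped)
    return offending
```

An empty result over the most recent part of the binlog would be consistent with the section being read-only; any returned line deserves a closer look before continuing.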

Marostegui triaged this task as Medium priority. Jul 14 2020, 4:42 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Change 612551 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Promote es1021 to es4 master.

https://gerrit.wikimedia.org/r/612551

Change 612559 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/mediawiki-config@master] db-eqiad.php: Depool cluster26 (es4) from writes.

https://gerrit.wikimedia.org/r/612559

Change 612560 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/dns@master] wmnet: Update es4-master alias

https://gerrit.wikimedia.org/r/612560

Kormat updated the task description.

Mentioned in SAL (#wikimedia-operations) [2020-07-21T06:54:57Z] <kormat@cumin1001> dbctl commit (dc=all): 'Set es1021 to weight 50 T257847', diff saved to https://phabricator.wikimedia.org/P11974 and previous config saved to /var/cache/conftool/dbconfig/20200721-065457-kormat.json

Change 612551 merged by Kormat:
[operations/puppet@production] mariadb: Promote es1021 to es4 master.

https://gerrit.wikimedia.org/r/612551

Mentioned in SAL (#wikimedia-operations) [2020-07-21T06:59:51Z] <kormat> Starting es4 failover from es1020 to es1021 T257847

Change 612559 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool cluster26 (es4) from writes.

https://gerrit.wikimedia.org/r/612559

Mentioned in SAL (#wikimedia-operations) [2020-07-21T07:03:07Z] <kormat@deploy1001> Synchronized wmf-config/db-eqiad.php: Disable writes to es4 T257847 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2020-07-21T07:21:27Z] <kormat@cumin1001> dbctl commit (dc=all): 'Promote es1021 to es4 master T257847', diff saved to https://phabricator.wikimedia.org/P11975 and previous config saved to /var/cache/conftool/dbconfig/20200721-072127-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-07-21T07:22:52Z] <kormat@cumin1001> dbctl commit (dc=all): 'Depool es1020 from es4 T257847', diff saved to https://phabricator.wikimedia.org/P11976 and previous config saved to /var/cache/conftool/dbconfig/20200721-072251-kormat.json

Mentioned in SAL (#wikimedia-operations) [2020-07-21T07:29:01Z] <kormat@deploy1001> Synchronized wmf-config/db-eqiad.php: Re-enable writes to es4 T257847 (duration: 00m 57s)

Change 612560 merged by Kormat:
[operations/dns@master] wmnet: Update es4-master alias

https://gerrit.wikimedia.org/r/612560

Kormat updated the task description.

All done.

Congratulations on handling your first switchover!

I will keep using es1022 for backups unless you tell me not to.