Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC
Closed, Resolved (Public)

Description

At some point we'd need to fail over db1109 (row D) to db1104 (row B).
There are some reasons to do so:

  • PDUs on row D will eventually need to be replaced, as happened with rows A and B (T226778)
  • We currently have many masters on row D, db1109 being one of them.
wikidatawiki

Date: 2020-09-29 08:00 UTC

Event Timeline

It would be great if the rebuildTermItems job on mwmaint1002 were disabled (for example, killed) right before the failover. I'm worried the script will skip lots of items because of the failover.

This one?

# Puppet Name: wikidata-rebuildItemTerms
30 * * * * /usr/bin/timeout 3500s /usr/local/bin/mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki wikidatawiki --batch-size 150 --sleep 2 --from-id $(/bin/sed -n '/Rebuilding Q[[:digit:]]\+ till Q\([[:digit:]]\+\)/ { s//\1/; p; }' /var/log/wikidata/wikidata-rebuildItemTerms.log* | /usr/bin/sort -rn | /usr/bin/head -1) >> /var/log/wikidata/wikidata-rebuildItemTerms.log 2>&1
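
For reference, the --from-id value in this cron entry is derived from the job's own log: the sed expression prints the upper ID of every "Rebuilding Q... till Q..." line, and sort -rn | head -1 keeps the highest one, so each run appears to resume from the last range it logged. A minimal sketch of that extraction follows. The sample log lines and the temporary file are assumptions made up for illustration; only the sed/sort/head pipeline is taken from the cron entry, and \+ relies on GNU sed.

```shell
# Hypothetical sample log; the line format is inferred from the regex in the cron entry.
log=$(mktemp)
cat > "$log" <<'EOF'
Rebuilding Q100 till Q250
Rebuilding Q250 till Q400
some unrelated line
Rebuilding Q400 till Q550
EOF

# Same extraction as the cron entry: for every matching line, replace the
# match with the captured "till" ID (the empty regex in s//\1/ reuses the
# address regex), then sort numerically in descending order and keep the top.
last_id=$(sed -n '/Rebuilding Q[[:digit:]]\+ till Q\([[:digit:]]\+\)/ { s//\1/; p; }' "$log" | sort -rn | head -1)

echo "$last_id"
rm -f "$log"
```

Because the resume point comes only from what was logged, any range logged while the database was unavailable would not be retried, which is consistent with the concern raised above.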
jcrespo changed the task status from Open to Stalled. Apr 29 2020, 7:30 AM
Marostegui changed the task status from Stalled to Open. Aug 28 2020, 11:37 AM
Kormat renamed this task from "Switchover s8 primary database master db1109 -> db1104 - Date TBD" to "Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC". Sep 24 2020, 2:18 PM
Kormat updated the task description.

Steps and checklist:

Preparation

NEW master: db1104
OLD master: db1109

  • Check current topology: db-replication-tree db1109
  • Check configuration differences between new and old master: pt-config-diff h=db1109.eqiad.wmnet,F=/root/.my.cnf h=db1104.eqiad.wmnet,F=/root/.my.cnf
  • Silence alerts on all hosts: cookbook sre.hosts.downtime --hours 1 --reason "switchover to db1104 T239238" '(A:db-section-s8 and A:eqiad) or A:db-labsdb'
  • Set NEW master with weight 0: dbctl instance db1104 set-weight 0 && dbctl config commit -m "Set db1104 with weight 0 T239238"
  • Topology changes, connect everything to db1104: db-switchover --timeout=15 --replicating-master --read-only-master --only-slave-move db1109.eqiad.wmnet db1104.eqiad.wmnet
  • Disable puppet @db1104 and @db1109: cumin 'db110[4,9].eqiad.wmnet' 'disable-puppet "switchover to db1104 T239238"'
  • Merge gerrit puppet change to promote db1104: https://gerrit.wikimedia.org/r/c/operations/puppet/+/629707

Failover:

  • Start the failover: !log Starting s8 eqiad failover from db1109 to db1104 - T239238
  • Topology changes, move old master beneath new master: db-switchover --replicating-master --read-only-master db1109 db1104
  • Give weight to db1109 (old master): dbctl instance db1109 set-weight 300
  • Promote db1104 as new master: dbctl --scope eqiad section s8 set-master db1104 && dbctl config commit -m "Promote db1104 on s8 eqiad master T239238"
  • Restart puppet on old and new masters (for heartbeat): cumin 'db110[4,9].eqiad.wmnet' 'run-puppet-agent -e "switchover to db1104 T239238"'
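
The failover commands above can be collected into a dry-run script so that the order can be reviewed before anything is executed. This is only a sketch: the run helper, the DRY_RUN switch, and the OLD/NEW/TASK variables are hypothetical and not part of the checklist, while the commands themselves are copied from the steps above. Chaining the steps with && (stop at the first failure) is a choice made here, and the !log step is left out because it is an IRC message rather than a shell command.

```shell
# Hypothetical dry-run wrapper: prints each command unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        # "$*" flattens the arguments for display only; quoting is not preserved.
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

OLD=db1109      # old master
NEW=db1104      # new master
TASK=T239238

failover_steps() {
    # Commands taken from the Failover checklist, in the same order.
    run db-switchover --replicating-master --read-only-master "$OLD" "$NEW" &&
    run dbctl instance "$OLD" set-weight 300 &&
    run dbctl --scope eqiad section s8 set-master "$NEW" &&
    run dbctl config commit -m "Promote $NEW on s8 eqiad master $TASK" &&
    run cumin 'db110[4,9].eqiad.wmnet' "run-puppet-agent -e \"switchover to $NEW $TASK\""
}

plan=$(failover_steps)
echo "$plan"
```

With the default DRY_RUN=1 nothing is executed; the script only prints the five commands in order.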

Clean up tasks:

  • Set thread_pool_stall_limit on old+new masters:
mysql.py -h db1109 -e "set global thread_pool_stall_limit = 100"
mysql.py -h db1104 -e "set global thread_pool_stall_limit = 10"
  • Change events for query killer:
events_coredb_master.sql on the new master db1104
events_coredb_slave.sql on the new slave db1109
dbctl instance db1104 set-candidate-master --section s8 false
dbctl instance db1109 set-candidate-master --section s8 true
# There's nothing to commit
  • Check tendril and zarcillo were updated correctly

Change 629707 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/puppet@production] mariadb: Promote db1104 to s8 master

https://gerrit.wikimedia.org/r/629707

Change 629716 had a related patch set uploaded (by Kormat; owner: Kormat):
[operations/dns@master] wmnet: Update s8-master alias

https://gerrit.wikimedia.org/r/629716

dbctl instance db1109 set-candidate-master false

^ this should be true?

The rest looks good, I would only add:

  • Check tendril and zarcillo were updated correctly

Mentioned in SAL (#wikimedia-operations) [2020-09-29T09:51:36Z] <kormat@cumin1001> dbctl commit (dc=all): 'Set db1104 with weight 0 T239238', diff saved to https://phabricator.wikimedia.org/P12829 and previous config saved to /var/cache/conftool/dbconfig/20200929-095135-kormat.json

Change 629707 merged by Kormat:
[operations/puppet@production] mariadb: Promote db1104 to s8 master

https://gerrit.wikimedia.org/r/629707

Mentioned in SAL (#wikimedia-operations) [2020-09-29T10:05:57Z] <kormat> Starting s8 eqiad failover from db1109 to db1104 - T239238

Mentioned in SAL (#wikimedia-operations) [2020-09-29T10:07:24Z] <kormat@cumin1001> dbctl commit (dc=all): 'Promote db1104 on s8 eqiad master T239238', diff saved to https://phabricator.wikimedia.org/P12830 and previous config saved to /var/cache/conftool/dbconfig/20200929-100723-kormat.json

Change 629716 merged by Kormat:
[operations/dns@master] wmnet: Update s8-master alias

https://gerrit.wikimedia.org/r/629716