MediaWiki Datacenter Switchover automation
Closed, Resolved (Public)

Description

We want to reduce the number of manual steps/commits to execute during the switchover.

Referring to the steps described in https://wikitech.wikimedia.org/wiki/Switch_Datacenter, I listed the individual steps and TODOs needed to reach a good level of automation, in particular to be able to switch over without code commits.

A program with more intelligence can be built on top of this list, or even just a bunch of scripts that can be executed in sequence / in parallel.

Most of the steps here depend on T156100 and T156924 being done. After that, quite a few commits will be necessary to ensure every piece of software is properly configured to use the appropriate systems.
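As a rough illustration of the "scripts executed in sequence" idea, here is a minimal Python sketch of a step runner with a dry-run switch. The `Step` and `run_steps` names are hypothetical and not taken from any existing tool; real steps would wrap the commands listed in the phases below.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Step:
    """One switchover step: a label plus a callable that performs it."""
    name: str
    action: Callable[[], None]


def run_steps(steps: Iterable[Step], dry_run: bool = True,
              log: Callable[[str], None] = print) -> List[str]:
    """Walk the steps in order and return the names that were processed.

    In dry-run mode nothing is executed, only logged. In real mode an
    exception raised by a step propagates and stops the sequence, so
    later phases never run on top of a failed one.
    """
    processed = []
    for step in steps:
        if dry_run:
            log(f"DRY-RUN: would run {step.name}")
        else:
            log(f"START: {step.name}")
            step.action()
            log(f"DONE: {step.name}")
        processed.append(step.name)
    return processed
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: the destructive path has to be requested explicitly.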

Phase 1: preparation

Stop the jobqueues in $dc_from

cumin 'R:Class = Role::Mediawiki::Jobrunner and *$dc_from.wmnet' 'puppet agent --disable "dc_switchover"' 'service jobrunner stop' 'service jobchron stop'
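To run this from a script rather than by hand, the `$dc_from` placeholder has to be filled in safely. A minimal sketch, assuming only the two sites named in this task (eqiad and codfw); the function name is made up, and the cumin query and commands are copied verbatim from the line above:

```python
import shlex

KNOWN_DATACENTERS = {"eqiad", "codfw"}  # the two sites named in this task


def stop_jobqueues_argv(dc_from: str) -> list:
    """Build the argument vector for the 'stop the jobqueues' cumin call."""
    if dc_from not in KNOWN_DATACENTERS:
        # Refuse unexpanded placeholders or typos instead of matching no hosts.
        raise ValueError(f"unknown datacenter: {dc_from!r}")
    query = f"R:Class = Role::Mediawiki::Jobrunner and *{dc_from}.wmnet"
    return [
        "cumin",
        query,
        'puppet agent --disable "dc_switchover"',
        "service jobrunner stop",
        "service jobchron stop",
    ]


if __name__ == "__main__":
    # Print the shell-quoted command for review; nothing is executed here.
    print(shlex.join(stop_jobqueues_argv("eqiad")))
```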

Stop all jobs running on the maintenance host
Depends on: T156100.

  1. Make puppet decide the stopped/started status of jobs based on a value in etcd (discovery/mediawiki-maintenance ?).
  2. cumin --backend direct $(dig +short mediawiki-maintenance.discovery.wmnet) 'puppet agent -av' 'reboot'
    1. This requires noc.wikimedia.org to be served from both wasat and terbium; otherwise we will cause downtime for noc.wikimedia.org.

Phase 2: read-only mode

Set all shards to ro in $dc_from in mediawiki-config
Depends on: T156924

  1. This can be arranged so that if a variable in etcd (e.g. wmfconfig/dbReadOnly[$dc_from]) is set to true, all shards become read-only in mediawiki-config
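The decision itself is tiny. The sketch below shows it in Python purely for illustration (mediawiki-config is PHP), and the layout of the etcd value as a per-datacenter mapping is an assumption, not something that exists yet:

```python
def shards_read_only(etcd_values: dict, datacenter: str) -> bool:
    """Return True if every shard in `datacenter` should be read-only.

    Assumed layout: {"dbReadOnly": {"eqiad": True, "codfw": False}}.
    A missing key, or any value other than a real boolean True, is
    treated as read-write so that an empty or malformed etcd entry
    does not lock the site by accident.
    """
    flags = etcd_values.get("dbReadOnly", {})
    return flags.get(datacenter, False) is True
```

Whether "missing means read-write" is the right default is debatable; failing closed (read-only) would be the safer choice during a switchover window.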

Phase 3: lock down database masters, cache wipes

Set the active site's database masters to read-only mode, except for the parsercache ones (which are dual masters), the standalone es1 servers (which are always read-only), and the misc/labs servers (for now, as they are independent from MediaWiki and do not yet have clients in codfw).

  1. sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.eqiad.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"'
  2. sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.eqiad.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SET GLOBAL read_only=1"'
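Running the SELECT again after the SET gives a cheap verification step. A minimal sketch of that check, assuming the per-host output has already been collected into a dict (how it is pulled out of cumin is not shown here, and the function name is hypothetical):

```python
def masters_not_in_state(outputs: dict, expected_read_only: bool) -> list:
    """Return the hosts whose @@global.read_only differs from the expected value.

    `outputs` maps host name -> raw stdout of the SELECT, which is "0" or "1"
    when mysql runs with --batch --skip-column-names. Anything else (empty
    output, an error message) counts as a mismatch.
    """
    expected = "1" if expected_read_only else "0"
    return sorted(host for host, out in outputs.items() if out.strip() != expected)
```

An empty return value means it is safe to proceed; a non-empty one lists exactly the masters to look at.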

Wipe new site's memcached to prevent stale values — only once the $dc_to read-only master/slaves have caught up

  1. TODO: check lag on all masters in $dc_to from the script
  2. cumin 'R:Class = Role::Memcached and *$dc_to.wmnet' 'systemctl restart memcached.service'
  3. cumin -b 30 -s 5 'R:Class = Role::Mediawiki::Webserver and *.$dc_to.wmnet' 'service hhvm restart'
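For the lag-check TODO, one possible shape is a bounded polling loop that only lets the memcached wipe go ahead once every master in $dc_to reports no lag. This is a sketch under assumptions: `get_lag` is a hypothetical callable that returns the lag in seconds for a host, or `None` if it could not be determined.

```python
import time
from typing import Callable, Iterable, Optional


def wait_for_replication(get_lag: Callable[[str], Optional[float]],
                         hosts: Iterable[str],
                         max_lag: float = 0.0,
                         attempts: int = 10,
                         delay: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until all hosts have lag <= max_lag; return False if they never do."""
    hosts = list(hosts)
    for attempt in range(attempts):
        lagging = []
        for host in hosts:
            lag = get_lag(host)
            # Unknown lag is treated as "not caught up": better to wait
            # than to wipe the caches against stale replicas.
            if lag is None or lag > max_lag:
                lagging.append(host)
        if not lagging:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

All hosts are re-checked on every attempt rather than only the ones that were lagging, which costs a few extra queries but avoids trusting a stale "caught up" reading.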

Warm up memcached and APC
(on wasat) launch the warmup scripts:

  1. nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet
  2. nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appservers.svc.codfw.wmnet
  3. nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api.svc.codfw.wmnet

Phase 4: switch active datacenter configuration

Depends on: T156924, T156100
Switch the datacenter in puppet

  1. confctl --object-type discovery select 'dnsdisc=(appservers|imagescaler|api)-rw,name=$dc_to' set/pooled=true
  2. confctl --object-type discovery select 'dnsdisc=(appservers|imagescaler|api)-rw,name=$dc_from' set/pooled=false

Switch the datacenter in mediawiki-config

  1. confctl --object-type wmfconfig select 'name=wmfMasterDatacenter' set/value=$dc_to
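Note that as written the selectors are single-quoted, so a shell would not expand `$dc_to` / `$dc_from`; they are placeholders. Building the argument lists in a script avoids the quoting question. A minimal sketch (function name hypothetical, confctl arguments copied from the steps above) that keeps the same order, pooling the new site before depooling the old one:

```python
def switch_commands(dc_from: str, dc_to: str) -> list:
    """Return the confctl invocations for Phase 4, in execution order."""
    if dc_from == dc_to:
        raise ValueError("dc_from and dc_to must differ")
    records = "dnsdisc=(appservers|imagescaler|api)-rw"
    return [
        # 1. Pool the destination first, so the records always have a target.
        ["confctl", "--object-type", "discovery", "select",
         f"{records},name={dc_to}", "set/pooled=true"],
        # 2. Then depool the source.
        ["confctl", "--object-type", "discovery", "select",
         f"{records},name={dc_from}", "set/pooled=false"],
        # 3. Finally tell mediawiki-config which site is the master.
        ["confctl", "--object-type", "wmfconfig", "select",
         "name=wmfMasterDatacenter", f"set/value={dc_to}"],
    ]
```

Each inner list can be passed as-is to `subprocess.run(..., check=True)`, so no shell is involved at all.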

Phase 5: apply configuration

Depends on: T156100
Redis replication

  • TODO: write a program (using cumin as a library, maybe) to check and reverse the redis replica.

Restbase

  • Automagically changed when we switch puppet

Services

  • Automagically changed when we switch puppet

Parsoid

  1. Automagically changed when we switch puppet
  2. TODO: prepare commit

Switch Varnish backend to appservers.svc.$dc_to.wmnet/api.svc.$dc_to.wmnet

  • I think switching the route to 'local' and using appservers-rw.discovery.wmnet should work. TODO: check with traffic

Point Swift imagescalers to the active MediaWiki

  • If this uses imagescaler-rw.discovery.wmnet it should be automagic; TODO: check if dns gets cached
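For the DNS caching TODO, a simple check run on the consuming host can tell whether the discovery record already resolves to the new site as seen through that host's resolver (and therefore through whatever caching sits in front of it). A sketch, where the function name and the idea of comparing against a known list of addresses are assumptions:

```python
import socket
from typing import Callable, Iterable


def record_points_to(record: str, expected_ips: Iterable[str],
                     resolve: Callable = socket.gethostbyname_ex) -> bool:
    """Return True if `record` resolves exactly to `expected_ips` on this host.

    `resolve` defaults to the system resolver; it is injectable so the
    check can be exercised without real DNS.
    """
    _name, _aliases, addresses = resolve(record)
    return set(addresses) == set(expected_ips)
```

Polling this after the switch until it returns True would give a rough upper bound on how long the old answer stays cached.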

Phase 6: database master swap

Database master swap for every core (s1-7), External Storage (es2-3, not es1) and extra (x1) database

  1. sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.codfw.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"'
  2. sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.codfw.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SET GLOBAL read_only=0"'

Phase 7: Undo read-only

Depends on: T156924

  1. This can be arranged so that if a variable in etcd (e.g. wmfconfig/dbReadOnly[$dc_from]) is set to false, all shards become read-write in mediawiki-config

Phase 8: post-read-only

Depends on: T156100

Start the jobqueues in the new site

  1. Make puppet decide the stopped/started status of jobs based on a value in etcd (discovery/mediawiki-maintenance ?).
  2. cumin 'R:Class = Role::Mediawiki::Jobrunner and *$dc_to.wmnet' 'puppet agent -ov --no-daemonize --no-splay'

Start the cron jobs on the maintenance host in codfw

  1. Make puppet decide the stopped/started status of jobs based on a value in etcd (discovery/mediawiki-maintenance ?).
  2. cumin --backend direct $(dig +short mediawiki-maintenance.discovery.wmnet) 'puppet agent -ov --no-daemonize --no-splay'

Re-enable puppet on all eqiad and codfw database masters

  • cumin '(R:Salt::Grain = mysql_role and R:Salt::Grain%value = master) or pc[1-2]*' 'puppet agent --enable'

Run the script to fix broken wikidata entities on the maintenance host of the active datacenter

  • sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force

Details

Project | Branch | Lines +/-
operations/puppet | production | +6 -6
operations/puppet | production | +1 -1
operations/switchdc | master | +1 -1
operations/switchdc | master | +1 -1
operations/switchdc | master | +3 -3
operations/mediawiki-config | master | +14 -14
operations/switchdc | master | +6 -5
operations/switchdc | master | +26 -14
operations/switchdc | master | +14 -14
operations/puppet | production | +10 -9
operations/puppet | production | +1 -0
operations/switchdc | master | +6 -3
operations/switchdc | master | +2 -2
operations/switchdc | master | +7 -7
operations/switchdc | master | +3 -2
operations/switchdc | master | +6 -8
operations/switchdc | master | +3 -1
operations/switchdc | master | +9 -5
operations/switchdc | master | +3 -15
operations/switchdc | master | +1 -3
operations/switchdc | master | +13 -2
operations/switchdc | master | +8 -11
operations/switchdc | master | +2 -1
operations/switchdc | master | +4 -2
operations/switchdc | master | +5 -6
operations/switchdc | master | +46 -39
operations/switchdc | master | +74 -16
operations/switchdc | master | +37 -57
operations/switchdc | master | +6 -0
operations/switchdc | master | +116 -44
operations/switchdc | master | +1 -1
operations/switchdc | master | +1 -1
operations/puppet | production | +135 -4
operations/puppet | production | +1 -2
operations/switchdc | master | +1 -1
operations/puppet | production | +1 -0
operations/mediawiki-config | master | +8 -8
operations/switchdc | master | +1 -1
operations/switchdc | master | +8 -6
operations/switchdc | master | +1 -1
operations/switchdc | master | +2 -1
operations/switchdc | master | +4 -4
operations/switchdc | master | +28 -3
operations/switchdc | master | +0 -0
operations/switchdc | master | +17 -15
operations/switchdc | master | +83 -89
operations/switchdc | master | +28 -69
operations/switchdc | master | +9 -16
operations/switchdc | master | +34 -35
operations/switchdc | master | +8 -4
operations/switchdc | master | +18 -16
operations/switchdc | master | +7 -19
operations/switchdc | master | +2 -2
operations/switchdc | master | +15 -16
operations/switchdc | master | +15 -13
operations/switchdc | master | +14 -2
operations/switchdc | master | +3 -4
operations/switchdc | master | +66 -31
operations/switchdc | master | +114 -31
operations/switchdc | master | +26 -14
operations/switchdc | master | +12 -0
operations/switchdc | master | +247 -11
operations/switchdc | master | +117 -0
operations/switchdc | master | +38 -1
operations/switchdc | master | +41 -0
operations/switchdc | master | +1 K -2
operations/switchdc | master | +369 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 345298 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix database selection in tendril task

https://gerrit.wikimedia.org/r/345298

Change 345298 merged by Volans:
[operations/switchdc@master] Fix database selection in tendril task

https://gerrit.wikimedia.org/r/345298

Change 345304 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix database selection for mysql lib

https://gerrit.wikimedia.org/r/345304

Change 345304 merged by Volans:
[operations/switchdc@master] Fix database selection for mysql lib

https://gerrit.wikimedia.org/r/345304

Change 345308 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix query for tendril

https://gerrit.wikimedia.org/r/345308

Change 345308 merged by Volans:
[operations/switchdc@master] Fix query for tendril

https://gerrit.wikimedia.org/r/345308

Change 345349 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Avoid naming clash with conftool

https://gerrit.wikimedia.org/r/345349

Change 345349 merged by Volans:
[operations/switchdc@master] Avoid naming clash with conftool

https://gerrit.wikimedia.org/r/345349

Change 345355 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix import path for mediawiki

https://gerrit.wikimedia.org/r/345355

Change 345355 merged by Volans:
[operations/switchdc@master] Fix import path for mediawiki

https://gerrit.wikimedia.org/r/345355

Change 343859 merged by jenkins-bot:
[operations/mediawiki-config@master] Uniform maintenance message and indentation

https://gerrit.wikimedia.org/r/343859

Change 345656 had a related patch set uploaded (by Volans):
[operations/puppet@production] Swift: use discovery record for the imagescalers

https://gerrit.wikimedia.org/r/345656

Change 345656 merged by Volans:
[operations/puppet@production] Swift: use discovery record for the imagescalers in codfw

https://gerrit.wikimedia.org/r/345656

Mentioned in SAL (#wikimedia-operations) [2017-03-31T14:55:08Z] <volans> deploying the use of discovery URL to swift-proxy hosts in codfw T160178#3136906

Change 345860 had a related patch set uploaded (by Volans):
[operations/puppet@production] Swift-proxy: use discovery everywhere for rewrites

https://gerrit.wikimedia.org/r/345860

Change 345868 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix typo in discovery name

https://gerrit.wikimedia.org/r/345868

Change 345868 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Fix typo in discovery name

https://gerrit.wikimedia.org/r/345868

Change 345860 merged by Volans:
[operations/puppet@production] Swift-proxy: use discovery everywhere for rewrites

https://gerrit.wikimedia.org/r/345860

Change 346279 had a related patch set uploaded (by Volans):
[operations/puppet@production] Switchdc: add profile to install and configure it

https://gerrit.wikimedia.org/r/346279

Change 346279 merged by Volans:
[operations/puppet@production] Switchdc: add profile to install and configure it

https://gerrit.wikimedia.org/r/346279

Change 346332 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix typo for dict access

https://gerrit.wikimedia.org/r/346332

Change 346332 merged by Volans:
[operations/switchdc@master] Fix typo for dict access

https://gerrit.wikimedia.org/r/346332

Change 346549 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Scap: use --force to skip canaries checks

https://gerrit.wikimedia.org/r/346549

Change 346549 merged by Volans:
[operations/switchdc@master] Scap: use --force to skip canaries checks

https://gerrit.wikimedia.org/r/346549

Change 346968 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Add dry-run mode and uses it

https://gerrit.wikimedia.org/r/346968

Change 346968 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Add dry-run mode and uses it

https://gerrit.wikimedia.org/r/346968

Change 346999 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Dry-run: do not notify IRC/SAL

https://gerrit.wikimedia.org/r/346999

Change 346999 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Dry-run: do not notify IRC/SAL

https://gerrit.wikimedia.org/r/346999

Change 347180 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Logging: refactored and standardized

https://gerrit.wikimedia.org/r/347180

Change 347369 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Logging: use the new handler, remove log_dry_run

https://gerrit.wikimedia.org/r/347369

Change 347370 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Logging: uniformed log levels and messages

https://gerrit.wikimedia.org/r/347370

Change 347371 had a related patch set uploaded (by Volans):
[operations/switchdc@master] MySQL: better dry-run handling

https://gerrit.wikimedia.org/r/347371

Change 347372 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Tendril: fix an error in the exception raising

https://gerrit.wikimedia.org/r/347372

Change 347373 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Disable puppet: fix title and docstring

https://gerrit.wikimedia.org/r/347373

Change 347374 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Formatting improvements

https://gerrit.wikimedia.org/r/347374

Change 347375 had a related patch set uploaded (by Volans):
[operations/switchdc@master] MediaWiki: announce explicitly the read-only period

https://gerrit.wikimedia.org/r/347375

Change 347376 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Menu: avoid double failing message

https://gerrit.wikimedia.org/r/347376

Change 347180 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Logging: add multiple handlers to the logger

https://gerrit.wikimedia.org/r/347180

Change 347369 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Logging: use the new handler, remove log_dry_run

https://gerrit.wikimedia.org/r/347369

Change 347370 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Logging: uniformed log levels and messages

https://gerrit.wikimedia.org/r/347370

Change 347371 merged by Giuseppe Lavagetto:
[operations/switchdc@master] MySQL: better dry-run handling

https://gerrit.wikimedia.org/r/347371

Change 347372 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Tendril: fix an error in the exception raising

https://gerrit.wikimedia.org/r/347372

Change 347373 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Disable puppet: fix title and docstring

https://gerrit.wikimedia.org/r/347373

Change 347374 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Formatting improvements

https://gerrit.wikimedia.org/r/347374

Change 347375 merged by Giuseppe Lavagetto:
[operations/switchdc@master] MediaWiki: announce explicitly the read-only period

https://gerrit.wikimedia.org/r/347375

Change 347376 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Menu: avoid double failing message

https://gerrit.wikimedia.org/r/347376

Change 347390 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Logging: fix and simplify the stderr logging

https://gerrit.wikimedia.org/r/347390

Change 347390 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Logging: fix and simplify the stderr logging

https://gerrit.wikimedia.org/r/347390

Change 347393 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Fix logging/dry-run setup orders

https://gerrit.wikimedia.org/r/347393

Change 347393 merged by Volans:
[operations/switchdc@master] Fix logging/dry-run setup orders

https://gerrit.wikimedia.org/r/347393

Change 347399 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Logging: use NodeSet for more compact outputs

https://gerrit.wikimedia.org/r/347399

Change 347399 abandoned by Volans:
Logging: use NodeSet for more compact outputs

Reason:
already included in another CR

https://gerrit.wikimedia.org/r/347399

Change 347511 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Disable puppet: add videoscalers

https://gerrit.wikimedia.org/r/347511

Change 347534 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Logging: filter out all cumin's messages from stderr

https://gerrit.wikimedia.org/r/347534

Change 347534 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Logging: filter out all cumin's messages from stderr

https://gerrit.wikimedia.org/r/347534

Change 347511 abandoned by Volans:
Disable puppet: add videoscalers

Reason:
Already covered by https://gerrit.wikimedia.org/r/#/c/347568/

https://gerrit.wikimedia.org/r/347511

Change 347580 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Mediawiki: return the right value when checking config

https://gerrit.wikimedia.org/r/347580

Change 347588 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Mediawiki: explicitly use UTC for the date to print

https://gerrit.wikimedia.org/r/347588

Change 347580 merged by Volans:
[operations/switchdc@master] Mediawiki: return the right value when checking config

https://gerrit.wikimedia.org/r/347580

Change 347588 merged by Volans:
[operations/switchdc@master] Mediawiki: explicitly use UTC for the date to print

https://gerrit.wikimedia.org/r/347588

Change 347816 had a related patch set uploaded (by Volans):
[operations/puppet@production] Switchdc: rename redis stage from t05 to t06

https://gerrit.wikimedia.org/r/347816

Change 347828 had a related patch set uploaded (by Volans):
[operations/puppet@production] Traffic: format only, noop

https://gerrit.wikimedia.org/r/347828

Change 347844 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Traffic: exclude .wikimedia.org hosts (cp1008)

https://gerrit.wikimedia.org/r/347844

Change 347844 merged by Volans:
[operations/switchdc@master] Traffic: exclude .wikimedia.org hosts (cp1008)

https://gerrit.wikimedia.org/r/347844

Change 347828 merged by Volans:
[operations/puppet@production] cache: noop to test the switchdc procedures

https://gerrit.wikimedia.org/r/347828

Change 347816 merged by Volans:
[operations/puppet@production] Switchdc: rename redis stage from t05 to t06

https://gerrit.wikimedia.org/r/347816

Change 347868 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Change the RO message we match

https://gerrit.wikimedia.org/r/347868

Change 347869 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Move varnish puppet disabling in t00

https://gerrit.wikimedia.org/r/347869

Change 347992 had a related patch set uploaded (by Volans):
[operations/mediawiki-config@master] Use a generic retry for the read only message

https://gerrit.wikimedia.org/r/347992

Change 347868 merged by Volans:
[operations/switchdc@master] Change the RO message we match

https://gerrit.wikimedia.org/r/347868

Change 347869 merged by Volans:
[operations/switchdc@master] Move varnish puppet disabling in t00

https://gerrit.wikimedia.org/r/347869

Change 348055 had a related patch set uploaded (by Volans):
[operations/switchdc@master] Menu: don't allow to quit from submenu

https://gerrit.wikimedia.org/r/348055

Change 348055 merged by Volans:
[operations/switchdc@master] Menu: don't allow to quit from submenu

https://gerrit.wikimedia.org/r/348055

Change 347992 merged by jenkins-bot:
[operations/mediawiki-config@master] Use a generic retry for the read only message

https://gerrit.wikimedia.org/r/347992

Mentioned in SAL (#wikimedia-operations) [2017-04-13T12:22:25Z] <volans@tin> Synchronized wmf-config/db-codfw.php: Use a generic retry for the read only message T160178 (duration: 01m 54s)

Mentioned in SAL (#wikimedia-operations) [2017-04-13T12:34:31Z] <volans@tin> Synchronized wmf-config/db-eqiad.php: Use a generic retry for the read only message T160178 (duration: 00m 44s)

Mentioned in SAL (#wikimedia-operations) [2017-04-18T10:25:26Z] <volans> Final test of switchdc steps in the codfw->eqiad configuration, only idempotent changes, T160178

Change 350818 had a related patch set uploaded (by Volans; owner: Volans):
[operations/switchdc@master] Mediawiki: update role name for maintenance

https://gerrit.wikimedia.org/r/350818

Change 350818 merged by Volans:
[operations/switchdc@master] Mediawiki: update role name for maintenance

https://gerrit.wikimedia.org/r/350818

Change 351313 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] cache::text: switch all mediawiki to eqiad

https://gerrit.wikimedia.org/r/351313

Change 351315 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] discovery::app_routes: switch mediawiki to eqiad

https://gerrit.wikimedia.org/r/351315

Change 351335 had a related patch set uploaded (by Volans; owner: Volans):
[operations/switchdc@master] t04: fix title

https://gerrit.wikimedia.org/r/351335

Change 351335 merged by Volans:
[operations/switchdc@master] t04: fix title

https://gerrit.wikimedia.org/r/351335

Change 351346 had a related patch set uploaded (by Volans; owner: Volans):
[operations/switchdc@master] t05_switch_datacenter: fix typo in DNS checks

https://gerrit.wikimedia.org/r/351346

Change 351346 merged by Volans:
[operations/switchdc@master] t05_switch_datacenter: fix typo in DNS checks

https://gerrit.wikimedia.org/r/351346

Change 351315 merged by Giuseppe Lavagetto:
[operations/puppet@production] discovery::app_routes: switch mediawiki to eqiad

https://gerrit.wikimedia.org/r/351315

Change 351313 merged by Giuseppe Lavagetto:
[operations/puppet@production] cache::text: switch all mediawiki to eqiad

https://gerrit.wikimedia.org/r/351313

Volans claimed this task.

Resolving this after a successful MediaWiki switchover to codfw and switchback to eqiad using the automation software Switchdc (operations-switchdc on gerrit). The tracking task for improvements is T163363.