We want to reduce the number of manual steps/commits to execute during the switchover.
Referring to the steps described at https://wikitech.wikimedia.org/wiki/Switch_Datacenter, I have listed the individual steps and TODOs needed to reach a good level of automation; in particular, to be able to switch over without code commits.
A program with more intelligence can be built on top of this list, or even just a bunch of scripts that can be executed in sequence / in parallel.
Most of the steps here depend on T156100 and T156924 being done. Once they are, quite a few commits will still be necessary to ensure every piece of software is properly configured to use the appropriate systems.
Phase 1: preparation
Stop the jobqueues in $dc_from
cumin 'R:Class = Role::Mediawiki::Jobrunner and *.$dc_from.wmnet' 'puppet agent --disable "dc_switchover"' 'service jobrunner stop' 'service jobchron stop'
Stop all jobs running on the maintenance host
Depends on: T156100.
- Make puppet decide the stopped/started status of the jobs based on a value in etcd (discovery/mediawiki-maintenance ?).
- `cumin --backend direct $(dig +short mediawiki-maintenance.discovery.wmnet) 'puppet agent -av' 'reboot'`
- This requires noc.wikimedia.org to be served from both wasat AND terbium, or we will cause downtime of noc.wikimedia.org.
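The Phase 1 steps above can be driven by a small wrapper that composes the cumin invocations from $dc_from. A minimal sketch (the query string mirrors the command above; everything else here is an assumption, not an existing tool):

```python
# Hypothetical helper that builds the cumin argument list for stopping
# the jobrunners in $dc_from, as described in Phase 1 above.

def jobrunner_stop_command(dc_from):
    """Return the cumin argv that stops jobrunner/jobchron in dc_from."""
    query = "R:Class = Role::Mediawiki::Jobrunner and *.{dc}.wmnet".format(dc=dc_from)
    return [
        "cumin",
        query,
        'puppet agent --disable "dc_switchover"',
        "service jobrunner stop",
        "service jobchron stop",
    ]

print(jobrunner_stop_command("eqiad")[1])
```

Building the argv in one place means the same code path can later be reused for the symmetric "start jobqueues in $dc_to" step of Phase 8.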
Phase 2: read-only mode
Set all shards to read-only in $dc_from in mediawiki-config
Depends on: T156924
- Can be made so that if a variable in etcd (e.g. wmfconfig/dbReadOnly[$dc_from]) is set to true, all shards become read-only in mediawiki-config
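The proposed etcd-driven logic is simple enough to sketch: every shard's read-only flag is derived from a single key. The key name (dbReadOnly) comes from the bullet above; the shard list and config shape are assumptions for illustration:

```python
# Sketch of the proposed mediawiki-config logic: all shards become
# read-only iff the etcd-backed dbReadOnly flag for the datacenter is set.

def read_only_shards(etcd_config, dc, shards):
    """Return {shard: bool} derived from a single dbReadOnly[dc] value."""
    ro = etcd_config.get("dbReadOnly", {}).get(dc, False)
    return {shard: ro for shard in shards}

flags = read_only_shards({"dbReadOnly": {"eqiad": True}}, "eqiad", ["s1", "s2", "x1"])
print(flags["s1"])
```

Because a single key controls all shards, Phase 7 (undo read-only) is just the same mechanism with the value flipped to false, with no code commit in either direction.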
Phase 3: lock down database masters, cache wipes
Set the active site's database masters in read-only mode, except the parsercache ones (which are dual masters), the standalone es1 servers (which are always read-only), and the misc/labs servers (for now, as they are independent from MediaWiki and do not yet have clients in codfw).
- sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.$dc_from.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"'
- sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.$dc_from.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SET GLOBAL read_only=1"'
Wipe the new site's memcached to prevent stale values, but only once the read-only masters/slaves in $dc_to have caught up
- TODO: check lag on all masters in $dc_to from the script
- cumin 'R:Class = Role::Memcached and *.$dc_to.wmnet' 'systemctl restart memcached.service'
- cumin -b 30 -s 5 'R:Class = Role::Mediawiki::Webserver and *.$dc_to.wmnet' 'service hhvm restart'
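The lag-check TODO above could be implemented by running "SHOW SLAVE STATUS\G" (via cumin) on every $dc_to master and refusing to proceed until all of them have caught up. A sketch of the decision logic on that output; the field name Seconds_Behind_Master is standard MariaDB output, the rest is an assumption:

```python
# Decide whether it is safe to wipe $dc_to caches: every replica must
# report Seconds_Behind_Master at or below the threshold, and a NULL
# value (replication broken or stopped) always blocks the switchover.

def all_caught_up(slave_status_outputs, max_lag=0):
    """True iff every host's SHOW SLAVE STATUS output shows lag <= max_lag."""
    for output in slave_status_outputs:
        for line in output.splitlines():
            if "Seconds_Behind_Master" in line:
                value = line.split(":", 1)[1].strip()
                if value == "NULL" or int(value) > max_lag:
                    return False
    return True

print(all_caught_up(["Seconds_Behind_Master: 0", "Seconds_Behind_Master: 3"]))
```

Treating NULL as "not caught up" is deliberate: a broken replication thread must never be mistaken for zero lag.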
Warm up memcached and APC
(on wasat) launch the warmup scripts:
- nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.$dc_to.wmnet
- nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appservers.svc.$dc_to.wmnet
- nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api.svc.$dc_to.wmnet
Phase 4: switch active datacenter configuration
Depends on: T156924, T156100
Switch the datacenter in puppet
- confctl --object-type discovery select 'dnsdisc=(appservers|imagescaler|api)-rw,name=$dc_to' set/pooled=true
- confctl --object-type discovery select 'dnsdisc=(appservers|imagescaler|api)-rw,name=$dc_from' set/pooled=false
Switch the datacenter in mediawiki-config
- confctl --object-type wmfconfig select 'name=wmfMasterDatacenter' set/value=$dc_to
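The ordering of the two discovery commands above matters: $dc_to must be pooled before $dc_from is depooled, so the -rw discovery records never point at zero datacenters. A sketch that makes the ordering explicit by generating both command strings (selectors copied from the plan; the helper itself is hypothetical):

```python
# Build the confctl discovery flip in the safe order: pool the new
# datacenter first, then depool the old one, so appservers/imagescaler/api
# -rw records are never left with no pooled backend.

def discovery_flip_commands(dc_from, dc_to):
    services = "(appservers|imagescaler|api)-rw"
    return [
        "confctl --object-type discovery select "
        "'dnsdisc={s},name={dc}' set/pooled=true".format(s=services, dc=dc_to),
        "confctl --object-type discovery select "
        "'dnsdisc={s},name={dc}' set/pooled=false".format(s=services, dc=dc_from),
    ]

for cmd in discovery_flip_commands("eqiad", "codfw"):
    print(cmd)
```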
Phase 5: apply configuration
Depends on: T156100
Redis replication
- TODO: write a program (using cumin as a library, maybe) to check and reverse the redis replica.
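The replica-reversal program suggested above boils down to: for each master/replica pair, promote the $dc_to replica, then re-point the old $dc_from master at it. A sketch that only builds the ordered list of redis commands to execute (host names and the pairing are hypothetical; SLAVEOF is the standard redis command):

```python
# Sketch of the redis replication reversal: for each (old_master,
# old_replica) pair, first promote the replica with SLAVEOF NO ONE, then
# re-point the old master at it. This builds the plan; actually running
# it (via cumin or a redis client) is left to the real program.

def reverse_replication_plan(pairs, port=6379):
    """pairs: [(old_master, old_replica)] -> ordered (host, command) steps."""
    steps = []
    for old_master, old_replica in pairs:
        steps.append((old_replica, "SLAVEOF NO ONE"))  # promote first
        steps.append((old_master, "SLAVEOF {h} {p}".format(h=old_replica, p=port)))
    return steps

plan = reverse_replication_plan([("rdb1001.eqiad.wmnet", "rdb2001.codfw.wmnet")])
print(plan[0])
```

Separating plan generation from execution makes the check-and-reverse step easy to dry-run and to verify against the current replication topology before touching anything.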
Restbase
- Automagically changed when we switch puppet
Services
- Automagically changed when we switch puppet
Parsoid
- Automagically changed when we switch puppet
- TODO: prepare commit
Switch Varnish backend to appservers.svc.$dc_to.wmnet/api.svc.$dc_to.wmnet
- I think switching the route to 'local' and using appservers-rw.discovery.wmnet should work. TODO: check with traffic
Point Swift imagescalers to the active MediaWiki
- If this uses imagescaler-rw.discovery.wmnet it should be automagic; TODO: check whether DNS gets cached
Phase 6: database master swap
Database master swap for every core (s1-7), External Storage (es2-3, not es1) and extra (x1) database
- sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.$dc_to.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SELECT @@global.read_only"'
- sudo cumin 'R:Class = role::mariadb::groups and R:Class%mysql_group = core and R:Class%mysql_role = master and *.$dc_to.wmnet' 'mysql --skip-ssl --batch --skip-column-names -e "SET GLOBAL read_only=0"'
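The SELECT/SET pair above implies a safety check: before issuing SET GLOBAL read_only=0, verify that every $dc_to master still reports read_only=1 (i.e. nothing flipped it early). A sketch of that check on the cumin-style output lines:

```python
# Guard for the master swap: only enable writes if every $dc_to master's
# "SELECT @@global.read_only" output is exactly "1". Any other value
# means the switchover state is inconsistent and must be investigated.

def safe_to_enable_writes(read_only_values):
    """read_only_values: one '@@global.read_only' string per master."""
    return all(v.strip() == "1" for v in read_only_values)

print(safe_to_enable_writes(["1", "1", "1"]))
```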
Phase 7: Undo read-only
Depends on: T156924
- Can be made so that if a variable in etcd (e.g. wmfconfig/dbReadOnly[$dc_from]) is set to false, all shards become read-write in mediawiki-config
Phase 8: post-read-only
Depends on: T156100
Start the jobqueues in the new site
- Make puppet decide the stopped/started status of the jobs based on a value in etcd (discovery/mediawiki-maintenance ?).
- cumin 'R:Class = Role::Mediawiki::Jobrunner and *.$dc_to.wmnet' 'puppet agent -ov --no-daemonize --no-splay'
Start the cron jobs on the maintenance host in $dc_to
- Make puppet decide the stopped/started status of the jobs based on a value in etcd (discovery/mediawiki-maintenance ?).
- cumin --backend direct $(dig +short mediawiki-maintenance.discovery.wmnet) 'puppet agent -ov --no-daemonize --no-splay'
Re-enable puppet on all eqiad and codfw database masters
- cumin '(R:Salt::Grain = mysql_role and R:Salt::Grain%value = master) or pc[1-2]*' 'puppet agent --enable'
Run the script to fix broken wikidata entities on the maintenance host of the active datacenter
- sudo -u www-data mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildEntityPerPage.php --wiki=wikidatawiki --force