Upgrade plan, starting with codfw cluster. Feel free to edit as necessary to better reflect reality, lessons learned while performing initial beta cluster migration, etc.
25-27 May, after wmf.3 branch cut:
# Update warmers on current 1.7 cluster
# Record the output of `curl search.svc.codfw.wmnet/_cluster/settings` to be re-applied
# ~~Update plugins repository. Do not sync out to prod~~
# ~~Pull new plugins to beta cluster~~
# ~~Install elasticsearch 2.3.3 deb to beta cluster~~
# ~~Bring down beta cluster elasticsearch servers~~
# ~~Bring full cluster back up. All servers are master capable.~~
# Re-apply transient cluster settings recorded above
# ~~At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.~~
# ~~Merge es2.x branch to master (CirrusSearch, Elastica, and vendor). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge~~
# Manual testing. Ensure daily browser tests still pass.
# TODO: Is it possible to do more testing with beta cluster this week?
Monday, May 30:
# ~~Prepare and merge patch disabling completion suggester index rebuilds~~ (not needed elastic version check will prevent the script from running)
# ~~Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster. ~~ https://gerrit.wikimedia.org/r/291257
# Swat out config patch sending wmf.4 searches to codfw with monday evening (SF) SWAT
# Pull plugins repository on terbium and sync out to all elasticsearch servers
## eqiad systems *must*not* be restarted after this has been done
# Copy ttmserver index from eqiad to codfw (16 to 20 min with adhoc bash script on terbium)
# Install elasticsearch 2.3.3 deb to all elasticsearch servers in codfw
# Bring full codfw cluster down
# Start all codfw master nodes
## Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
# After a master has been decided bring up all codfw data nodes
# Ensure codfw cluster is green and that writes are draining from the job queue into the cluster
Tues:
* Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane
Wed:
* Be available during train rollout
* Like tues, but non-wikipedia sites
Thurs:
# Be available during train deploy for any potential issues.
# Verify via grafana that all search traffic is on codfw.
# Prepare and merge patch enabling completion suggester index rebuilds
Fri/Mon:
# Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
# Follow same process to upgrade eqiad cluster as was used for codfw
# Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster
Alternative ideas rejected:
* Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.
Expected errors:
* If the 1.7 code is talking to a 2.x cluster (index updates), any exception such as the normal DocumentMissingException will cause an `Array to string conversion` notice. These are acceptable.