Right now we don't have a good way to upgrade to Elasticsearch 2.3 without turning user-facing search features off for a little while. Breaking search for our users is not acceptable. We need a better way! This task is for us to find one, so that we can do the upgrade without breaking search for our users.
Upgrade plan, starting with codfw cluster. Consider this a first draft to be discussed at offsite:
25-27 May, after wmf.3 branch cut:
# Update plugins repository. Do not sync out to prod
# Pull new plugins to beta cluster
# Install elasticsearch 2.3.2 deb to beta cluster
# Bring down beta cluster elasticsearch servers
# Bring full cluster back up. All servers are master capable.
# At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.
# Merge es2.x branch to master (CirrusSearch and Elastica). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge
# Manual testing. Ensure daily browser tests still pass.
# TODO: Is it possible to do more testing with beta cluster this week?
Monday, May 30:
# Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster.
# Swat out config patch sending wmf.4 searches to codfw with monday evening (SF) SWAT
# Pull plugins repository on terbium and sync out to all elasticsearch servers
## eqiad systems *must*not* be restarted after this has been done
# Install elasticsearch 2.3.2 deb to all elasticsearch servers in codfw
# Bring full codfw cluster down
# Start all codfw master nodes
## Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
# After a master has been decided bring up all codfw data nodes
# Ensure codfw cluster is green and that writes are draining from the job queue into the cluster
Tues:
* Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane
Wed:
* Be available during train rollout
* Like tues, but non-wikipedia sites
Thurs:
# Be available during train deploy for any potential issues.
# Verify via grafana that all search traffic is on codfw.
Fri/Mon:
# Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
# Install elasticsearch 2.3.2 deb to all elasticsearch servers in eqiad
# Bring full eqiad cluster down
# Start all eqiad master nodes
# After a master has been decided bring up all eqiad data nodes
# Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster
Alternative ideas rejected:
* Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.
Expected errors:
* If the 1.7 code is talking to a 2.x cluster (index updates), any exception such as the normal DocumentMissingException will cause an `Array to string conversion` notice. These are acceptable.