Upgrade plan, starting with codfw cluster. Feel free to edit as necessary to better reflect reality, lessons learned while performing initial beta cluster migration, etc.
25-27 May, after wmf.3 branch cut:
Update warmers on current 1.7 cluster
Record the output of curl search.svc.codfw.wmnet/_cluster/settings to be re-applied
Update plugins repository. Do not sync out to prod
Pull new plugins to beta cluster
Install elasticsearch 2.3.3 deb to beta cluster
Bring down beta cluster elasticsearch servers
Bring full cluster back up. All servers are master capable.
Re-apply transient cluster settings recorded above
At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.
Merge es2.x branch to master (CirrusSearch, Elastica, and vendor). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge
- Manual testing. Ensure daily browser tests still pass.
- TODO: Is it possible to do more testing with beta cluster this week?
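The record and re-apply steps for the transient cluster settings could be scripted roughly as follows. Transient settings do not survive a full cluster restart, which is why they have to be captured first. This is a minimal sketch, not the script actually used: port 9200 and plain HTTP are assumptions (the plan only gives the hostname), and the function names are hypothetical.

```python
import json
import urllib.request

# Assumption: plain HTTP on port 9200; the plan only names the host.
BASE_URL = "http://search.svc.codfw.wmnet:9200"


def build_reapply_body(recorded):
    """Return the PUT body that restores only the transient settings."""
    return {"transient": recorded.get("transient", {})}


def fetch_settings(base_url=BASE_URL):
    """Record the current cluster settings before the restart."""
    with urllib.request.urlopen(base_url + "/_cluster/settings") as resp:
        return json.loads(resp.read().decode("utf-8"))


def reapply_settings(recorded, base_url=BASE_URL):
    """Re-apply the recorded transient settings once the cluster is back up."""
    body = json.dumps(build_reapply_body(recorded)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

In use, `fetch_settings()` would run before the servers are brought down and its result saved to disk, and `reapply_settings(recorded)` would run after the full cluster is back up.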
Monday, May 30:
Prepare and merge patch disabling completion suggester index rebuilds (not needed: the elastic version check will prevent the script from running)
Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster: https://gerrit.wikimedia.org/r/291257
- SWAT out the config patch sending wmf.4 searches to codfw during the Monday evening (SF) SWAT
Pull plugins repository on terbium and sync out to all elasticsearch servers
- eqiad systems *must not* be restarted after this has been done
Copy ttmserver index from eqiad to codfw (16 to 20 min with adhoc bash script on terbium)
Install elasticsearch 2.3.3 deb to all elasticsearch servers in codfw
Bring full codfw cluster down
Start all codfw master nodes
- Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
After a master has been decided bring up all codfw data nodes
- Ensure codfw cluster is green and that writes are draining from the job queue into the cluster
- Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane
- Be available during train rollout
- Like Tuesday, but for non-Wikipedia sites
- Be available during train deploy for any potential issues.
- Verify via grafana that all search traffic is on codfw.
- Prepare and merge patch enabling completion suggester index rebuilds
- Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
- Follow same process to upgrade eqiad cluster as was used for codfw
- Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster
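The "ensure the cluster is green" checks for codfw and eqiad can be sketched as a small polling loop over the `_cluster/health` API, whose response carries a `status` field of `red`, `yellow`, or `green`. This is only an illustration: `fetch_health` is a hypothetical caller-supplied function returning the parsed JSON of `GET /_cluster/health`, and the attempt count and delay are arbitrary defaults.

```python
import time


def is_green(health):
    """True when a _cluster/health response reports green status."""
    return health.get("status") == "green"


def wait_for_green(fetch_health, attempts=60, delay=10.0, sleep=time.sleep):
    """Poll until the cluster is green.

    Returns True as soon as a green status is seen, or False if the
    status is still not green after `attempts` polls.
    """
    for attempt in range(attempts):
        if is_green(fetch_health()):
            return True
        # Only sleep when another poll will follow.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

This only covers the health half of the check; whether writes are draining from the job queue still has to be verified separately.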
Alternative ideas rejected:
- Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.
- If the 1.7 code is talking to a 2.x cluster (index updates), any exception, such as the routine DocumentMissingException, will cause an "Array to string conversion" notice. These notices are acceptable.