Change Details

Upgrade plan, starting with the codfw cluster. Feel free to edit as necessary to better reflect reality, lessons learned while performing the initial beta cluster migration, etc.

25-27 May, after wmf.3 branch cut:
# ~~Update warmers on current 1.7 cluster~~
# Record the output of `curl search.svc.codfw.wmnet/_cluster/settings` to be re-applied
# ~~Update plugins repository. Do not sync out to prod~~
# ~~Pull new plugins to beta cluster~~
# ~~Install elasticsearch 2.3.3 deb to beta cluster~~
# ~~Bring down beta cluster elasticsearch servers~~
# ~~Bring full cluster back up. All servers are master capable.~~
# Re-apply transient cluster settings recorded above
# ~~At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.~~
# ~~Merge es2.x branch to master (CirrusSearch, Elastica, and vendor). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge~~
# Manual testing. Ensure daily browser tests still pass.
# TODO: Is it possible to do more testing with beta cluster this week?

Monday, May 30:
# ~~Prepare and merge patch disabling completion suggester index rebuilds~~ (not needed: the elastic version check will prevent the script from running)
# ~~Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster.~~ https://gerrit.wikimedia.org/r/291257
# SWAT out config patch sending wmf.4 searches to codfw with Monday evening (SF) SWAT
# Pull plugins repository on terbium and sync out to all elasticsearch servers
## eqiad systems **must not** be restarted after this has been done
# Copy ttmserver index from eqiad to codfw (16 to 20 min with ad hoc bash script on terbium)
# Install elasticsearch 2.3.3 deb to all elasticsearch servers in codfw
# Bring full codfw cluster down
# Start all codfw master nodes
## Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
# After a master has been decided, bring up all codfw data nodes
# Ensure codfw cluster is green and that writes are draining from the job queue into the cluster

Tues:
* Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane

Wed:
* Be available during train rollout
* Same as Tues, but for non-Wikipedia sites

Thurs:
# Be available during train deploy for any potential issues.
# Verify via grafana that all search traffic is on codfw.
# Prepare and merge patch enabling completion suggester index rebuilds

Fri/Mon:
# Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
# Follow same process to upgrade eqiad cluster as was used for codfw
# Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster

Alternative ideas rejected:
* Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.

Expected errors:
* If the 1.7 code is talking to a 2.x cluster (index updates), any exception such as the normal DocumentMissingException will cause an `Array to string conversion` notice. These are acceptable.
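
The plan above records the output of `_cluster/settings` before the restart, re-applies the transient settings afterwards, and then checks that the cluster is green. A minimal Python sketch of those three steps follows. It is an illustration under stated assumptions, not the script that was actually used: the function names are hypothetical, the base URL is taken from the `curl` command in the plan (no port given there, so none is added here), and it assumes the standard Elasticsearch `_cluster/settings` and `_cluster/health` endpoints.

```python
import json
import urllib.request

# Assumption: same host as in the plan's curl command. Adjust if the
# service is reached on a different scheme or port.
CLUSTER_URL = "http://search.svc.codfw.wmnet"


def fetch_json(path, base_url=CLUSTER_URL):
    """GET a cluster API path and return the parsed JSON response."""
    with urllib.request.urlopen(base_url + path) as resp:
        return json.load(resp)


def build_reapply_body(recorded):
    """Build the PUT body from a recorded /_cluster/settings response.

    Only the transient section is kept, since transient settings are the
    ones lost when the whole cluster is brought down.
    """
    return {"transient": recorded.get("transient", {})}


def reapply_transient_settings(recorded, base_url=CLUSTER_URL):
    """PUT the recorded transient settings back to /_cluster/settings."""
    body = json.dumps(build_reapply_body(recorded)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def is_green(health):
    """True when a /_cluster/health response reports status 'green'."""
    return health.get("status") == "green"
```

Before the shutdown, `recorded = fetch_json("/_cluster/settings")` would be saved to disk; once a master has been elected, `reapply_transient_settings(recorded)` restores the settings and `is_green(fetch_json("/_cluster/health"))` can be polled until it returns `True`.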