As with the 1.x -> 2.x upgrade, we need a plan for the production migration. The initial plan, based partly on the 1.x -> 2.x upgrade, is as follows:
- We have verified that both the 2.x and the 5.x codebases can write to a cluster running the other version, so writes shouldn't be a problem (although super_detect_noop does need to be disabled during the transition).
- Verify (how?) that all indices have explicit slowlog values set. There is no longer a global slowlog configuration; it is set per index.
- Upgrade codfw to 5.x (hopefully a one-day activity, on a Monday; not sure how to make sure it fits in a day, though). Initial testing has been done and it doesn't look like there will be any problems loading our 2.x indices into a 5.x instance.
- Ship patch to stop sending writes to codfw. Note the time (SAL log or whatever)
- Remove codfw from $wgCirrusSearchWriteClusters
- Disable mirroring from eqiad to codfw in $wgTranslateClustersAndMirrors
- Disable updateSuggesterIndex cron for codfw
- unset $wgCirrusSearchWikimediaExtraPlugin['super_detect_noop']
- Delete all completion suggester indices in codfw. The old format will never be used, and deleting them may speed up recovery?
- Shut down all hosts in codfw.
- Ship puppet patch to update elasticsearch.yml for 5.x
- Ship new plugins to all codfw servers
- Enable apt experimental repo
- Install new elasticsearch.deb
- Bring up a single master-capable node. Make sure it loads the indices off disk without errors. Once a node is started with 5.x it cannot be rolled back to 2.x without losing the data.
- Bring up the rest of the nodes and wait for green. There will probably be massive logspam as Elasticsearch tells us about upgrading 9000 shards to the new version.
- Ship a patch to mediawiki-config to re-enable writes to codfw
- Reindex the timespan codfw was down. May need some hackishness to let the 2.x maint script talk to 5.x (just the version check). Writes stopped at 2017/03/13 15:20 UTC and were re-enabled at 17:30 UTC the same day.
- QUESTION: Can we rebuild completion suggester indices with the next version (not yet deployed)?
- Ship patch to stop sending writes to codfw. Note the time (SAL log or whatever)
- Update mediawiki-config to send all search queries for the next train deployment to the codfw cluster. See https://gerrit.wikimedia.org/r/291257 for how we did this last time. Don't forget about TTMServer.
- This patch should probably also disable the completion suggester and let all autocomplete fall back to prefix search. Serving autocomplete from titlesuggest indices built under 2.x will return undesirable results.
- Merge es5 branch of CirrusSearch and Elastica into master, along with https://gerrit.wikimedia.org/r/338032 for GeoData, and let it ride the train
- As the train rolls forward, rebuild titlesuggest indices with 5.x. After all indices have been rebuilt, re-enable completion search.
- As the train rolls forward, we should disable the updateSuggesterIndex cron for eqiad.
- After the train has rolled forward to all wikis, perform the same upgrade actions on eqiad that we performed on codfw. This will require more coordination to ensure other use cases (api feature logs and phabricator) continue to work.
- Phabricator should be able to point to codfw and do a rebuild from a maint script, then point the search at codfw, then do yet another rebuild to cover any updates that occurred between the first rebuild and the switchover.
- Check with @Anomie about api feature log. Perhaps downtime is acceptable here?
- Eqiad received no writes from March 21 at 13:30 UTC to 17:50 UTC the same day.
- Revert prior patch sending all traffic to codfw, allowing traffic to flow to eqiad again.
- Re-enable completion suggester cron by reverting https://gerrit.wikimedia.org/r/#/c/342487
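
One possible answer to the "how?" on the slowlog step: fetch the settings of every index (for example the JSON body of `GET /_all/_settings`) and report any index that lacks explicit thresholds. The sketch below is only an illustration. The two required keys and the index names in the sample are assumptions, not our actual configuration, and the function only inspects a response that has already been fetched.

```python
from typing import Any, Dict, List, Sequence

# Thresholds every index is expected to set explicitly.
# ASSUMPTION: these two keys are illustrative; substitute the ones we rely on.
REQUIRED_KEYS = (
    "index.search.slowlog.threshold.query.warn",
    "index.search.slowlog.threshold.fetch.warn",
)


def flatten(settings: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested settings into dotted keys; already-flat keys pass through."""
    flat: Dict[str, Any] = {}
    for key, value in settings.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def indices_missing_slowlog(
    all_settings: Dict[str, Any],
    required: Sequence[str] = REQUIRED_KEYS,
) -> Dict[str, List[str]]:
    """Map each index name to the required slowlog keys it does not set."""
    missing: Dict[str, List[str]] = {}
    for index, body in all_settings.items():
        flat = flatten(body.get("settings", {}))
        absent = [key for key in required if key not in flat]
        if absent:
            missing[index] = absent
    return missing


# Hypothetical sample shaped like a settings response for two indices.
sample = {
    "enwiki_content": {"settings": {"index": {"search": {"slowlog": {
        "threshold": {"query": {"warn": "10s"}, "fetch": {"warn": "1s"}}}}}}},
    "enwiki_titlesuggest": {"settings": {"index": {"number_of_shards": "4"}}},
}

for name, keys in indices_missing_slowlog(sample).items():
    print(name, "is missing", ", ".join(keys))
```

Because the keys are flattened before comparison, the same check works whether the response was requested with nested or flat settings.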
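
For the "wait for green" step, a rough sketch of a polling helper follows. It assumes the caller supplies a function returning the parsed JSON of `GET /_cluster/health`; that fetch function, the poll count, and the interval are all hypothetical placeholders, and only the health fields named in the code are read.

```python
import time
from typing import Any, Callable, Dict


def cluster_is_settled(health: Dict[str, Any]) -> bool:
    """True when status is green and no shards are still moving or starting.

    Missing counters default to 1 so an incomplete response is never
    mistaken for a settled cluster.
    """
    return (
        health.get("status") == "green"
        and health.get("unassigned_shards", 1) == 0
        and health.get("initializing_shards", 1) == 0
        and health.get("relocating_shards", 1) == 0
    )


def wait_for_green(
    fetch_health: Callable[[], Dict[str, Any]],
    max_polls: int = 120,          # ASSUMPTION: arbitrary upper bound
    interval_s: float = 30.0,      # ASSUMPTION: arbitrary poll interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the cluster settles; return False if max_polls is exhausted."""
    for attempt in range(max_polls):
        if cluster_is_settled(fetch_health()):
            return True
        if attempt < max_polls - 1:
            sleep(interval_s)
    return False
```

Passing `sleep` as a parameter keeps the helper testable without real delays. In practice `fetch_health` would wrap whatever HTTP client is at hand.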
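
The reindex window for the codfw outage can be derived from the timestamps recorded above. The sketch below pads both ends of the window; the 30-minute padding is an assumption chosen for illustration, not a value anyone has agreed on, and how the resulting timestamps are passed to the maint script is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def reindex_window(
    stopped: datetime,
    resumed: datetime,
    padding: timedelta = timedelta(minutes=30),  # ASSUMPTION: illustrative margin
) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the write outage plus padding on each side."""
    if resumed <= stopped:
        raise ValueError("writes must resume after they stop")
    return stopped - padding, resumed + padding


# codfw outage as recorded in the plan: 2017/03/13 15:20 UTC to 17:30 UTC.
codfw_stopped = datetime(2017, 3, 13, 15, 20, tzinfo=timezone.utc)
codfw_resumed = datetime(2017, 3, 13, 17, 30, tzinfo=timezone.utc)

start, end = reindex_window(codfw_stopped, codfw_resumed)
fmt = "%Y-%m-%dT%H:%M:%SZ"
print(start.strftime(fmt), end.strftime(fmt))
# prints: 2017-03-13T14:50:00Z 2017-03-13T18:00:00Z
```

Padding the window errs on the side of reindexing a few extra pages rather than missing edits made while the write-disabling patch was still rolling out.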