Put together a production migration plan for ES 2 -> ES 5
Open, Normal, Public

Description

Similar to the 1.x -> 2.x upgrade, we need to come up with a plan for the production migration. The initial plan, based partly on the 1.x -> 2.x upgrade, is as follows:

  1. Have verified that both the 2.x and the 5.x codebases can write to the other version's cluster. Writes shouldn't be a problem (although super_detect_noop does need to be disabled during the transition).
  2. Verify (how?) that all indices have explicit slowlog values set. There is no longer a global slowlog configuration; it is per-index.
  3. Upgrade codfw to 5.x (hopefully a one-day activity, on a Monday. Not sure how to make sure it fits in a day, though). Initial testing has been done and it doesn't look like there will be any problems loading our 2.x indices into a 5.x instance.
    1. Ship patch to stop sending writes to codfw. Note the time (SAL log or whatever)
      1. Remove codfw from wgCirrusSearchWriteClusters
      2. Disable mirroring from eqiad to codfw in $wgTranslateClustersAndMirrors
      3. Disable updateSuggesterIndex cron for codfw
      4. unset $wgCirrusSearchWikimediaExtraPlugin['super_detect_noop']
    2. Delete all completion suggester indices in codfw; the old format will never be used, and deleting them may speed up recovery?
    3. Shut down all hosts in codfw.
    4. Ship puppet patch to update elasticsearch.yml for 5.x
    5. Ship new plugins to all codfw servers
    6. Enable apt experimental repo
    7. Install new elasticsearch.deb
    8. Bring up a single master-capable node. Make sure it loads the indices off disk without errors. Once a node is started with 5.x it cannot be rolled back to 2.x without losing the data.
    9. Bring up the rest of the nodes, wait for green. There will probably be massive logspam as elasticsearch tells us about upgrading 9000 shards to the new version.
    10. Ship a patch to mediawiki-config to re-enable writes to codfw
    11. Reindex the timespan codfw was down. May need some hackishness to let the 2.x maint script talk to 5.x (just the version check). Writes stopped at 2017-03-13 15:20 UTC and were re-enabled at 17:30 UTC the same day.
      1. QUESTION: Can we rebuild completion suggester indices with the next version (not yet deployed)?
  4. Update mediawiki-config to send all search queries for next train deployment to codfw cluster. See https://gerrit.wikimedia.org/r/291257 for how we did this last time. Don't forget about TTMServer
    1. This patch should probably also disable the completion suggester. Serving autocomplete with titlesuggest indices built under 2.x will return undesirable results. Instead we should disable the completion suggester and let all autocomplete fall back to prefix search.
  5. Merge es5 branch of CirrusSearch and Elastica into master, along with https://gerrit.wikimedia.org/r/338032 for GeoData, and let it ride the train
  6. As the train rolls forward rebuild titlesuggest indices with 5.x. After all indices have been rebuilt re-enable completion search.
  7. As the train rolls forward we should disable updateSuggesterIndex cron for eqiad
  8. After the train has rolled forward to all wikis, perform on eqiad the same upgrade actions we did on codfw. This will require more coordination to ensure other use cases (api feature logs and phabricator) continue to work.
    1. Phabricator should be able to point to codfw and do a rebuild from a maint script, then point the search at codfw, then do yet another rebuild to cover any updates that occurred between the first rebuild and the switchover.
    2. Check with @Anomie about api feature log. Perhaps downtime is acceptable here?
    3. Eqiad stopped receiving writes from March 21 at 13:30 UTC to 17:50 UTC same day
  9. Revert prior patch sending all traffic to codfw, allowing traffic to flow to eqiad again.
  10. Re-enable completion suggester cron by reverting https://gerrit.wikimedia.org/r/#/c/342487
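
Two of the checks in the plan above (step 2, explicit slowlog values, and step 3.9, waiting for green) could be sketched roughly as follows. This is only an illustration, not what was run in production. It assumes the settings were fetched with `GET /_all/_settings?flat_settings=true` (which returns only explicitly set values unless defaults are requested) and the health with `GET /_cluster/health`. The list of thresholds that must be explicit is an assumption made for the example, as are the sample index names.

```python
# Sketch only: both functions take already-decoded JSON responses, so no
# HTTP client is needed. Which thresholds must be explicit is an assumption.
REQUIRED_SLOWLOG_KEYS = (
    "index.search.slowlog.threshold.query.warn",
    "index.search.slowlog.threshold.fetch.warn",
    "index.indexing.slowlog.threshold.index.warn",
)


def indices_missing_slowlog(settings_by_index, required=REQUIRED_SLOWLOG_KEYS):
    """Map each index name to the required slowlog keys it does not set."""
    missing = {}
    for index_name, body in settings_by_index.items():
        flat_settings = body.get("settings", {})
        absent = [key for key in required if key not in flat_settings]
        if absent:
            missing[index_name] = absent
    return missing


def cluster_is_green(health):
    """True when the cluster reports green with no shards still recovering."""
    return (
        health.get("status") == "green"
        and health.get("initializing_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )


# Hypothetical responses, made up for illustration.
sample_settings = {
    "testwiki_content": {"settings": {key: "10s" for key in REQUIRED_SLOWLOG_KEYS}},
    "testwiki_general": {"settings": {"index.number_of_shards": "1"}},
}
sample_health = {"status": "yellow", "initializing_shards": 12, "unassigned_shards": 300}

print(indices_missing_slowlog(sample_settings))
print(cluster_is_green(sample_health))
```

Any index reported by the first function would need its slowlog thresholds set before the upgrade. The second function is the condition to poll for before re-enabling writes.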

Another option, I suppose, is to find a way to ride the train with the 2.x -> 5.x update. This would prevent the blip of downtime while deploying the incompatible vendor library and es5 code. We could set up mediawiki-config in such a way that the newly deployed branch has its default search cluster pointed at the codfw (5.x) cluster.

I looked back through our history, and it looks like we rode the train for the 1.x -> 2.x migration. I think that will be an appropriate option this time as well. The appropriate patch for mediawiki-config was https://gerrit.wikimedia.org/r/#/c/291257/

EBernhardson edited the task description. Feb 8 2017, 7:58 PM
EBernhardson added a subscriber: Anomie.
EBernhardson edited the task description. Feb 8 2017, 9:26 PM
EBernhardson added subscribers: Gehel, dcausse.

@Gehel @dcausse I think this deployment plan is now complete, although there is certainly opportunity for me to have missed a few things. Please review and advise.

Check with @Anomie about api feature log. Perhaps downtime is acceptable here?

It's not a critical service, although I hope some people at least are using it. Are we talking a day here?

BTW, I note that the raw data is stored on fluorine and in logstash so the missed data could theoretically be loaded in after the downtime.

We should have already worked through most of the kinks doing the codfw upgrade, so eqiad ought to go relatively quickly. As long as we stick to the process of shutting the whole cluster down, upgrading, and bringing it back up, I expect it to be able to accept writes again on the same day for apifeatureusage.

EBernhardson added a comment. Edited Feb 21 2017, 7:20 PM

@mmodal Is the migration idea for phabricator above sensible? We are expecting to have around a day's worth of downtime in the eqiad cluster, so phabricator will have to point at codfw during that timeframe.

@dcausse @Gehel ping for the ticket I mentioned in our meeting about the 2->5 migration plan

Gehel added a comment. Feb 23 2017, 2:09 PM

Looks good to me. We might learn a few more things when upgrading relforge and will adapt the plan at this point.

We need to upgrade plugins at the same time as elasticsearch (it might be implied in the plan, but let's make it explicit).

dcausse edited the task description. Feb 23 2017, 2:36 PM

QUESTION: Can we rebuild completion suggester indices with the next version (not yet deployed)?

We could probably find some way to hack this into place, but generally the prod deploy of mediawiki has no support for running wikis in a branch other than the one currently configured.

dcausse edited the task description. Tue, Mar 7, 6:42 PM

Change 342031 had a related patch set uploaded (by DCausse):
[operations/mediawiki-config] [es5 upgrade] step 1: depool codfw for writes

https://gerrit.wikimedia.org/r/342031

Change 342032 had a related patch set uploaded (by DCausse):
[operations/mediawiki-config] [es5 upgrade] step 2: repool codfw and send wmf16 to codfw

https://gerrit.wikimedia.org/r/342032

Change 342033 had a related patch set uploaded (by DCausse):
[operations/mediawiki-config] [es5 upgrade] step 3: depool eqiad for writes

https://gerrit.wikimedia.org/r/342033

Change 342034 had a related patch set uploaded (by DCausse):
[operations/mediawiki-config] [es5 upgrade] step 4: repool eqiad and restore normal operations

https://gerrit.wikimedia.org/r/342034

Gehel added a comment. Mon, Mar 13, 9:29 AM

Upgrade will be done to elasticsearch 5.1.2 (and not the latest 5.2.x) as we found an issue on 5.2.x (T159891).

Change 342031 merged by jenkins-bot:
[operations/mediawiki-config] [es5 upgrade] step 1: depool codfw for writes

https://gerrit.wikimedia.org/r/342031

dcausse edited the task description. Mon, Mar 13, 3:34 PM
Gehel edited the task description. Mon, Mar 13, 4:10 PM

Change 342032 merged by jenkins-bot:
[operations/mediawiki-config] [es5 upgrade] step 2: repool codfw and send wmf16 to codfw

https://gerrit.wikimedia.org/r/342032

dcausse edited the task description. Mon, Mar 13, 5:41 PM

codfw has transitioned to elasticsearch 5.1.2. As the train rolls forward tomorrow group0 wikis will start using it.

Change 342626 had a related patch set uploaded (by 20after4):
[operations/puppet] Phabricator: add config for elasticsearch 5 in codfw

https://gerrit.wikimedia.org/r/342626

Change 342626 merged by Dzahn:
[operations/puppet] Phabricator: add config for elasticsearch 5 in codfw

https://gerrit.wikimedia.org/r/342626

Change 342033 merged by jenkins-bot:
[operations/mediawiki-config] [es5 upgrade] step 3: depool eqiad for writes

https://gerrit.wikimedia.org/r/342033

Mentioned in SAL (#wikimedia-operations) [2017-03-20T15:34:38Z] <dcausse@tin> Synchronized wmf-config/CommonSettings.php: T157479 [es5 upgrade] step 3: depool eqiad for writes (1/3) (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2017-03-20T15:37:03Z] <dcausse@tin> Synchronized wmf-config/InitialiseSettings.php: T157479 [es5 upgrade] step 3: depool eqiad for writes (2/3) (duration: 00m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-03-20T15:39:04Z] <dcausse@tin> Synchronized wmf-config/CirrusSearch-common.php: T157479 [es5 upgrade] step 3: depool eqiad for writes (3/3) (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-03-20T15:40:47Z] <dcausse@tin> Synchronized wmf-config/CirrusSearch-common.php: Revert: T157479 [es5 upgrade] step 3: depool eqiad for writes (3/3) (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-03-20T15:49:35Z] <dcausse@tin> Synchronized wmf-config/InitialiseSettings.php: Revert: T157479 [es5 upgrade] step 3: depool eqiad for writes (2/3) (duration: 00m 42s)

Mentioned in SAL (#wikimedia-operations) [2017-03-20T15:50:34Z] <dcausse@tin> Synchronized wmf-config/CommonSettings.php: Revert: T157479 [es5 upgrade] step 3: depool eqiad for writes (1/3) (duration: 00m 42s)

Change 343665 had a related patch set uploaded (by DCausse):
[operations/mediawiki-config] [es5 upgrade] step 3: depool eqiad for writes (take 2)

https://gerrit.wikimedia.org/r/343665

Change 343665 merged by jenkins-bot:
[operations/mediawiki-config] [es5 upgrade] step 3: depool eqiad for writes (take 2)

https://gerrit.wikimedia.org/r/343665

Change 343869 had a related patch set uploaded (by Gehel):
[operations/puppet] elasticsearch - upgrade eqiad to elasticsearch 5

https://gerrit.wikimedia.org/r/343869

Mentioned in SAL (#wikimedia-operations) [2017-03-21T14:07:34Z] <gehel> upgrading elasticsearch eqiad to v5.x - T157479

dcausse edited the task description. Tue, Mar 21, 2:09 PM

Change 343869 merged by Gehel:
[operations/puppet] elasticsearch - upgrade eqiad to elasticsearch 5

https://gerrit.wikimedia.org/r/343869

Mentioned in SAL (#wikimedia-operations) [2017-03-21T14:34:12Z] <gehel> deleting old v2 indices from elastic1030: azbwiki_general_first, vewikimedia_content_1415331110, vewikimedia_general_1415331150 - T157479

Mentioned in SAL (#wikimedia-operations) [2017-03-21T14:39:25Z] <gehel> deleting old v2 indices from each elasticsearch server - T157479

Mentioned in SAL (#wikimedia-operations) [2017-03-21T14:44:03Z] <gehel> elasticsearch eqiad, full cluster restart after cleanup of known old indices - T157479

Mentioned in SAL (#wikimedia-operations) [2017-03-21T14:58:56Z] <gehel> elasticsearch upgrade on eqiad is completed - T157479

Change 343919 had a related patch set uploaded (by DCausse):
[operations/mediawiki-config] [es5 upgrade] step 4: repool eqiad for writes

https://gerrit.wikimedia.org/r/343919

Change 343919 merged by jenkins-bot:
[operations/mediawiki-config] [es5 upgrade] step 4: repool eqiad for writes

https://gerrit.wikimedia.org/r/343919

dcausse edited the task description. Tue, Mar 21, 6:08 PM

Mentioned in SAL (#wikimedia-operations) [2017-03-22T12:29:01Z] <dcausse> cirrus: reindexing lost writes (2017-03-21T13:30:00Z to 2017-03-21T17:50:00Z) during es5 upgrade in elastic@eqiad (T157479)
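
The reindex window logged above can be derived from the times at which writes stopped and resumed. The following is a minimal sketch, not the command that was actually run. The `--from`, `--to` and `--cluster` flags on forceSearchIndex.php and the `mwscript` invocation form are assumptions here and should be checked against the deployed script. The optional safety margin is also an assumption.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def reindex_window(writes_stopped, writes_resumed, margin=timedelta(0)):
    """Return (start, end) UTC timestamps covering the write outage."""
    if writes_resumed <= writes_stopped:
        raise ValueError("writes must resume after they stop")
    start = (writes_stopped - margin).astimezone(timezone.utc)
    end = (writes_resumed + margin).astimezone(timezone.utc)
    return start.strftime(TIMESTAMP_FORMAT), end.strftime(TIMESTAMP_FORMAT)


def reindex_command(wiki, window, cluster):
    """Build an argument list for the maintenance script (flags are assumed)."""
    start, end = window
    return [
        "mwscript",
        "extensions/CirrusSearch/maintenance/forceSearchIndex.php",
        f"--wiki={wiki}",
        f"--cluster={cluster}",
        f"--from={start}",
        f"--to={end}",
    ]


# Times taken from the eqiad outage recorded in this task.
stopped = datetime(2017, 3, 21, 13, 30, tzinfo=timezone.utc)
resumed = datetime(2017, 3, 21, 17, 50, tzinfo=timezone.utc)
window = reindex_window(stopped, resumed)
print(" ".join(reindex_command("enwiki", window, "eqiad")))
```

Passing a non-zero `margin` widens the window on both sides, which trades some redundant reindexing for tolerance of clock skew around the depool and repool deploys.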

dcausse added a comment. Edited Thu, Mar 23, 9:21 AM

The reindexing for enwiki failed with:

[9d01b7f9070f43413869b587] [no req]   DBConnectionError from line 755 of /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/rdbms/database/Database.php: Cannot access the database: MySQL server has gone away (10.64.48.153)
Backtrace:
#0 /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(943): Database->reportConnectionError(string)
#1 /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(621): Wikimedia\Rdbms\LoadBalancer->reportConnectionError()
#2 /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(693): Wikimedia\Rdbms\LoadBalancer->getConnection(integer, string, string)
#3 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUtils.php(65): Wikimedia\Rdbms\LoadBalancer->getConnectionRef(integer, string, string)
#4 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php(487): CentralAuthUtils::getCentralSlaveDB()
#5 /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/objectcache/WANObjectCache.php(889): CentralAuthUser->{closure}(boolean, integer, array, NULL)
#6 [internal function]: WANObjectCache->{closure}(boolean, integer, array, NULL)
#7 /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/objectcache/WANObjectCache.php(1009): call_user_func_array(Closure, array)
#8 /srv/mediawiki/php-1.29.0-wmf.16/includes/libs/objectcache/WANObjectCache.php(895): WANObjectCache->doGetWithSetCallback(string, integer, Closure, array, NULL)
#9 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php(502): WANObjectCache->getWithSetCallback(string, integer, Closure, array)
#10 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php(357): CentralAuthUser->loadFromCache()
#11 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CentralAuth/includes/CentralAuthUser.php(532): CentralAuthUser->loadState()
#12 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CentralAuth/includes/CentralAuthIdLookup.php(83): CentralAuthUser->getId()
#13 /srv/mediawiki/php-1.29.0-wmf.16/extensions/GlobalUserPage/GlobalUserPage.body.php(143): CentralAuthIdLookup->isAttached(User)
#14 /srv/mediawiki/php-1.29.0-wmf.16/extensions/GlobalUserPage/GlobalUserPage.hooks.php(149): GlobalUserPage::shouldDisplayGlobalPage(Title)
#15 [internal function]: GlobalUserPageHooks::onWikiPageFactory(Title, NULL)
#16 /srv/mediawiki/php-1.29.0-wmf.16/includes/Hooks.php(186): call_user_func_array(string, array)
#17 /srv/mediawiki/php-1.29.0-wmf.16/includes/page/WikiPage.php(126): Hooks::run(string, array)
#18 /srv/mediawiki/php-1.29.0-wmf.16/includes/page/WikiPage.php(182): WikiPage::factory(Title)
#19 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CirrusSearch/maintenance/forceSearchIndex.php(469): WikiPage::newFromRow(stdClass, integer)
#20 [internal function]: CirrusSearch\ForceSearchIndex->CirrusSearch\{closure}(array)
#21 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CirrusSearch/includes/iterator/CallbackIterator.php(19): call_user_func(Closure, array)
#22 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CirrusSearch/maintenance/forceSearchIndex.php(175): CirrusSearch\Iterator\CallbackIterator->current()
#23 /srv/mediawiki/php-1.29.0-wmf.16/maintenance/doMaintenance.php(111): CirrusSearch\ForceSearchIndex->execute()
#24 /srv/mediawiki/php-1.29.0-wmf.16/extensions/CirrusSearch/maintenance/forceSearchIndex.php(591): require_once(string)
#25 /srv/mediawiki/multiversion/MWScript.php(99): require_once(string)
#26 {main}

I'm afraid that this process is too slow. I'm tempted to let the sanitizer fix the problem itself.
@EBernhardson, @Gehel any objections?
If you're OK with that, we can switch traffic back to eqiad whenever we want.

What is the user impact of this error? A few pages missing from the index? If so, how many, roughly?

@Deskana this is hard to tell...
Looking at the counts, eqiad seems to lack 2k docs, but only 60 of them are in the main namespace.
Sadly I cannot track lost updates to existing pages, since that would require running this slow query.
I think we have two options:

  • conservative: continue to serve traffic from codfw while eqiad is being fixed by the sanitizer (roughly two weeks)
  • switch to eqiad assuming that the discrepancies are not too dramatic and won't cause much frustration.

@EBernhardson may have an idea for a third option?
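
A per-namespace comparison of document counts between the two clusters, like the one described above, could be sketched as follows. The counts here are made up for illustration and are not the real figures. As noted above, a count comparison only finds missing documents; it cannot detect lost updates to pages that exist in both clusters.

```python
def missing_by_namespace(reference_counts, candidate_counts):
    """Return {namespace: shortfall} where the candidate has fewer docs."""
    shortfall = {}
    for namespace, reference_total in reference_counts.items():
        gap = reference_total - candidate_counts.get(namespace, 0)
        if gap > 0:
            shortfall[namespace] = gap
    return shortfall


# Hypothetical per-namespace doc counts (namespace id -> count).
codfw_counts = {0: 1000, 1: 500, 2: 300}
eqiad_counts = {0: 994, 1: 500, 2: 280}

gaps = missing_by_namespace(codfw_counts, eqiad_counts)
print(gaps, sum(gaps.values()))
```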

From a user perspective, I think it's fine if there are ~60 pages missing, as your average user is incredibly unlikely to be searching for one of the missing pages. I wouldn't block this purely on that, since like you say the sanitiser should take care of it eventually.

That said, if there are outstanding technical concerns (like we think this may be an indication of a larger issue) I'm fine with the conservative approach too.

I think the current solution would be to remove the request for the 'vslow' database when iterating updates. That request was added because of T147957, a relatively expensive query that is triggered only when iterating updates in a single namespace. We could probably adjust the logic there to ask for vslow only in that particular case rather than always.

At a higher level though, the 2k docs is probably not the end of the world, and we could trigger a manual saneitize run to fix it, or just let the regular process fix it over time.
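
The vslow adjustment suggested above amounts to a small conditional. This is a sketch in Python with hypothetical names (`db_groups_for_iteration` and its parameters are not taken from the CirrusSearch code), meant only to show the intended decision rather than the actual patch:

```python
def db_groups_for_iteration(iterating_updates, namespace=None):
    """Return the replica groups to request for a page iteration.

    Only the expensive case (iterating updates restricted to a single
    namespace) asks for the 'vslow' replicas; everything else uses the
    default replicas.
    """
    needs_vslow = iterating_updates and namespace is not None
    return ["vslow"] if needs_vslow else []


print(db_groups_for_iteration(iterating_updates=True, namespace=0))
print(db_groups_for_iteration(iterating_updates=True))
```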

I think I'll send traffic back to eqiad next Monday during EU SWAT; by then the sanitizer will have fixed more docs, and I believe the discrepancies won't be noticeable.
Concerning the maint script itself, I created T161292 to implement what Erik suggested.