
Reindex Commons and Wikidata on eqiad and cloudelastic
Closed, Resolved · Public · 3 Estimated Story Points

Description

All of the Italian-language wikis and most of the numerous English-language wikis from the parent task (T274200) have finished reindexing.

The following still need to be reindexed:

  • file index for commons on eqiad
  • file index for commons on cloudelastic
  • all indexes for wikidata on eqiad
  • all indexes for wikidata on cloudelastic

It may (or may not) make sense to wait until some of the other recent reindexing-related tasks are complete before working on this task.

Event Timeline

https://phabricator.wikimedia.org/T280184 will likely block the reindexing effort (at least for cloudelastic) until we fix it.

It looks like previous reindexing attempts caused T279636 by creating excessive load on cloudelastic and stressing the garbage collector. Two things we can do to reduce the load of reindexing:

  • By default CirrusSearch asks Elasticsearch to reindex using one task per shard. That would be 32 parallel reindexing tasks running on the 6-node cloudelastic cluster (vs. the 35-node prod clusters). We can override this by passing --reindexSlices to the UpdateSearchIndexConfig.php script. If memory serves, the number of shards needs to be divisible by the chosen number of slices for efficiency reasons. While this can be done from the command line, I suppose Cirrus could be updated to check the number of nodes in the cluster and make a better decision when auto-detecting the number of slices.
  • While I didn't look too deeply, and we've already deleted the previous failures so can't check, it looks like we set the replica count when first creating the index. A more efficient method, which we already use for the completion suggester, is to create the index with 0 replicas, push all the data in, and then set the replica count to its final value. This lets the system copy the final state between instances instead of repeating the indexing work for each replica. This will require some light verification, and then code changes if it's not already doing this in some indirect form.
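The two ideas above can be sketched in a few lines of Python. This is an illustrative sketch, not CirrusSearch code: `pick_slices` and the settings payloads are hypothetical names, and the divisibility heuristic is just the rule of thumb described in the first bullet.

```python
# Sketch of the two load-reduction ideas (hypothetical helper names,
# not actual CirrusSearch code).

def pick_slices(shard_count: int, node_count: int) -> int:
    """Largest divisor of shard_count that does not exceed node_count,
    so slices divide the shards evenly and never outnumber the nodes."""
    for slices in range(min(shard_count, node_count), 0, -1):
        if shard_count % slices == 0:
            return slices
    return 1

# 32 shards on the 6-node cloudelastic: 4 slices instead of 32 tasks.
print(pick_slices(32, 6))    # -> 4
# 32 shards on a 35-node prod cluster: one slice per shard is fine.
print(pick_slices(32, 35))   # -> 32

# Replica toggling: index with 0 replicas, restore the count afterwards.
# These are standard Elasticsearch index-settings payloads; the final
# replica count of 2 is an assumed example value.
disable_replicas = {"index": {"number_of_replicas": 0}}
restore_replicas = {"index": {"number_of_replicas": 2}}
```

The heuristic falls back to 1 slice when no divisor fits (e.g. a prime shard count larger than the node count), which just means an unsliced reindex.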

Change 682768 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Limit load generated by Reindexer auto-slicing

https://gerrit.wikimedia.org/r/682768

Change 683117 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Disable replicas while reindexing

https://gerrit.wikimedia.org/r/683117

Change 682768 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Limit load generated by Reindexer auto-slicing

https://gerrit.wikimedia.org/r/682768

This will be ready for us to try reindexing again after this week's train.

Change 683117 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Disable replicas while reindexing

https://gerrit.wikimedia.org/r/683117

Removing Erik as the assignee because he worked on the code to improve reindexing (thanks!), but we still need to do the reindexing for these specific wikis, and anyone can pick up the task.

MPhamWMF set the point value for this task to 3.Jun 14 2021, 3:18 PM

The Cloudelastic reindex of commonswiki finished without an explicit error, but died when it tried to create an archive index—so it didn't get to the file index.

Hmm, that is "correct" operation with respect to the archive index: the archive index contains private data that can't be exposed to cloud. We should make it fail more gracefully. To reindex just the file index, we can change UpdateSearchIndexConfig.php to UpdateOneSearchIndexConfig.php and add --indexType file to the args.
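As a concrete sketch of that invocation (only --indexType file comes from the comment above; the mwscript wrapper, the --wiki value, and the --cluster flag are assumptions about the usual WMF maintenance-script setup):

```shell
# Hypothetical invocation: reindex only the file index of commonswiki
# against the cloudelastic cluster.
mwscript extensions/CirrusSearch/maintenance/UpdateOneSearchIndexConfig.php \
    --wiki=commonswiki \
    --cluster=cloudelastic \
    --indexType file
```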

Yeah... I just wasn't thinking about it. I have a tiny patch for T280184 that turns that fatal error into an output message, so it can continue on to the File index under normal operation.