[Recurring task] CirrusSearch: what is updated during re-indexing
Open, Normal, Public

Description

Since we reindex on a somewhat infrequent basis, we should have a ticket that collects the updates that will happen during the next reindex.

This will be a recurring ticket, meaning that each time we do a reindex, we'll note it here and then use it again to list out new updates that will take effect the next time a reindex is done.

We should consider making announcements to the communities that will be affected by the re-indexing, especially when there is a long gap between the discussion of an upcoming re-index and the actual re-index.

Items not yet done:

Items done:

debt triaged this task as Normal priority. Oct 5 2016, 7:57 PM
TJones updated the task description. Oct 6 2016, 6:07 PM

Done:

  • T145023: Searching for insource:tag finds <tag> but not {{#tag:tag}}

In progress (review)

  • T137830: Use the icu_folding filter if available instead of asciifolding
  • T146402: Add ICU_folding filter for EN, FR and EL wiki projects
  • NO_TASK: Add support for ICU tokenization: https://gerrit.wikimedia.org/r/#/c/313577/ (relates to issues identified with scriptio continua languages).
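
For anyone unfamiliar with what the folding filters in the list above do, here is a rough Python sketch of the general idea: decompose each character and drop the combining marks, so that accented and unaccented spellings match. This is only an illustration using the standard library. The `fold_diacritics` helper is hypothetical, and the real `asciifolding` and `icu_folding` filters apply much larger mapping tables than this.

```python
import unicodedata


def fold_diacritics(text):
    """Strip combining marks so accented and plain spellings compare equal.

    Hypothetical helper for illustration only; it is not how the
    search filters are implemented.
    """
    # NFD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["café", "ÿ", "Ё"]:
        print(word, "->", fold_diacritics(word))
```

With this sketch, `café` folds to `cafe`, and the Cyrillic `Ё` folds to the Cyrillic `Е`, which is the kind of equivalence a folded index gives to search queries.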

Will need a reindex for NS_FILE:

  • T145561: Reindex all image files to include metadata index fields

Adding a previously written ticket for communicating the addition of the ability to search by metadata after the re-index: T146907

debt claimed this task. Oct 25 2016, 5:27 PM
debt moved this task from Backlog to In progress on the Discovery-Search (Current work) board.

Need to finalize email

Email sent to wikitech-l, wikitech ambassadors, and discovery email lists:
https://lists.wikimedia.org/pipermail/discovery/2016-October/001324.html
https://lists.wikimedia.org/pipermail/wikitech-l/2016-October/086867.html
https://lists.wikimedia.org/pipermail/wikitech-ambassadors/2016-October/001493.html

Latest search updates
After extensive testing over the last several months of a new search query scoring method called BM25 (Best Matching) [1], we recently completed a limited production release to the following top languages: English, German, Spanish, Russian, Portuguese, French, Italian, Polish, Dutch and Arabic. This new release replaces the older scoring method called tf-idf (term frequency-inverse document frequency) [2].

We have additional testing to do [3,4] to figure out whether BM25 will work in languages that don't use spaces between their words, such as Japanese and Chinese.

The Discovery team announces much of our completed work in weekly status updates [5, 6], but some of the work isn't obvious to anyone who uses our search engine, because it isn't actually 'live' until a complete re-index of the servers occurs. We've created a recurring ticket in Phabricator [7] to keep track of the work that goes live in production after a re-index, such as the one we've also just completed. A few highlights of the recent re-index are implementing ascii-folding for the French language and fixing several bugs for French ÿ and for Russian 'Е' and 'Ё' when those characters are entered in a search query.

Cheers from the Discovery Search Team!

[1] https://en.wikipedia.org/wiki/Okapi_BM25
[2] https://en.wikipedia.org/wiki/Tf%E2%80%93idf
[3] https://phabricator.wikimedia.org/T147495
[4] https://phabricator.wikimedia.org/T147501
[5] https://www.mediawiki.org/wiki/Wikimedia_Discovery#Updates
[6] https://www.mediawiki.org/wiki/Discovery/Status_updates
[7] https://phabricator.wikimedia.org/T147505
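
As a rough illustration of the BM25 scoring mentioned in the email above [1], the sketch below scores one document for a query using the textbook Okapi BM25 formula with a common non-negative variant of the IDF term. The `bm25_score` function, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, not the settings used in production.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with Okapi BM25.

    corpus is a list of tokenized documents (lists of strings).
    k1 and b are illustrative defaults, not production values.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        # Document frequency: how many documents contain the term.
        df = sum(1 for doc in corpus if term in doc)
        if df == 0:
            continue
        # Rare terms get a higher weight; the "1 +" keeps the value positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        # Length normalization: long documents are penalized via b.
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


if __name__ == "__main__":
    corpus = [
        ["x", "y", "y", "y"],
        ["x", "x", "y", "y"],
        ["z", "z", "z", "z"],
    ]
    for doc in corpus:
        print(doc, bm25_score(["x"], doc, corpus))
```

Unlike plain tf-idf, the term-frequency part saturates: in the toy corpus, the document that contains `x` twice scores higher than the one that contains it once, but less than twice as high.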

Moving to backlog for further additions.

debt added a comment. Jan 24 2017, 6:34 PM

We'll probably need to add some more to this list - please comment when you can, @dcausse :)

Added content_model field: T147505
So the mapping probably needs to be adjusted on the next reindex.

@EBernhardson, thanks for adding Polish and Swedish!

debt updated the task description. Jul 11 2017, 5:47 PM
debt updated the task description. Jul 11 2017, 5:49 PM

Updated the ticket to keep items in the description, so we don't have to scroll down through the ticket to see what needs to be done and what is already done. :)

I think I reindexed hewiki (in-place) in the course of reindexing the wikis for archive. Is that enough, or is a more thorough reindex needed?

It needs to be re-indexed after the new language analyzer is deployed, which hasn't happened yet, so it still needs to be listed here.

Smalyshev updated the task description. Jul 27 2017, 12:53 AM
TJones updated the task description. Aug 16 2017, 10:19 PM
EBernhardson updated the task description. Sep 20 2017, 2:58 PM
dcausse updated the task description. Sep 21 2017, 3:52 PM
EBernhardson updated the task description. Sep 21 2017, 3:55 PM
TJones updated the task description. Sep 25 2017, 5:47 PM

I added a note to consider who should be notified of upcoming re-indexes. Hebrew was delayed for months, so no one was expecting the change when it happened.

A list of places to make any future announcement would be useful, but I'm not sure what all to include.

T147959 should probably be added here?

TJones updated the task description. Oct 10 2017, 7:07 PM

I created a new sub-task (T177871), which has been added to the task description.

Smalyshev updated the task description. Oct 17 2017, 5:29 PM
TJones updated the task description. Oct 24 2017, 4:10 PM
Smalyshev updated the task description. Oct 24 2017, 5:27 PM
Smalyshev updated the task description. Nov 27 2017, 6:40 PM
debt updated the task description. Dec 19 2017, 6:21 PM
Smalyshev updated the task description. Jan 3 2018, 9:16 PM
Smalyshev updated the task description. Mar 1 2018, 1:15 AM
TJones updated the task description. Mar 5 2018, 5:49 PM
TJones updated the task description. Apr 5 2018, 4:41 PM
Smalyshev updated the task description. May 8 2018, 11:22 PM
Smalyshev updated the task description. Fri, May 25, 5:37 AM
Smalyshev updated the task description.
Smalyshev updated the task description. Fri, Jun 1, 5:33 PM
EBernhardson updated the task description. Fri, Jun 1, 7:46 PM
EBernhardson updated the task description. Fri, Jun 1, 7:58 PM
TJones updated the task description. Mon, Jun 4, 8:03 PM
TJones updated the task description. Tue, Jun 5, 6:46 PM
TJones updated the task description. Thu, Jun 7, 5:15 PM
Smalyshev updated the task description. Fri, Jun 8, 11:49 PM
TJones updated the task description. Mon, Jun 18, 2:26 AM
TJones updated the task description. Tue, Jun 19, 6:16 PM
TJones updated the task description. Thu, Jun 21, 5:49 PM