
CompletionSuggester investigate expanding the list of stopwords
Open, Medium · Public

Description

Having only English stopwords leads to situations like the following, where a fuzzy match is shown even though a better match would be available if some non-English stopwords were ignored:

On English Wikipedia, if I type "Ruta Maya" in the search bar, the only suggestion I get is "Rita May (actress)". I'm impressed that it suggests Rita May, but confused that it doesn't suggest "La Ruta Maya" (which does get suggested at Special:Search).

Specific cases can be fixed by adding a redirect, but it would be interesting to investigate a more generic solution: expanding the stopwords to other languages, especially on English Wikipedia, where there are many non-English titles.

Event Timeline

kaldari created this task. Apr 11 2016, 9:07 PM
Restricted Application added a project: Discovery-Search. Apr 11 2016, 9:07 PM
Restricted Application added a subscriber: Aklapper.
kaldari renamed this task from "Ruta Maya" fails to find "La Ruta Maya" in search suggetions to "Ruta Maya" fails to find "La Ruta Maya" in search suggestions. Apr 11 2016, 9:07 PM

The reason it works on Special:Search is that it isn't really a suggestion; it's a full-text search result. The autocomplete (search bar) does some handling of stop words, but I think we are only stripping English stop words for the English Wikipedia. There is a good chance 'La' isn't considered a stop word in that case.

I suppose the open question should be: since enwiki contains plenty of non-English stop words in titles, should we find a way to handle those too? @dcausse is that easily possible?

Deskana closed this task as Resolved. Apr 11 2016, 9:16 PM
Deskana claimed this task.
Deskana added a subscriber: Deskana.

This is essentially working as intended. The search bar, even with all of the improvements made by the completion suggester, is still fundamentally a title prefix search. The best thing to do here is create a redirect which will help the prefix search point to the right article, which I have done. The next time the completion suggester index updates, it should pick this up; updates presently take around 24 hours, but it is a Q4 goal for the Search Team to get that back to almost instantaneous.

The best thing to do here is create a redirect which will help the prefix search point to the right article, which I have done. The next time the completion suggester index updates, it should pick this up; updates presently take around 24 hours [...]

The index updated to take the new redirect into account already. :-)

I suppose the open question should be: since enwiki contains plenty of non-English stop words in titles, should we find a way to handle those too? @dcausse is that easily possible?

Neat! That sounds good to solve the general problem as well.

kaldari added a comment. Edited May 6 2017, 12:01 AM

@EBernhardson, @dcausse: Following up on Erik's earlier comment, it looks like there are 30,070 pages on English Wikipedia that start with "La ". Is that enough to count as a significant stopword? What stopwords are we considering currently?

(I also searched for "El ", which had 16,856 matches.)

@kaldari it's hard to tell without further evaluation.
A few questions come to mind. How should we expand this list of stopwords: based on language (English + French + Spanish + ...) or based on something extracted from the data itself?
What's the impact? Sadly, this will come at a price: increasing the list of stopwords will obviously increase ambiguities at search time.
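
To make the "extracted from the data itself" option concrete, here is a minimal sketch that counts how often each word opens a multi-word title and keeps the frequent ones as candidates. The function names, the sample titles, and the `min_count` threshold are all hypothetical; this is not how the completion suggester builds its stopword list, and a real run would go over the full title dump.

```python
from collections import Counter


def leading_token_counts(titles):
    """Count how often each lowercased first word opens a multi-word title."""
    counts = Counter()
    for title in titles:
        words = title.split()
        if len(words) > 1:  # a one-word title has nothing left after stripping
            counts[words[0].lower()] += 1
    return counts


def candidate_stopwords(titles, min_count):
    """Return leading words seen at least min_count times, sorted by name."""
    counts = leading_token_counts(titles)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Hypothetical sample; not real enwiki statistics.
sample_titles = [
    "La Ruta Maya", "La La (Ashlee Simpson song)", "La Paz",
    "El Greco", "El Paso", "The Beatles", "Rita May (actress)",
]
print(candidate_stopwords(sample_titles, min_count=2))
```

Frequency alone only yields candidates: a word that often opens titles is not necessarily safe to strip, which is why the evaluation still matters.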

Note that here the main problem is not really the work to implement the idea but the evaluation itself.

I'll create a parent task to track this kind of work.

dcausse renamed this task from "Ruta Maya" fails to find "La Ruta Maya" in search suggestions to CompletionSuggester investigate expanding the list of stopwords. May 9 2017, 7:53 AM
dcausse updated the task description.
dcausse reopened this task as Open. May 9 2017, 7:58 AM
dcausse removed Deskana as the assignee of this task.

Re-opening as the scope of this task has now changed.

dcausse updated the task description. May 9 2017, 8:00 AM

sadly this will come at a price, increasing the list of stopwords will obviously increase ambiguities at search time.

@dcausse: Could you explain this a bit more? For example, how would adding "La" affect searches like:

  • "La La Ashlee Simpson" (currently matches "La La (Ashlee Simpson song)")
  • "La la la"
  • "La La's"
  • "LA Airport"
dcausse added a comment. Edited May 9 2017, 4:30 PM

@kaldari in your examples, searching for:

  • "Ashlee Simpson" will match "La La Ashlee Simpson"
  • "La la la" won't use the stopword-filtered index (very similar to searching for "to be or not to be" today)
  • "La La's" is very similar to the above
  • "Airport" will match "LA Airport"

When I say "will match", I mean that it will be an additional match: the ranking function will have to sort additional titles, which increases ambiguity.
The effect is hard to judge in advance, since new titles (found thanks to the new stopword list) may hide more interesting results that we showed before.
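
The "additional match" effect can be illustrated with a toy prefix matcher. This is only a sketch under simplifying assumptions: the real completion suggester is not a Python list scan, and `tokens`, `filtered`, `suggest`, the title list, and both stopword sets are made up for illustration. A title is suggested if the query is a prefix of either its plain form or its stopword-filtered form, and the filtered form is skipped when the query consists only of stopwords.

```python
import re


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def filtered(text, stopwords):
    """Join the tokens of text after dropping stopwords."""
    return " ".join(w for w in tokens(text) if w not in stopwords)


def suggest(query, titles, stopwords):
    """Return titles whose plain or stopword-filtered form starts with the query."""
    plain_q = " ".join(tokens(query))
    filtered_q = filtered(query, stopwords)
    hits = []
    for title in titles:
        plain_hit = " ".join(tokens(title)).startswith(plain_q)
        # A query made only of stopwords skips the filtered form entirely.
        filtered_hit = bool(filtered_q) and filtered(title, stopwords).startswith(filtered_q)
        if plain_hit or filtered_hit:
            hits.append(title)
    return hits


# Hypothetical data for illustration only.
TITLES = ["La Ruta Maya", "La La (Ashlee Simpson song)", "LA Airport", "Airport", "Ashlee Simpson"]
ENGLISH = {"the", "a", "an", "of"}
EXPANDED = ENGLISH | {"la", "el"}

print(suggest("Ruta Maya", TITLES, ENGLISH))   # no prefix match in this toy model
print(suggest("Ruta Maya", TITLES, EXPANDED))  # "La Ruta Maya" becomes reachable
print(suggest("Airport", TITLES, EXPANDED))    # "LA Airport" is now an extra candidate
```

In this toy model the expanded list fixes "Ruta Maya" but also adds "LA Airport" as a competitor for "Airport", which is the kind of extra work for the ranking function described above.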

This is all about evaluation. We have some tools to quantitatively evaluate the magnitude of the change (based on a random set of queries). If the change is small, it's probably safe to move forward. If the change is large, we will have to manually review some samples to see whether the effect is positive or negative.
We don't yet have tools to do qualitative evaluations of autocomplete results.

As an example, here is what I did for another feature request regarding autocomplete search: https://phabricator.wikimedia.org/T145427#2660006
There I ran a set of sample queries to compare the existing and new behavior. The effect was huge; I manually reviewed a sample and the results seemed to be mostly positive.
I think we need to do something similar here: decide on the set of new stopwords for enwiki and run an evaluation to see how to move forward.
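
A rough sketch of what such a quantitative comparison could look like: run the same sample of queries through the old and new behavior and report the fraction whose top-k suggestions differ, keeping the changed queries for manual review. This is not the team's actual tooling; `changed_fraction`, the cutoff `k`, and the canned results below are assumptions made up for illustration.

```python
def changed_fraction(queries, old_suggest, new_suggest, k=5):
    """Return (fraction of queries whose top-k lists differ, the changed queries)."""
    if not queries:
        return 0.0, []
    changed = [q for q in queries if old_suggest(q)[:k] != new_suggest(q)[:k]]
    return len(changed) / len(queries), changed


# Canned, hypothetical results standing in for the two suggester configurations.
OLD = {
    "ruta maya": ["Rita May (actress)"],
    "airport": ["Airport"],
    "rita": ["Rita May (actress)"],
}
NEW = {
    "ruta maya": ["La Ruta Maya", "Rita May (actress)"],
    "airport": ["Airport", "LA Airport"],
    "rita": ["Rita May (actress)"],
}

sample_queries = ["ruta maya", "airport", "rita"]
fraction, to_review = changed_fraction(
    sample_queries,
    lambda q: OLD.get(q, []),
    lambda q: NEW.get(q, []),
)
print(f"{fraction:.0%} of queries changed; review manually: {to_review}")
```

The metric only measures how much changed, not whether the change is good, so a large value would still call for the manual sample review described above.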

debt triaged this task as Medium priority. May 11 2017, 5:08 PM
debt moved this task from needs triage to later on... on the Discovery-Search board.