
Add DEFAULTSORT keys to wiki search autocomplete
Closed, Resolved (Public)

Description

Hello. I'd like to suggest a new feature for the wiki search mechanism. I know it may be very hard to implement and could take years, but it would be very useful and very powerful.
When we search for a person, we have to know their exact first name; the last name alone is not enough. Sometimes we know only the last name, and then autocomplete does not work. On ruwiki, for example, this was solved by creating a redirect with the DEFAULTSORT name for each person, but that is not cheap. I suggest adding the articles' DEFAULTSORT keys to the autocomplete index, ranked even before the guesses, so that searching "Bush" will suggest "George Bush" (via "Bush, George") as the second result, even before "Bushery". Thank you.

Event Timeline

I think it's a very good idea and it should be relatively easy to implement in the completion suggester (slightly more difficult in the old prefixsearch but far from impossible).

This will address the lack of redirects for some entities, and also the ranking issues we see where the page title does not start with a word representative of the entity, e.g. Republic of Ireland has the defaultsort Ireland, Republic of.
Using defaultsort as a new blind input text will allow us to use the score of the page instead of the score of the redirect.
It will lead to the following behavior:
Searching for Ireland will probably suggest Republic of Ireland in the top 3. If we are OK with suggesting strings that can be completely different from the search query, then I think this solution should be implemented.

Benefits are:

  • Suggest more entities that have a defaultsort and no redirects
  • Fix the odd ranking issue where page titles start with non-representative words

Drawbacks:

  • if defaultsort is used with unrelated/non-obvious text it could be very confusing
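
The behavior described above can be sketched with a small in-memory example. This is only an illustration under stated assumptions, not CirrusSearch or completion suggester code: the `Page` class, the `build_inputs` and `suggest` functions, and all scores are hypothetical. It shows the defaultsort key being indexed as a "blind" input (matched but never displayed) that inherits the page's own score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    title: str
    score: float                       # hypothetical page-level score
    defaultsort: Optional[str] = None  # DEFAULTSORT key, if the page sets one


def build_inputs(pages):
    """Return (input_text, page) pairs: the title plus the defaultsort key, if any."""
    inputs = []
    for page in pages:
        inputs.append((page.title.lower(), page))
        if page.defaultsort and page.defaultsort.lower() != page.title.lower():
            # Blind input: used for matching only, the title is what gets shown.
            inputs.append((page.defaultsort.lower(), page))
    return inputs


def suggest(inputs, query, limit=3):
    """Prefix-match the query, rank by page score, return the displayed titles."""
    q = query.lower()
    best = {}
    for text, page in inputs:
        if text.startswith(q):
            # Keyed by title, so a page matched twice is suggested only once.
            best[page.title] = page.score
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:limit]]


# Made-up scores, for illustration only.
pages = [
    Page("Ireland", 90.0),
    Page("Republic of Ireland", 80.0, defaultsort="Ireland, Republic of"),
    Page("Ireland (disambiguation)", 5.0),
]

print(suggest(build_inputs(pages), "Ireland"))
# ['Ireland', 'Republic of Ireland', 'Ireland (disambiguation)']
```

Without the defaultsort input, "Republic of Ireland" would not be a prefix match for "Ireland" at all. The drawback listed above is also visible here: the suggested title can look unrelated to what was typed.
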
debt triaged this task as Medium priority. Jul 18 2016, 10:00 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added subscribers: TJones, debt.

@TJones did a bit of research and found:

I think we prefer prefix matches over others. At least in the US, we tend to prefer "Firstname Lastname" when referring to people. So for a query that starts with "Ferre", "Will Ferrell" isn't an awesome match.

Russian Wikipedia, however, prefers to list people as "Lastname, Firstname", so searching for the equivalent there, "Феррел", gets "Феррелл, Уилл" ("Ferrell, Will") as the second match, behind an exact match for a town called "Феррел" / "Ferrel".

Google has similar behavior with their auto-complete in English (see attached screenshot).

OTOH, Google.ru completes "Феррел" with "Ferrell Will filmography" (in Russian, of course) as the second suggestion.

ferrel-search.png (155×219 px, 7 KB)

ferrel-search-ru.png (136×276 px, 7 KB)

It could be helpful for English users too, when they do not remember or know the first name.

I have some additional observations.

  • "Bush, George W." is already indexed (as a redirect) and it doesn't come up in the auto-complete until you get to "Bush, G"; so that's a ranking/sorting issue, which also can be very tough to manipulate.
  • It's not clear that everything has to be fixed in the auto-complete. Single-word searches are often ambiguous, and it's not possible for all reasonable interpretations to be at the top of every list—there just isn't room.
    • Searching for "Bush" or "Ferrell" in the upper right search box both bring up disambiguation pages with both George Bushes and Will Ferrell, respectively.
    • Searching for "Bush" on the Special:Search page gives the disambiguation page and the two George Bushes in the top three results. Searching for "Ferrell" gives Will Ferrell as the fourth result.

Change 307295 had a related patch set uploaded (by DCausse):
Add DEFAULTSORT to search index field data

https://gerrit.wikimedia.org/r/307295

Change 307297 had a related patch set uploaded (by DCausse):
Add support for FLAG_SOURCE_DATA and defaultsort in completion suggester

https://gerrit.wikimedia.org/r/307297

I've added all the necessary code to experiment with defaultsort data.
I'd suggest building some experimental indices before activating it on production wikis. Possible drawbacks:

  • bad suggestions when defaultsort is set to a non representative text
  • hides too many suggestions

A test index should help to discover obvious problems, but replaying queries from search logs to detect if the result chosen is now hidden should really help to see any negative impact.
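
The replay idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the log format (query, clicked title), the `find_hidden_clicks` helper, and the two lookup tables standing in for the production and experimental suggesters. The titles and results are made up.

```python
def find_hidden_clicks(log, baseline, experimental, limit=10):
    """Return logged (query, clicked) pairs that the baseline showed
    in its top `limit` but the experimental index no longer shows."""
    hidden = []
    for query, clicked in log:
        shown_before = clicked in baseline(query)[:limit]
        shown_now = clicked in experimental(query)[:limit]
        if shown_before and not shown_now:
            hidden.append((query, clicked))
    return hidden


# Stand-ins for the two suggesters (hypothetical results).
baseline_results = {
    "bush": ["Bush", "Bushery"],
    "ferre": ["Ferret", "Ferrero"],
}
experimental_results = {
    "bush": ["Bush", "George Bush"],   # defaultsort match pushes "Bushery" out
    "ferre": ["Ferret", "Ferrero"],
}

# Hypothetical log of (query, title the user picked).
log = [("bush", "Bushery"), ("ferre", "Ferret"), ("bush", "Bush")]

hidden = find_hidden_clicks(
    log,
    lambda q: baseline_results.get(q, []),
    lambda q: experimental_results.get(q, []),
    limit=2,
)
print(hidden)
# [('bush', 'Bushery')]
```

The share of hidden clicks across a large log would give a rough measure of how many previously chosen suggestions the defaultsort inputs displace.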

Change 307295 merged by jenkins-bot:
Add DEFAULTSORT to search index field data

https://gerrit.wikimedia.org/r/307295

Change 307297 merged by jenkins-bot:
Add support for FLAG_SOURCE_DATA and defaultsort in completion suggester

https://gerrit.wikimedia.org/r/307297

debt claimed this task.