[User] Search results should be better prioritized
Closed, Resolved (Public)

Description

Search results seem to be hit or miss for how they're prioritized. For example, consider the following queries and results in Android and mobile web:

  1. "red": #3 Breast reduction
  2. "dog": #12 Feces
  3. "run": #3 Sex and nudity in videogames

With the exception of #3, I believe these are full text searches. In some cases, such as those above, it seems like we're surfacing very popular article content that has some coincidental but extremely loose association to the query.

Event Timeline

Niedzielski raised the priority of this task from to Needs Triage.
Niedzielski updated the task description.

To this, I'll also add:
When typing a single letter into the search box, the results seem very oddly prioritized. For example, when I search for "C", the first three results are:

  • C
  • Chlaenius
  • Habeas corpus petitions of Guantanamo Bay detainees

Off the top of my head, here are some better results that we can be showing for "C":

  • C (musical note)
  • C (programming language)
  • C (chemical element)
  • c (speed of light)
  • C (vitamin)

Actually, all these things appear in the disambiguation page for C.
So... show that.

I'm not sure how to file this ticket; this isn't an actionable feature request, just a general "it's not what I wanted" ticket.

@Niedzielski: What is the relation of this task to Wikimedia-Developer-Summit-2016? Is this a talk proposal?

@EBernhardson, @Aklapper: I asked the guys to file new tasks for stuff for the search dev summit topic. I think that's where this is coming from. The question is: could search endpoints automagically merge titles into result output using a set of heuristics that is aware of wiki conventions like disambiguation? Or does tuning of the page ranking algorithm achieve the same sort of "expected" behavior? I think the event logging you're looking at, which maps search phrases to most-clicked links, can help tune the result sets, so maybe that gets us there... as long as users actually dig deep enough to reach the very thing they were looking for, which would then be more sensibly ranked higher.

Presently, the apps have a heuristic where they do prefix searching and, if that returns too few results, fall back to full-text searching to augment the result set. Also, when a did-you-mean suggestion is present, it is shown in the apps. Similarly, would it make sense for an orchestration layer to just fire that did-you-mean suggestion as search terms and fold the results into the result set somehow?
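For illustration, the prefix-then-full-text heuristic described above could look roughly like the sketch below. This is not the apps' actual code: the function name, the `min_results` threshold, and the toy backends are all assumptions made up for the example.

```python
from typing import Callable, List


def search_with_fallback(
    query: str,
    prefix_search: Callable[[str], List[str]],
    fulltext_search: Callable[[str], List[str]],
    min_results: int = 5,  # hypothetical threshold, not the apps' real value
) -> List[str]:
    """Return prefix matches, topped up with full-text matches when too few."""
    results = list(prefix_search(query))
    if len(results) >= min_results:
        return results
    seen = set(results)
    for title in fulltext_search(query):
        # Append full-text hits after the prefix hits, skipping duplicates.
        if title not in seen:
            seen.add(title)
            results.append(title)
    return results


# Toy data standing in for the real search endpoints (invented for the example).
TITLES = ["Red", "Red panda", "Mars", "Rose"]
TEXT = {"Red": "a color", "Mars": "the red planet", "Rose": "often a red flower"}


def toy_prefix(query: str) -> List[str]:
    return [t for t in TITLES if t.lower().startswith(query.lower())]


def toy_fulltext(query: str) -> List[str]:
    return [t for t in TITLES if query.lower() in TEXT.get(t, "").lower()]


print(search_with_fallback("red", toy_prefix, toy_fulltext, min_results=3))
```

Note that the full-text hits land below the prefix hits regardless of how relevant they are, which is one way loosely related articles can end up in the same list as title matches.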

Everything done in CirrusSearch in the past to adjust rankings has been by trying to integrate features of the pages, but as you see it has worked incredibly poorly. Google, Bing, DuckDuckGo and whoever else don't have an intrinsic understanding of the web pages they are surfacing results for but return *much* better results.

My intuition is that this is because we are overfitting to the specific individual cases we look at, such as incoming links, redirects, and disambiguation. We have been trying to shift our focus for search relevancy from integrating individual aspects of pages to doing large-scale data processing. For example, the number of incoming links to an article currently plays a big part in its scoring, but we know that is a naive way to judge the quality of a page. To that end we are working on integrating proper PageRank (à la Larry and Sergey). Additionally, having matching redirects plays a big part. Almost certainly "Breast reduction" comes up for "red" because there are many, many redirects coming into Breast reduction and almost all of them have "reduction" as the beginning of either the first or second word of the redirect.
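To make the redirect effect concrete, here is a rough sketch of counting how many redirects into a page have a word starting with the query. It is only an illustration of the idea, not how CirrusSearch actually scores pages. "Reduction mammaplasty" is the redirect mentioned in this task; the other two redirect titles and the function name are invented for the example.

```python
from typing import List


def count_prefix_matching_redirects(query: str, redirects: List[str]) -> int:
    """Count redirect titles in which at least one word starts with the query."""
    q = query.lower()
    return sum(
        any(word.startswith(q) for word in title.lower().split())
        for title in redirects
    )


# Redirects pointing at "Breast reduction". Only the first title comes from
# this task; the other two are made-up placeholders.
redirects_to_breast_reduction = [
    "Reduction mammaplasty",
    "Breast reduction surgery",
    "Mammary reduction",
]

print(count_prefix_matching_redirects("red", redirects_to_breast_reduction))
print(count_prefix_matching_redirects("dog", redirects_to_breast_reduction))
```

If a ranker rewards every matching redirect, a page with many similarly worded redirects collects a lot of credit for a short prefix like "red", even though the page itself has little to do with the query.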

In addition to PageRank, we are also working up plans to integrate page view statistics along with search result click-throughs into the scoring mechanics.

IMHO this problem is due to the fact that the Android app does not distinguish "search as you type" from "search"; it's an all-in-one search.

So if the user searches for red, he has no way to tell the app that the query is finished and that he does not care about page titles starting with red.
Here, red matches as a prefix of Reduction mammaplasty.
If you search for red using special search, you'll get only relevant results for red.

For dog, it will match the redirect Dog feces, which redirects to Feces; on special search, Feces is not on the first page.

For run, it will match the redirect Run and rape video game, which points to Sex and nudity in video games; with special search, this page is ranked #8.

So to sum up: yes, scoring (as mentioned by Erik) is not really optimal for prefix search (search as you type).

But I think the most important problem is the fact that we mix prefix search (search as you type) and full text search in the same query.

  • full text search makes sense when you want to see the most relevant results according to the words you typed.
  • prefix search completely ignores the Lucene scoring functions (they make no sense for a prefix, since a prefix always has exactly 1 occurrence in the content) and relies only on incoming links to sort results (this is where we are experimenting with new algorithms).
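As a rough sketch of the second bullet, prefix search reduced to "filter by title prefix, sort by incoming links" might look like this. The page list and link counts are invented toy values, and the function is a simplification for illustration, not the actual CirrusSearch implementation.

```python
from typing import Dict, List


def prefix_search(query: str, pages: List[Dict]) -> List[str]:
    """Keep titles starting with the query, ordered by incoming links only."""
    q = query.lower()
    matches = [p for p in pages if p["title"].lower().startswith(q)]
    # No text relevance is involved: the only ranking signal is link count.
    ranked = sorted(matches, key=lambda p: p["incoming_links"], reverse=True)
    return [p["title"] for p in ranked]


# Toy index; the incoming link counts are made up for the example.
PAGES = [
    {"title": "Red", "incoming_links": 120},
    {"title": "Reduction mammaplasty", "incoming_links": 300},
    {"title": "Redwood", "incoming_links": 50},
    {"title": "Dog", "incoming_links": 500},
]

print(prefix_search("red", PAGES))
```

With these toy numbers the exact title match "Red" is outranked by a longer title that merely shares the prefix, because nothing in the ordering rewards the exact match.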

@Qgil, T113540, "What can the Search API do for you?" is the summit card. This task, T114896, has been set up as "blocking" to relate it to the summit task. Does that make sense?

Totally, we are just trying to keep the Wikimedia-Developer-Summit-2016 project tidy, and some people are trying to squeeze session proposals after the deadline. Sorry for the confusion.

Part of the problem in this case is the fact that redirects are resolved in the app. If you search for "red" using prefixsearch, the third result is [[Reduction mammaplasty]] which is a redirect to [[Breast reduction]], causing the somewhat strange result.

The reason I say "in this case" above is that there are cases where resolving redirects results in a much better experience. When searching for "borack obama", for example, it's better to show the user the correct title rather than the typo. Basically, it's the difference between redirects due to syntactic errors (like spelling errors) and redirects from alternative names. Based on my personal experience, resolving redirects generally solves far more problems than it causes (like this one).
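A minimal sketch of what resolving redirects in the client amounts to, covering both the helpful case and the confusing one. The redirect table is a toy dictionary and the function name is made up; the real app resolves redirects through the API rather than a local map.

```python
from typing import Dict, List


def resolve_redirects(titles: List[str], redirects: Dict[str, str]) -> List[str]:
    """Replace each redirect title with its target, dropping duplicate targets."""
    resolved = []
    seen = set()
    for title in titles:
        target = redirects.get(title, title)  # non-redirects map to themselves
        if target not in seen:
            seen.add(target)
            resolved.append(target)
    return resolved


# Toy redirect table built from the two examples discussed in this task.
REDIRECTS = {
    "Reduction mammaplasty": "Breast reduction",  # alternative name
    "Borack obama": "Barack Obama",               # spelling error
}

print(resolve_redirects(["Red", "Redwood", "Reduction mammaplasty"], REDIRECTS))
print(resolve_redirects(["Borack obama"], REDIRECTS))
```

The first call shows the surprising outcome: the user typed "red" and sees "Breast reduction" with no visible connection to the query. The second call shows the case where hiding the misspelled redirect title is clearly an improvement.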

The issue that @dcausse mentioned is probably more like the root cause, what I'm saying here is a surface problem.

I was testing this out this morning on my iPhone and discovered that the search results seem to be much better:

"red" doesn't have the breast reduction in the top listing
"dog" has feces now showing up at #13 position on the list
"run" doesn't have the sex and nudity in video games in the list

Is it ok to close this out?

@debt, nice! It looks better for the three examples given in the description, but I still think @Dbrant's example deserves some consideration before closing this.

Out of curiosity what was the reason to resolve redirects?
Was it to avoid displaying awful redirects with typos?

Deskana triaged this task as Medium priority.

Given that the primary use cases were fixed, I'm calling this one resolved.

Generally, knowing a few queries that produce suboptimal results is interesting, but it's not really easy for us to fix specific examples without putting extremely specific hacks into the code, which will cause more pain than gain in the long run. Whenever you tweak some parameter trying only to make one specific query better, you tend to make a ton of other ones worse. The long and short of it is that there is a limited amount we can do here.