
Poorly tuned rankings
Closed, Resolved · Public

Description

 I just did a search for "newspapers of the United States" on the English Wikipedia. The article "Newspapers in the United States" exists, and should have been the first or second result, but isn't even on the first page:

https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=newspapers+of+the+United+States&searchToken=7sx379bntb9q6q9qi2h0lriv8

I had to make a redirect so people can find the correct article. The results seem poorly ranked. I assume it is a tuning issue, because this is a chronic problem: many searches feel like they should have given much better results. I recently optimized Solr queries for better ranking for an education search engine, so I might be of some technical assistance if someone wants to point me in the right direction for experimenting with tweaks. I have a hunch that whatever query is being run just needs to put more weight on article titles and do the equivalent of what Solr calls Phrase Fields.
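
For readers unfamiliar with the Solr side of that comparison, here is a minimal sketch of what title weighting plus Phrase Fields looks like in an eDisMax request. The field names `title` and `text` and the boost values are hypothetical; this illustrates the Solr parameters only and is not the query Cirrus actually runs.

```python
from urllib.parse import urlencode


def edismax_params(user_query, title_boost=5.0, phrase_boost=10.0):
    """Build request parameters for a Solr eDisMax query that favours titles.

    Field names and boost values are illustrative assumptions.
    """
    return {
        "q": user_query,
        "defType": "edismax",
        # qf: fields searched term by term, each with its own boost.
        "qf": f"title^{title_boost} text",
        # pf: extra boost when the query terms appear together as a phrase
        # in the listed field.
        "pf": f"title^{phrase_boost}",
    }


params = edismax_params("newspapers of the United States")
print(urlencode(params))
```

Note that `pf` only rewards documents in which all of the query terms appear in the field, so it may not help when the title differs from the query by one word.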

Event Timeline

Beland created this task. · May 4 2016, 6:15 AM
Restricted Application added projects: Discovery, Discovery-Search. · May 4 2016, 6:15 AM
Restricted Application added subscribers: Zppix, Aklapper.
dcausse added a subscriber: dcausse. · Edited · May 4 2016, 7:40 AM

Thanks for the feedback @Beland!
You're absolutely right, we have serious problems with words in the title. Most of the time, a perfect match on the redirects or title is what really helps to rank such articles at the top.
What we found so far is that it is mostly due to a performance hack which collapses all the fields into a single one (see T125083).
My favorite example query is "legend film 2015".
If you want to dig into more details you can extract various debug info from Cirrus:

  • the query: add the extra URI param cirrusDumpQuery, e.g. legend film 2015
  • the Lucene explanation: add cirrusDumpResult and cirrusExplain, e.g. legend film 2015
  • the results without the all-field performance hack: add cirrusUseAllFields=no, e.g. legend film 2015
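
The debug parameters listed above can be appended to an ordinary search URL. The sketch below builds such URLs for the English Wikipedia. The helper name `cirrus_debug_url` is made up for illustration, and passing the flag parameters with an empty value is an assumption about how they are read.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://en.wikipedia.org/w/index.php"


def cirrus_debug_url(search, **debug_flags):
    """Return a Special:Search URL with extra Cirrus debug parameters."""
    params = {"title": "Special:Search", "fulltext": "Search", "search": search}
    # Debug parameters go last so they are easy to spot in the URL.
    params.update(debug_flags)
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


query = "legend film 2015"
# Dump the query that Cirrus would send (empty flag value is an assumption).
dump_query_url = cirrus_debug_url(query, cirrusDumpQuery="")
# Dump the raw result together with the Lucene explanation.
explain_url = cirrus_debug_url(query, cirrusDumpResult="", cirrusExplain="")
# Run the search with the all-field performance hack disabled.
no_allfield_url = cirrus_debug_url(query, cirrusUseAllFields="no")

for url in (dump_query_url, explain_url, no_allfield_url):
    print(url)
```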

The current plan to resolve such issues is to switch to BM25 instead of the classic Lucene TF-IDF similarity, and probably to experiment with shingles on title/redirects. I don't know much about Solr phrase fields, but we already have a phrase rescore that is supposed to rank pages with a phrase match higher. Unfortunately, in this case the phrase did not match because one word ("of") differs from the title. I hope that shingles (ngram size 3) can help to mitigate this issue.
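
To illustrate why shingles might help where the phrase rescore does not, here is a toy sketch of word shingles of size 3 applied to the query and the title from this task. It is plain Python for illustration, not the Elasticsearch shingle filter, and the lowercase/whitespace tokenization is an assumption.

```python
def word_shingles(text, size=3):
    """Return the set of word n-grams (shingles) of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


query_text = "newspapers of the United States"
title_text = "Newspapers in the United States"

# The full query is not a phrase of the title, so a strict phrase match fails.
exact_phrase_match = query_text.lower() in title_text.lower()

# The two still share one 3-word shingle, which could earn a partial boost.
shared = word_shingles(query_text) & word_shingles(title_text)

print(exact_phrase_match)
print(shared)
```

The query and the title disagree on a single word, yet they share the shingle "the united states", so a shingle-based match can still reward the title.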

If you are willing to help, I'd suggest you join the discovery mailing list. I think it's the best place to be informed about the new features & experiments we are working on.

Thanks!

debt triaged this task as Low priority. · Jul 20 2016, 4:07 PM
debt moved this task from needs triage to later on... on the Discovery-Search board.
Deskana closed this task as Resolved. · Nov 10 2016, 3:31 AM
Deskana claimed this task.
Deskana added a subscriber: Deskana.

Our switch to BM25 fixed this problem! \o/