
Wildcard queries should rank results correctly
Open, Low, Public

Description

When I search for catapult, the article with Catapult in the title is the first result. If I search for catapul*, this article should also be the first result.

This may be because wildcard/prefix queries do not compute a score for the matched terms (the rewrite method top_terms_boost simply hardcodes a score of 1 for each matched term).

Event Timeline

dcausse created this task. Jun 26 2015, 12:54 PM
dcausse raised the priority of this task from to Low.
dcausse updated the task description.
dcausse added a subscriber: dcausse.
Restricted Application added a project: Discovery. Jun 26 2015, 12:54 PM
Restricted Application added a subscriber: Aklapper.
dcausse renamed this task from Using wildcards should rank results correctly to Wildcard queries should rank results correctly. Jun 26 2015, 12:55 PM
dcausse set Security to None.
EBernhardson added a subscriber: EBernhardson. Edited Jul 1 2015, 5:11 PM

A few tweaks to the issued queries seem to agree with the premise of the ticket: after removing the rescoring done for incoming links, a query for catapul? returns the same score for every result:

{
  "title": "Catapult/adsf",
  "score": 0.7649317
}
{
  "title": "Two Words",
  "score": 0.7649317
}
{
  "title": "Links To Catapult",
  "score": 0.7649317
}
{
  "title": "Catapult",
  "score": 0.7649317
}
{
  "title": "Amazing Catapult",
  "score": 0.7649317
}

I'm not sure what a proper solution is, though: is it to add scoring to wildcard matches, or some sort of rescoring?

EBernhardson added a comment. Edited Jul 1 2015, 5:26 PM

It looks like the wildcard matches can be scored by using the scoring_boolean rewrite method on the query_string query. The ES docs indicate this uses non-trivial CPU time, but I'm not sure how that compares to doing this in a rescore. Additionally, this would require the PHP side to notice that a wildcard query is being issued and swap the rewrite method out from top_terms_boost_1024.

{
  "title": "Catapult",
  "score": 9.552054
}
{
  "title": "Catapult/adsf",
  "score": 8.014623
}
{
  "title": "Links To Catapult",
  "score": 7.8303537
}
{
  "title": "Amazing Catapult",
  "score": 7.8303537
}
{
  "title": "Two Words",
  "score": 0.85436165
}
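The rewrite swap described above could be sketched roughly as follows. This is only an illustration in Python, not the actual PHP code: the helper names has_wildcard and build_query_string are hypothetical, and the escape handling is simplified (a doubled backslash before the wildcard is not treated specially). The only Elasticsearch detail assumed is that query_string accepts a rewrite parameter.

```python
import re

# Matches a '*' or '?' that is not directly preceded by a backslash.
# Simplification: an escaped backslash followed by a wildcard is not handled.
_WILDCARD = re.compile(r"(?<!\\)[*?]")


def has_wildcard(query: str) -> bool:
    """Return True if the query text contains an unescaped wildcard."""
    return _WILDCARD.search(query) is not None


def build_query_string(query: str, default_rewrite: str = "top_terms_boost_1024") -> dict:
    """Build a query_string body, switching the rewrite for wildcard queries.

    Hypothetical helper: wildcard queries get scoring_boolean so that the
    matched terms are scored; everything else keeps the default rewrite.
    """
    rewrite = "scoring_boolean" if has_wildcard(query) else default_rewrite
    return {"query_string": {"query": query, "rewrite": rewrite}}


if __name__ == "__main__":
    print(build_query_string("catapul*"))
    print(build_query_string("catapult"))
```

Keeping the detection in one small function would make it easy to tighten later, for example to fall back to the default rewrite for very short prefixes.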

The problem with scoring_boolean is that it throws an exception when too many terms match the wildcard/prefix. A query like a*, or the ones listed in T88724, will hit a TooManyClauses exception.

This issue is very hard to fix and I'm not sure it's worth the effort. There are so many different use cases hidden behind prefix/wildcard queries:
Some use wildcards because the language is very hard to analyze and users have to work around that with wildcards (compound words in German).
Some use wildcards with phrases.
Some use wildcards and expect them to work like a SQL LIKE with %.

In short, MultiTermQuery subclasses (FuzzyQuery/PrefixQuery/WildcardQuery) are not designed to play well with the tf.idf similarity function used by Lucene.
See https://issues.apache.org/jira/browse/LUCENE-2557, which is about FuzzyQuery but shares some similarities with our problem,
or http://stackoverflow.com/questions/9632602/there-is-a-mismatch-between-the-score-for-a-wildcard-match-and-an-exact-match

What lsearchd did was use a dismax query on top of the matched terms: it also computes a score for each term (which is bad for performance if there are many terms) but keeps only the best term score as the score of the query. This is useful because a normal boolean query sums all the term scores, so when a multi-term query expands to many words, large documents will likely score higher than smaller ones, which is not fair.
Unfortunately there is no such rewrite method (with dismax). We could write one, but it would require upstream changes to Elasticsearch and Lucene.
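To make the sum-versus-max difference concrete, here is a toy sketch in Python. It is not how lsearchd or Lucene implement this; the per-term scores are invented for illustration, and the function names are hypothetical. The tie_breaker parameter mirrors the usual dismax formula (best score plus a fraction of the remaining scores) and defaults to 0, i.e. pure max.

```python
def boolean_score(term_scores):
    """Boolean-style combination: sum the scores of all matching expanded terms."""
    return sum(term_scores)


def dismax_score(term_scores, tie_breaker=0.0):
    """Dismax-style combination: keep the best term score, plus an optional
    fraction (tie_breaker) of the other matching term scores."""
    if not term_scores:
        return 0.0
    best = max(term_scores)
    return best + tie_breaker * (sum(term_scores) - best)


# Invented scores for the terms a wildcard expanded to, per document.
short_doc = [3.0]                  # small page, one strong match
long_doc = [1.0, 1.0, 1.0, 1.0]    # large page, many weak matches

if __name__ == "__main__":
    print("boolean:", boolean_score(short_doc), boolean_score(long_doc))
    print("dismax: ", dismax_score(short_doc), dismax_score(long_doc))
```

With these made-up numbers the boolean sum ranks the large page first, while the dismax combination ranks the small page with the single strong match first.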

On IRC, @dcausse noticed that https://gerrit.wikimedia.org/r/#/c/202230/ is the commit that broke the unit test. Also on IRC, dcausse suggested we could try using top_terms_N on wildcard queries that match some pattern (the number of characters preceding the wildcard?), but we need a way to measure the effectiveness of this change.
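One possible reading of the "number of characters preceding the wildcard" idea, sketched in Python. Everything here is an assumption for discussion: the helper names are hypothetical, and the threshold of 3 characters is arbitrary and would have to be validated by whatever effectiveness measurement we end up with.

```python
def literal_prefix_length(term: str) -> int:
    """Count the characters before the first '*' or '?' in a single term."""
    for i, ch in enumerate(term):
        if ch in "*?":
            return i
    return len(term)


def choose_rewrite(term: str, min_prefix: int = 3, n: int = 1024) -> str:
    """Pick a rewrite method name for one wildcard term.

    Assumption: a long enough literal prefix expands to few enough terms
    that scoring them (top_terms_N) is affordable; shorter prefixes keep
    the constant-score top_terms_boost_N behaviour.
    """
    if literal_prefix_length(term) >= min_prefix:
        return f"top_terms_{n}"
    return f"top_terms_boost_{n}"


if __name__ == "__main__":
    for term in ("catapul*", "a*", "*pult"):
        print(term, "->", choose_rewrite(term))
```

Under this sketch catapul* would be scored while a* and leading-wildcard terms like *pult would keep the current behaviour.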

Deskana added a subscriber: Deskana. Jul 2 2015, 4:45 PM

Backlogging this because I don't think it's too important. Scream at me if you disagree and we can talk. :-)