
Do not depend on spaces to activate proximity rescoring (phrase rescore)
Closed, Resolved · Public


The decision to activate the phrase rescore is based on the presence of a space in the query, which does not work for spaceless languages. Today we use QueryString with auto_generate_phrase_queries=true, so the problem does not matter much yet: the main query directly runs a phrase query when the query string is tokenized into multiple tokens.
But as soon as we drop query string support, the phrase rescore is unlikely to be activated for spaceless languages.
One approach suggested by Erik is to create a new Query in the extra plugin that acts as a "router", activating a particular subquery based on the number of tokens in the input query. This would let Elasticsearch use the existing Lucene analysis chain to count the tokens, without having to implement complex tokenizers in Cirrus.
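
To make the routing idea concrete, here is a minimal sketch in plain Python. It is not the plugin's code: the function names, the condition format, the subquery labels and the tiny dictionary-based analyzer are all hypothetical stand-ins for a real Lucene analysis chain. It only illustrates why counting analyzed tokens works where checking for a space does not.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration only.
Analyzer = Callable[[str], List[str]]
Condition = Tuple[Callable[[int], bool], str]  # (predicate on token count, subquery name)


def route_by_token_count(text: str, analyzer: Analyzer,
                         conditions: List[Condition], fallback: str) -> str:
    """Return the first subquery whose predicate accepts the token count."""
    count = len(analyzer(text))
    for predicate, subquery in conditions:
        if predicate(count):
            return subquery
    return fallback


def whitespace_analyzer(text: str) -> List[str]:
    """Mimics the space-based check: tokens are whitespace-separated words."""
    return text.split()


def toy_dictionary_analyzer(text: str) -> List[str]:
    """Toy stand-in for a language-aware analyzer: greedy longest match
    against a tiny made-up vocabulary, unknown characters become 1-char tokens."""
    vocab = {"意大利"}
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab or end == i + 1:
                tokens.append(text[i:end])
                i = end
                break
    return tokens


# Activate the phrase rescore only when the query has at least two tokens.
CONDITIONS: List[Condition] = [(lambda n: n >= 2, "phrase_rescore")]

if __name__ == "__main__":
    for analyzer in (whitespace_analyzer, toy_dictionary_analyzer):
        print(analyzer.__name__,
              route_by_token_count("意大利軍", analyzer, CONDITIONS, "no_rescore"))
```

With the whitespace analyzer the spaceless query falls through to the fallback, while the language-aware analyzer sees two tokens and routes to the phrase rescore.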

Event Timeline

dcausse created this task. Dec 1 2016, 1:04 PM
Restricted Application added a project: Discovery. Dec 1 2016, 1:04 PM
Restricted Application added a subscriber: Aklapper.
Deskana triaged this task as Normal priority. Dec 8 2016, 11:10 PM
Deskana added a subscriber: Deskana.

Given our generally bad support for spaceless languages, this seems relatively important.

dcausse claimed this task. Dec 22 2016, 4:03 PM

Will start working on this, as it may help Erik with his LTR plugin.

Change 330147 had a related patch set uploaded (by DCausse):
Add token_count_router

Change 330147 merged by jenkins-bot:
Add token_count_router

Should we write the support for this into the es5 branch, or close this ticket and create another one to adjust our query building to use it?

dcausse changed the task status from Open to Stalled. Feb 1 2017, 2:08 PM

I was planning to wait for es5. Marking as stalled.

Started adding the Cirrus integration. Sadly, the deployed version of the plugin is not working yet, so I am moving this back to the backlog until the fix is deployed.

Change 345585 had a related patch set uploaded (by DCausse):
[mediawiki/extensions/CirrusSearch@master] Add support for token_count_router

Change 345585 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add support for token_count_router

Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Mass-moving all items tagged for MediaWiki 1.30.0-wmf.3, as that was never released; instead, we're using -wmf.4.

Change 359183 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/mediawiki-config@master] Enable token_count_router for cirrus queries

Change 359183 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable token_count_router for cirrus queries

This is deployed and, as far as I can tell from the explain output, it seems to be working. It's a bit hard to tell because the explain is so verbose, but the score includes a weight(all:"foo bar"~1 in .... clause with a phraseFreq param which, after some testing, I'm pretty sure is our phrase query being executed. As expected, an English query without any spaces to tokenize on does not have this score component.

Additionally, against zhwiki, the query 意大利軍 (a string copied from the main page, translating roughly, via Google Translate, to "Italian army"), which tokenizes into 意大利 - 军, also triggers the phrase matching. Dropping the last character, which leaves a single token, does not seem to trigger phrase matching.
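
Since the explain output is verbose, a small helper can do the search for the phraseFreq marker. The sketch below assumes the explanation is the usual nested structure of "value", "description" and "details" entries; the helper names are hypothetical, and the sample tree is hand-written and heavily abbreviated for illustration, not real output from the cluster.

```python
from typing import Any, Dict, Iterator


def iter_descriptions(node: Dict[str, Any]) -> Iterator[str]:
    """Yield the description of every node in an explain tree, depth first."""
    yield node.get("description", "")
    for child in node.get("details") or []:
        yield from iter_descriptions(child)


def phrase_query_ran(explanation: Dict[str, Any], marker: str = "phraseFreq") -> bool:
    """True if any description in the tree mentions the marker."""
    return any(marker in text for text in iter_descriptions(explanation))


# Hand-written, abbreviated sample trees (assumed shape, made-up values).
with_phrase = {
    "value": 1.0,
    "description": "sum of:",
    "details": [
        {
            "value": 1.0,
            "description": 'weight(all:"foo bar"~1 in ...)',
            "details": [
                {"value": 1.0, "description": "phraseFreq=1.0", "details": []},
            ],
        },
    ],
}
without_phrase = {
    "value": 1.0,
    "description": "weight(all:foobar in ...)",
    "details": [{"value": 1.0, "description": "termFreq=1.0", "details": []}],
}

if __name__ == "__main__":
    print(phrase_query_ran(with_phrase), phrase_query_ran(without_phrase))
```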

debt closed this task as Resolved. Jun 30 2017, 9:24 PM