Search for acronyms is not properly handled by the FullTextSimpleMatchQueryBuilder query builder
Closed, Resolved, Public

Description

Searching for acronyms (e.g. N.A.S.A) returns bad results with the new query builder implemented for BM25.

The analysis chain has been changed to include the word_break_helper in https://gerrit.wikimedia.org/r/#/c/164150/
This was done to address problems like T42612 and T64733.
Unfortunately, breaking words on dots also breaks acronyms. The QueryString builder handles acronyms by working on term positions, thanks to the auto_generate_phrase_queries query string option.
Relying on term positions for acronyms is imo a bad solution:

  • It generates unexpected phrase queries that are impossible to control inside cirrus. For acronyms, these phrases can consist of extremely high-frequency words (A.A.A).
  • Scoring for phrases is generally sub-optimal. Term weighting is usually based on index-time statistics (docFreq), but for phrases this value is unknown at rewrite time and is approximated as the sum of the IDFs of the phrase terms.
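
To make the first point concrete, here is a minimal Python sketch of how breaking on dots turns one query chunk into several terms, which a query parser with auto-generated phrase queries then treats as a phrase. The helpers `break_words`, `analyze` and `to_clauses` are hypothetical stand-ins written for illustration; they are not the actual word_break_helper or QueryString implementation, and the real analysis chain does more than this.

```python
def break_words(text):
    """Hypothetical stand-in for word_break_helper: treat dots as spaces."""
    return text.replace(".", " ")


def analyze(chunk):
    """Analyze one whitespace-delimited query chunk into lowercase terms."""
    return break_words(chunk).lower().split()


def to_clauses(query):
    """Mimic the idea behind auto-generated phrase queries (simplified).

    A chunk that analyzes to several terms becomes a phrase clause,
    a chunk that analyzes to one term becomes a plain term clause.
    """
    clauses = []
    for chunk in query.split():
        terms = analyze(chunk)
        if len(terms) > 1:
            clauses.append(("phrase", terms))
        elif terms:
            clauses.append(("term", terms))
    return clauses


if __name__ == "__main__":
    # The acronym silently becomes a phrase of single-letter terms.
    print(to_clauses("N.A.S.A budget"))
    print(to_clauses("A.A.A"))
```

In this sketch the user never asked for a phrase query, yet "A.A.A" ends up as a phrase made of the same very common single-letter term repeated three times.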

Imo we should not rely on a system that can run phrase queries unexpectedly, and acronyms should be handled at the term level.
Unfortunately, working at the term level (tokenization) is not easy, esp. if we don't want to regress on T42612 and T64733.

Proposed quick fix/hack

I suggest experimenting with re-using QueryString in the FullTextSimpleMatchQueryBuilder's "all filter". This would let us keep using the auto_generate_phrase_queries feature.
I'm really sad that we need to do this, because QueryString is something I'd really like to get rid of, but I don't see other options that could be implemented in a reasonable amount of time.

Event Timeline

dcausse created this task. Aug 22 2016, 10:11 AM
Restricted Application added projects: Discovery, Discovery-Search. Aug 22 2016, 10:11 AM
Restricted Application added a subscriber: Aklapper.
debt triaged this task as Normal priority. Aug 22 2016, 10:14 PM

Change 306929 had a related patch set uploaded (by DCausse):
Fallback to QueryString if we detect acronyms

https://gerrit.wikimedia.org/r/306929

Using QueryString with the "allfilter" does not seem to be enough.
I decided to fall back to QueryString if we detect acronyms or words explicitly broken by the wordbreaker.
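
The fallback decision can be sketched roughly as follows. This is only an illustration in Python, not the code from the patch: the pattern, the `BREAK_CHARS` set (the task only mentions dots; the underscore is an assumption) and the `choose_builder` function with its return values are all hypothetical.

```python
import re

# Assumption: characters the word breaker splits on. Only the dot is
# mentioned in this task; the underscore is included for illustration.
BREAK_CHARS = "._"

# A break character sitting between two word characters, e.g. "N.A" or "a_b".
BROKEN_WORD = re.compile(r"\w[" + re.escape(BREAK_CHARS) + r"]\w")


def choose_builder(query):
    """Pick a (hypothetical) builder name for the given user query.

    Queries that contain an acronym or a word the word breaker would
    split go to the QueryString path, everything else to simple match.
    """
    if BROKEN_WORD.search(query):
        return "query_string"
    return "simple_match"


if __name__ == "__main__":
    for q in ["N.A.S.A", "nasa budget", "End of sentence. Next one"]:
        print(q, "->", choose_builder(q))
```

A dot followed by whitespace or the end of the query does not trigger the fallback in this sketch, so ordinary sentence punctuation stays on the simple match path.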

Change 306929 merged by jenkins-bot:
Fallback to QueryString if we detect acronyms

https://gerrit.wikimedia.org/r/306929

Change 307261 had a related patch set uploaded (by DCausse):
Fallback to QueryString if we detect acronyms

https://gerrit.wikimedia.org/r/307261

Change 307261 merged by jenkins-bot:
Fallback to QueryString if we detect acronyms

https://gerrit.wikimedia.org/r/307261

Mentioned in SAL [2016-08-29T13:10:58Z] <hashar@tin> Synchronized php-1.28.0-wmf.16/extensions/CirrusSearch/includes/Query/FullTextSimpleMatchQueryBuilder.php: Fallback to QueryString if we detect acronyms T143541 (duration: 00m 50s)

debt closed this task as Resolved. Sep 1 2016, 8:52 PM