Searching for acronyms (e.g. N.A.S.A) returns poor results with the new query builder implemented for BM25.
The analysis chain has been changed to include the word_break_helper in https://gerrit.wikimedia.org/r/#/c/164150/
This was done to address problems like T42612 and T64733.
Unfortunately, breaking words on dots also breaks acronyms. The QueryString builder handles acronyms by working on term positions, thanks to the auto_generate_phrase_queries query string option.
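To make the mechanism concrete, here is a rough Python sketch of what happens to an acronym. It is a simplified model, not the actual analysis chain: it assumes word_break_helper just maps dots and underscores to spaces, that tokenization is a lowercase whitespace split, and that auto_generate_phrase_queries turns any whitespace-delimited chunk producing more than one token into a phrase query. The function names are made up for illustration.

```python
# Assumption: the char filter maps these characters to a space before tokenizing.
WORD_BREAK_CHARS = "._"


def word_break_helper(text):
    """Simplified stand-in for the char filter: replace break characters with spaces."""
    return "".join(" " if ch in WORD_BREAK_CHARS else ch for ch in text)


def analyze(text):
    """Return (position, token) pairs after the char filter, lowercased."""
    return list(enumerate(word_break_helper(text).lower().split()))


def build_clause(chunk):
    """Model of auto_generate_phrase_queries for one whitespace-delimited chunk.

    If analysis yields several tokens for a single chunk, the chunk becomes
    a phrase query over those tokens; otherwise it stays a plain term query.
    """
    tokens = [tok for _, tok in analyze(chunk)]
    if len(tokens) > 1:
        return ("phrase", tokens)
    return ("term", tokens)


print(analyze("N.A.S.A"))
print(build_clause("N.A.S.A"))
print(build_clause("nasa"))
```

In this model N.A.S.A only matches as an acronym because the four single-letter tokens sit at consecutive positions. Drop the phrase generation and the query degrades to four independent, very common terms.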
Relying on term positions for acronyms is imo a bad solution:
- It generates unexpected phrase queries that are impossible to control inside cirrus. For acronyms, these phrases can consist of extremely high-frequency words (A.A.A).
- Scoring for phrases is generally sub-optimal. Word weighting is usually based on index-time statistics (docFreq); for a phrase this value is unknown at rewrite time, so it is approximated as the sum of the IDFs of the phrase's terms.
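The scoring point can be illustrated with a small sketch. It assumes the BM25 IDF formula log(1 + (N - n + 0.5) / (n + 0.5)) and uses invented document frequencies, so the numbers only show the direction of the error, not real index statistics.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25 inverse document frequency for a single term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def phrase_idf(term_doc_freqs, doc_count):
    """Approximate a phrase's IDF as the sum of its terms' IDFs.

    The real docFreq of the phrase is not available at rewrite time,
    so this sum is what stands in for it.
    """
    return sum(bm25_idf(df, doc_count) for df in term_doc_freqs)


# Hypothetical numbers: 1M documents, "a" occurs in 900k of them,
# while the exact phrase "a a a" occurs in only 50.
DOC_COUNT = 1_000_000
approx = phrase_idf([900_000, 900_000, 900_000], DOC_COUNT)
actual = bm25_idf(50, DOC_COUNT)

print(f"approximated phrase idf: {approx:.3f}")
print(f"idf from the phrase's own docFreq: {actual:.3f}")
```

With these made-up figures the approximated weight is far below what the phrase's own document frequency would give, so a rare acronym built from common letters ends up weighted like a stop word.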
Imo we should not rely on a system that can run phrase queries unexpectedly, and acronyms should be handled at the term level.
Unfortunately, working at the term level (tokenization) is not easy, especially if we don't want to regress on T42612 and T64733.
Proposed quick fix/hack
I suggest experimenting with re-using QueryString in the FullTextSimpleMatchQueryBuilder's "all filter". This would let us keep using the auto_generate_phrase_queries feature.
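As a rough idea of the shape such a filter clause could take, here is a sketch that builds the query body as a Python dict. It is illustrative only: the real builder is not written in Python, the field names ("all", "all.plain"), the AND default operator and the function name are assumptions, and only the auto_generate_phrase_queries option name comes from this task.

```python
import json


def build_all_filter(query, fields=("all", "all.plain")):
    """Hypothetical helper: wrap the user query in a query_string clause.

    The clause is meant to be used as the filter part of the full text
    query, so that acronyms still become phrase queries at filter time.
    """
    return {
        "query_string": {
            "query": query,
            "fields": list(fields),          # assumed field names
            "default_operator": "AND",       # assumed: every word must match
            "auto_generate_phrase_queries": True,
        }
    }


print(json.dumps(build_all_filter("N.A.S.A"), indent=2))
```

Phrase generation would then only decide which documents pass the filter, leaving the scoring part of the query free of unexpected phrase queries.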
I'm really sad that we need to do this, because QueryString is really something I'd like to get rid of, but I don't see other options that could be implemented in a reasonable amount of time.