
Phrase matching with stemming in CirrusSearch
Closed, Resolved (Public)

Description

Author: edwardbetts

Description:
Here is my example query: "station box" AND Helsinki

If I try this search on English Wikipedia I get 'Helsinki Metro' as a result with LuceneSearch, but no results with CirrusSearch.

The wiki text contains this: "Two [[station box]]es were constructed in Hakaniemi."

Stemming in phrase searches in LuceneSearch was a bug, but now I have code that depends on this bug.

I found that Bug 54020 requested this change, which disabled stemming in phrase matches.

It would be useful if it were possible to use CirrusSearch to search for terms next to each other, like a phrase search, but with stemming. The syntax doesn't need to be the same as LuceneSearch. Stemming in phrase matching could be a tick box in advanced search and/or an extra parameter in the search API.


Version: unspecified
Severity: normal

Details

Reference
bz69226

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:41 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz69226.
bzimport added a subscriber: Unknown Object (MLST).

Already done: <<"station box"~ helsinki>>.

There is documentation for this, but it's kind of buried: https://www.mediawiki.org/wiki/Search/CirrusSearchFeatures#Quotes_and_exact_matches
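
For code that calls the search API, here is a minimal sketch of building a request that uses the trailing-tilde syntax above. It assumes the standard MediaWiki `action=query&list=search` endpoint on English Wikipedia; the helper names `stemmed_phrase` and `search_url` are made up for illustration, and nothing here actually sends a request.

```python
from urllib.parse import urlencode

# Assumption: English Wikipedia's public API endpoint.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def stemmed_phrase(phrase):
    """Quote a phrase and append '~', the stemmed-phrase form shown above."""
    return '"{}"~'.format(phrase)


def search_url(phrase, *extra_terms, limit=20):
    """Build a list=search URL for a stemmed phrase plus optional extra terms."""
    query = " ".join([stemmed_phrase(phrase), *extra_terms])
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url("station box", "helsinki"))
```

The URL can be fetched with any HTTP client; `urlencode` takes care of escaping the quotes and spaces in the query.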

edwardbetts wrote:

A search for "station box" on LuceneSearch gives 42 results. Searching for "station box"~ on CirrusSearch gives 327 results, so the new search is matching many more pages.

An example from the first 20 CirrusSearch results, [[Ormside railway station]], doesn't contain any occurrence of the term 'station' followed by the term 'box'.

That looks like a phrase slop error. The default slop should be 0 but is 1 in this case.

For the most part this is caused by the phrase slop issue I mentioned earlier. The temporary workaround is to search for <"station box"~0~>, which uses 0 slop and stemmed matching. I'm switching the default slop to 0 for stemmed matching, so you won't have to do this in a few weeks once it's merged and deployed.
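
To illustrate why slop 1 lets in so many more pages than slop 0, here is a toy sketch in Python. It is not how CirrusSearch or Lucene implement this: `toy_stem` is a crude stand-in for a real stemmer, and slop is simplified to "number of extra token positions allowed inside an in-order match". The second example sentence is invented.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(token):
    """Very rough plural stripping; a stand-in for a real stemmer."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def is_subsequence(needle, haystack):
    """True if the items of needle appear in haystack in the same order."""
    it = iter(haystack)
    return all(term in it for term in needle)


def phrase_match(text, phrase, slop=0, stem=False):
    """In-order phrase match allowing up to `slop` extra positions."""
    norm = toy_stem if stem else (lambda t: t)
    doc = [norm(t) for t in tokenize(text)]
    query = [norm(t) for t in tokenize(phrase)]
    if not query:
        return False
    for start, token in enumerate(doc):
        if token != query[0]:
            continue
        # The whole phrase must fit in a window of len(query) + slop tokens.
        window = doc[start : start + len(query) + slop]
        if is_subsequence(query, window):
            return True
    return False


if __name__ == "__main__":
    wiki = "Two station boxes were constructed in Hakaniemi."
    print(phrase_match(wiki, "station box"))             # unstemmed, slop 0
    print(phrase_match(wiki, "station box", stem=True))  # stemmed, slop 0
    other = "The station signal box was demolished."
    print(phrase_match(other, "station box", slop=0, stem=True))
    print(phrase_match(other, "station box", slop=1, stem=True))
```

With slop 0 only adjacent terms match, so stemming alone picks up "station boxes"; with slop 1 any page with one word between "station" and "box" also matches.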

Another issue that is causing extra results is a thing called "position offset gap". For multivalued fields, a search for "station box" can currently find matches _across_ those multiple values. I found that issue while working on something else a few weeks ago, and the fix is being applied literally right now. It requires an index rebuild, so give it 24 hours.
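
A rough Python sketch of the cross-value matching described above, under simplifying assumptions: token positions in a multivalued field are numbered consecutively, and the gap is a number of positions skipped between values. The field values and the gap of 10 are invented for illustration and are not the actual CirrusSearch settings.

```python
def index_positions(values, gap):
    """Map each token to its positions across all values of one field."""
    positions = {}
    pos = 0
    for value in values:
        for token in value.lower().split():
            positions.setdefault(token, []).append(pos)
            pos += 1
        # Skip `gap` positions before the next value starts.
        pos += gap
    return positions


def exact_phrase(positions, phrase):
    """True if the phrase terms occupy consecutive positions (slop 0)."""
    terms = phrase.lower().split()
    if not terms:
        return False
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


if __name__ == "__main__":
    # Hypothetical multivalued field, e.g. two headings on one page.
    headings = ["Railway station", "Box Hill"]
    print(exact_phrase(index_positions(headings, gap=0), "station box"))
    print(exact_phrase(index_positions(headings, gap=10), "station box"))
```

With no gap, the last token of one value sits directly before the first token of the next, so the phrase matches across the boundary. A gap larger than any slop you allow prevents that while leaving matches inside a single value intact.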

Change 153943 had a related patch set uploaded by Manybubbles:
Switch default phrase slop to 0

https://gerrit.wikimedia.org/r/153943

Change 153943 merged by jenkins-bot:
Switch default phrase slop to 0

https://gerrit.wikimedia.org/r/153943