CirrusSearch shouldn't provide alternative spelling when an exact result is found
Closed, ResolvedPublic

Description

  1. Go to http://ca.wikipedia.org
  2. Search for "José Mourinho"

Or click here:

https://ca.wikipedia.org/w/index.php?title=Especial%3ACerca&profile=advanced&search=Jos%C3%A9+Mourinho&fulltext=Search&ns0=1&ns4=1&ns10=1&ns12=1&redirs=1&profile=advanced

EXPECTED

If such a page exists, just show it.

ACTUAL

Even if the exact page exists and it is listed first in the results, the first message displayed is "Did you mean: jose mourinho"

Well no, I really meant José Mourinho. :)

This is the first time I have paid enough attention to notice this problem consciously (and report it), but I have seen more cases like it. Maybe the pattern is the accent in the search term? I will keep watching.
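To illustrate the accent hypothesis: if the search backend compares text after lowercasing and stripping diacritics, the query and the suggestion collapse to the same string. The sketch below assumes that kind of folding. `fold` is a hypothetical helper, and nothing here is confirmed to be how CirrusSearch analyzes text.

```python
import unicodedata


def fold(text):
    """Lowercase the text and drop combining accent marks (hypothetical helper)."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()
```

Under this assumption, `fold("José Mourinho")` and `fold("jose mourinho")` are equal, which would be consistent with the suggestion differing from the query only by the accent and case.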


Version: master
Severity: normal
See Also:
T41501

Details

Reference
bz59666

Event Timeline

bzimport raised the priority of this task to High. Nov 22 2014, 2:18 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz59666.

Elasticsearch won't make a suggestion unless the suggested text appears to be about 2x as likely as the provided text (our configuration), so I'm guessing this is caused by us getting suggestions from redirects as well as titles. I'll have a look at it soon.

Another thing: I believe the fix for this will be to not provide a suggestion when the entire title is matched. I think it'd be more appropriate for me to implement this in CirrusSearch even though I'm sure that LuceneSearch has the same problem. The reason for this is that if I implement the fix in Cirrus then, one day, when I find a really good excuse to violate the rule, I'll be able to without having to make more convoluted changes to core. I know, YAGNI, but my gut says do it in Cirrus and I'm going to trust it.
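The "about 2x as likely" threshold reads like the `confidence` option of the Elasticsearch phrase suggester, which only keeps candidates that score above the input's score multiplied by that factor. A minimal sketch of the idea, assuming that mapping: the field name `suggest`, the suggestion name `did_you_mean`, and both helper functions are hypothetical and are not taken from the CirrusSearch configuration.

```python
def build_suggest_body(term, field="suggest", confidence=2.0):
    """Build a phrase-suggester request body (the field name is an assumption)."""
    return {
        "suggest": {
            "text": term,
            "did_you_mean": {
                "phrase": {
                    "field": field,
                    "size": 1,
                    # Candidates must beat the input's score times this factor.
                    "confidence": confidence,
                }
            },
        }
    }


def passes_confidence(input_score, candidate_score, confidence=2.0):
    """Mimic the threshold: keep a candidate only if it beats input * confidence."""
    return candidate_score > input_score * confidence
```

With a factor of 2.0, a candidate scoring 1.5 against an input scoring 1.0 would be dropped, while one scoring 2.5 would be kept. If redirects feed the same suggestion field, a redirect spelled without the accent could plausibly push the unaccented phrase over that bar.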

(In reply to comment #1)

I was about to say the exact same thing, except let's fix it in core for all search engines.

It makes no sense to have "Did you mean Foo?" "There is a page called 'Foo'" like 3 lines apart on the same page :)

I was really thinking about doing it in core too, but that little imp in me said we'd want to break that rule one day.

I dunno. Also, I'd like to look into how that "There is a page called 'Foo'" thing comes up. Does it use the near match hook? If so, it'll work properly on wikis with Cirrus as primary come Monday because we're turning off TitleKey for them.

I imagine there are cases where we'll return a fully highlighted title but not have a page that matches the results. Oooh, and check this out: if that fully highlighted title is on the first page of the search results, we'd start showing the did you mean on the second page!

Needs more investigation!

OK! I had a look at it.

  1. "There is a page called 'Foo'" comes from Title::newFromText( $term )->isKnown(). We certainly shouldn't provide a suggestion in that case.
  2. It is still possible for CirrusSearch or LuceneSearch to provide a great match even though the text isn't a known title. Try searching for "pickett charge" or even "main pages". The top result is obviously good enough that a suggestion isn't worth showing, but Cirrus shows one anyway.

I think #1 we should fix in core.
#2 we should probably do in Cirrus because it knows more about how it highlights.

I'll do both.
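The two rules above can be sketched roughly as follows. This is an illustration in Python, not the code from the actual patches: `should_show_suggestion`, `is_full_match`, the `title_is_known` callback (standing in for `Title::newFromText( $term )->isKnown()`), and the `<em>` highlight markers are all assumptions.

```python
import re

# Highlight markup is an assumption; the real markers depend on the backend.
HIGHLIGHT = re.compile(r"<em>(.*?)</em>")


def is_full_match(highlighted_title):
    """True when every word of the title sits inside a highlight marker."""
    if not HIGHLIGHT.search(highlighted_title):
        return False
    # Remove the highlighted parts; only whitespace may remain.
    remainder = HIGHLIGHT.sub("", highlighted_title)
    return remainder.strip() == ""


def should_show_suggestion(term, suggestion, title_is_known, highlighted_titles):
    """Decide whether a "Did you mean" line is worth showing."""
    if not suggestion:
        return False
    # Rule 1: the search term is already a known title.
    if title_is_known(term):
        return False
    # Rule 2: some result title is fully highlighted, so it is a full match.
    if any(is_full_match(title) for title in highlighted_titles):
        return False
    return True
```

For example, a term that is a known title never gets a suggestion, and a result such as `<em>Pickett's</em> <em>Charge</em>` suppresses the suggestion for "pickett charge". Passing only the current page's titles would reproduce the pagination quirk raised earlier in the thread, so a real implementation would need to decide which results to inspect.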

Change 105705 had a related patch set uploaded by Manybubbles:
Don't suggest if the search term is a known title

https://gerrit.wikimedia.org/r/105705

That patch was #1 in core. #2 in cirrus coming later.

Change 105705 merged by jenkins-bot:
Don't suggest if the search term is a known title

https://gerrit.wikimedia.org/r/105705

Change 106523 had a related patch set uploaded by Manybubbles:
Don't suggest anything if a result is a full match

https://gerrit.wikimedia.org/r/106523

Change 106523 merged by jenkins-bot:
Don't suggest anything if a result is a full match

https://gerrit.wikimedia.org/r/106523

Change 107663 had a related patch set uploaded by Chad:
Don't suggest if the search term is a known title

https://gerrit.wikimedia.org/r/107663

Change 107663 abandoned by Chad:
Don't suggest if the search term is a known title

Reason:
Nevermind, just wait til tomorrow :)

https://gerrit.wikimedia.org/r/107663