no text in search results when the match is up to diacritics
Closed, Resolved, Public

Description

Author: catlow

Description:
If you search for a word without diacritics, the search results include matches for the same word with diacritics. Similarly if you search for a phrase with a hyphen, the results include matches with an en dash. (No doubt there are various other similar rules, and this behaviour is very much desired.) However, when you get such a match, the matched text is not displayed in the list of search results, i.e. you get just a link to the relevant page, without the extract(s) from that page's text which you would normally see in the results list if the match were exact.


Version: unspecified
Severity: minor

Details

Reference
bz13849

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:12 PM
bzimport set Reference to bz13849.

Can you give an example URL?

I'm guessing this is on Wikipedia (or another Wikimedia site) and may be due to current mismatches between how the Lucene backend matches words and how the front end matches them when highlighting results. If so, I believe this should improve when the next version of the Lucene backend, which supports doing the highlighting itself, rolls out.

catlow wrote:

Example URLs:

http://en.wikipedia.org/wiki/Special:Search?search=Banach-Steinhaus&fulltext=Search
(first result returned is Stefan Banach, but text is missing because in that article the reference contains an en dash rather than a hyphen)

http://en.wikipedia.org/wiki/Special:Search?search=sniezycowy&fulltext=Search
(two results returned, but text missing because the articles contain Sniezycowy with Polish diacritics)

rainman wrote:

This also happens for stemmed words, transliterations, and words in different scripts (variants). As noted in #1, it is due to the mismatch between MediaWiki highlighting and backend functionality. It will be solved when we switch highlighting to the backend.
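
To illustrate the mismatch described above, here is a minimal Python sketch, not the actual MediaWiki or Lucene code. It assumes the backend matches on a folded form of the text (diacritics stripped, dashes mapped to hyphens), while a naive highlighter looks for the literal query string and therefore finds nothing to show. The names `EXTRA_FOLDS`, `fold_char`, `fold_with_offsets` and `snippet` are hypothetical, and the fold table is only an assumed example of such rules. The sketch folds both sides the same way and keeps an offset map so the extract can be cut from the original text.

```python
import unicodedata

# Assumed example folds for characters that NFKD does not decompose;
# the real backend rules are not described in this report.
EXTRA_FOLDS = {"\u2013": "-", "\u2014": "-", "\u0142": "l", "\u0141": "L"}


def fold_char(ch):
    """Fold one character to a lowercase form without diacritics."""
    ch = EXTRA_FOLDS.get(ch, ch)
    decomposed = unicodedata.normalize("NFKD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def fold_with_offsets(text):
    """Return the folded text and, per folded character, its index in `text`."""
    folded, offsets = [], []
    for i, ch in enumerate(text):
        for f in fold_char(ch):
            folded.append(f)
            offsets.append(i)
    return "".join(folded), offsets


def snippet(text, query, context=20):
    """Return an extract of `text` with the folded match in brackets, or None."""
    folded_text, offsets = fold_with_offsets(text)
    folded_query, _ = fold_with_offsets(query)
    if not folded_query:
        return None
    pos = folded_text.find(folded_query)
    if pos < 0:
        return None
    # Map the match in folded space back to positions in the original text.
    start = offsets[pos]
    end = offsets[pos + len(folded_query) - 1] + 1
    # Keep any combining marks that trail the last matched character.
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    left = max(0, start - context)
    right = min(len(text), end + context)
    return text[left:start] + "[" + text[start:end] + "]" + text[end:right]


if __name__ == "__main__":
    page = "the Banach\u2013Steinhaus theorem"
    # A literal search finds nothing, which is the empty-extract symptom.
    print("Banach-Steinhaus" in page)
    # The folding-aware search recovers the matching span.
    print(snippet(page, "Banach-Steinhaus", context=4))
```

Folding on both sides with a position map is one way to keep highlighting consistent with matching; moving snippet extraction into the backend, as proposed above, avoids having to duplicate the folding rules in the front end at all.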

Mass close as WONTFIX of open Lucene Search issues, because the Lucene Search extension was removed and replaced by MWSearch. Please set to REOPENED if the behaviour still exists with another component, and update the domain.

Mass REOPEN after discussion with Robert. Domain: Wikimedia/lucene-search-2. Assigned to maintainer.

rainman wrote:

We are now using our custom snippet-extraction backend on WMF wikis, so this no longer happens.