CirrusSearch: Make sure all sorts of apostrophes count as word breaks
Open, Medium, Public

Description

Make sure all sorts of apostrophes count as word breaks. In particular, “L’Oréal”, “L Oréal”, and “L'Oréal” really ought to map to the same terms. Since one of these forms contains a space, the only sane way to do that is to map each of them to two terms.
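To make the request concrete, here is a minimal Python sketch of the desired behavior, not the CirrusSearch or Elasticsearch implementation. The `tokenize` helper and the `APOSTROPHES` set are hypothetical names introduced for illustration; the sketch only assumes that apostrophes and whitespace both act as word breaks.

```python
import re

# Characters treated as apostrophes here. This set is an assumption for
# illustration, not the list CirrusSearch actually uses.
APOSTROPHES = "'\u2019"


def tokenize(text):
    """Lowercase the text and split it on whitespace and apostrophes."""
    pattern = "[\\s" + re.escape(APOSTROPHES) + "]+"
    return [term for term in re.split(pattern, text.lower()) if term]


# All three spellings from the description yield the same two terms.
for variant in ("L\u2019Oréal", "L Oréal", "L'Oréal"):
    print(variant, "->", tokenize(variant))
```

Under this assumption each spelling is indexed as the two terms `l` and `oréal`, so a query in any one form can match documents written in the others.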


Version: unspecified
Severity: normal

Details

Reference
bz58701

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:36 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz58701.
bzimport added a subscriber: Unknown Object (MLST).

I know this is right for English, but maybe/probably not other languages.

froisois wrote:

(In reply to comment #1)

This is right for French: apostrophes in this language are basically the elision of a vowel and a space.

(In reply to comment #2)

The new search has a special filter to handle French elision (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-elision-tokenfilter.html). I'll crack open the code and see what it does when I start work on this bug.
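As a rough illustration of what an elision filter does, the Python sketch below strips a leading elided article from a token. It is a simplified model, not the Elasticsearch or Lucene code; `strip_elision` is a hypothetical helper, and the `ARTICLES` list is a made-up example rather than the filter's real default list.

```python
# Hypothetical article list for illustration only; the real filter's
# default list may differ.
ARTICLES = {"l", "d", "qu"}
APOSTROPHES = ("'", "\u2019")


def strip_elision(token, articles=ARTICLES):
    """Drop a leading elided article such as "l'" or "d\u2019" from a token."""
    for apostrophe in APOSTROPHES:
        head, sep, tail = token.partition(apostrophe)
        # Only strip when an apostrophe was found, something follows it,
        # and the part before it is a listed article.
        if sep and tail and head.lower() in articles:
            return tail
    return token


print(strip_elision("L\u2019Art"))       # Art
print(strip_elision("d'avoir"))          # avoir
print(strip_elision("lorsqu\u2019il"))   # unchanged: "lorsqu" is not listed
```

A filter of this kind only helps for the articles it knows about, so any prefix missing from the list leaves the token untouched.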

froisois wrote:

(In reply to comment #3)

This new filter seems great. (Your link doesn’t mention “d’” as a stop word; it will be worth checking when you work on the code.)
I’ve done some search tests on frwikisource, and it appears that:

- The apostrophes “'” and “’” are indeed interchangeable in the new Elasticsearch-based search: priority is given to the apostrophe typed in the search box, but results with the other one are returned as well. For example, the search “l'art d'avoir raison stratagème” first returns a redirect page, but also every occurrence of “L’Art d’avoir toujours raison”. I don’t think this is due to the elision token filter, though: the search “Morestal lorsqu'il” returns the same results as “Morestal lorsqu’il”, even though “lorsqu” is not in this filter.

- Despite this filter, apostrophes in French stop words don’t seem to break words either: the search “avoir toujours raison” doesn’t return “L’Art d’avoir toujours raison”, and the search “art d’avoir toujours raison” returns it, but “Art” is not bolded in the search result.

mr.heat wrote:

In German we use apostrophes much as in English and French. You can write "what is" as "what's" in English and "ist es" as "ist’s" in German. That's always two words.

A special example is "Peter’s Bar". That's actually wrong in German. It must be written as "Peters Bar". However, in both cases the "s" is not part of the name. So the conclusion is the same: two words.

In German we prefer U+2019 over every other character. However, people tend to misuse many other characters including U+0027, U+0060, U+00B4 and others.
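One way to cope with this variety is to fold the apostrophe-like characters into a single code point before tokenizing. The Python sketch below covers only the four code points named above; `normalize_apostrophes` is a hypothetical helper, and folding to U+0027 is an assumption made for illustration, not a description of what CirrusSearch does.

```python
# Code points named in the comment above. The "and others" are not
# covered; extend this string as needed.
APOSTROPHE_LIKE = "\u2019\u0027\u0060\u00b4"

# Translation table that maps each apostrophe-like character to U+0027.
TO_ASCII_APOSTROPHE = str.maketrans({ch: "'" for ch in APOSTROPHE_LIKE})


def normalize_apostrophes(text):
    """Replace every apostrophe-like character with a plain U+0027."""
    return text.translate(TO_ASCII_APOSTROPHE)


for spelling in ("ist\u2019s", "ist's", "ist`s", "ist\u00b4s"):
    print(normalize_apostrophes(spelling))
```

After this step a single word-break rule for U+0027 would handle all four spellings the same way.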

Restricted Application added a subscriber: Aklapper.