
Fails to find same word with an apostrophe before (French usage)
Closed, Resolved, Public

Description

When I search for a word, the search engine fails to find the same word when it is preceded by an apostrophe, so a search for "apostrophe" doesn't find occurrences of "L'apostrophe".
In French the apostrophe is not part of the word; "L'apostrophe" is a contraction of "La apostrophe".

For example:
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=Arc-en-Ciel&fulltext=Search
"Arc-en-Ciel" doesn't match any "L'Arc-en-Ciel" (with L') directly.

Compare with a search for "L'Arc-en-Ciel":
https://en.wikipedia.org/w/index.php?search=L%27Arc-en-Ciel&title=Special%3ASearch&fulltext=1

In French, as in English, the apostrophe should not be indexed as part of the word.

Note: this is the same bug as https://bugzilla.wikimedia.org/show_bug.cgi?id=9598 (old)
See also a different apostrophe usage in Ukrainian: https://bugzilla.wikimedia.org/show_bug.cgi?id=21002


Version: unspecified
Severity: normal

Details

Reference
bz57832

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 2:36 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz57832.
bzimport added a subscriber: Unknown Object (MLST).

This problem still exists in CirrusSearch. Migrating bug to correct queue.

The problem here is that the language rules are customized for the wiki's language. Elision is handled in French but not English.

I wonder how much harm it would be to just add it to English (and maybe other languages) as well. Here are the term prefixes that would be removed:
l'
m'
t'
qu'
n'
s'
j'
d'
c'
jusqu'
quoiqu'
lorsqu'
puisqu'
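
As a rough illustration of what removing these prefixes would mean, here is a minimal Python sketch. It is not CirrusSearch or Elasticsearch code: the function name `strip_elision` and the handling of the typographic apostrophe are assumptions made for this example only.

```python
# Hypothetical sketch of an elision filter; not the actual search-engine code.
ELISION_PREFIXES = {
    "l", "m", "t", "qu", "n", "s", "j", "d", "c",
    "jusqu", "quoiqu", "lorsqu", "puisqu",
}

# ASCII apostrophe and typographic apostrophe (the latter is an assumption).
APOSTROPHES = ("'", "\u2019")


def strip_elision(token):
    """Return the token without a leading elided prefix, if it has one."""
    for apostrophe in APOSTROPHES:
        prefix, sep, rest = token.partition(apostrophe)
        # Only strip when an apostrophe was found, something follows it,
        # and the part before it is one of the listed prefixes.
        if sep and rest and prefix.lower() in ELISION_PREFIXES:
            return rest
    return token


print(strip_elision("L'Arc-en-Ciel"))  # Arc-en-Ciel
print(strip_elision("don't"))          # don't ("don" is not a listed prefix)
```

Note that a filter like this only looks at the prefix, so it would also strip the `t'` or `d'` from English or other non-French tokens that happen to start that way.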

We wouldn't add it to the plain analyzer, so if you search for "l'avion" then "l'avion" will be worth more than "avion".

TJones closed this task as Resolved. Edited Dec 11 2018, 6:54 PM
TJones claimed this task.
TJones subscribed.

> The problem here is that the language rules are customized for the wiki's language. Elision is handled in French but not English.

That is the crux of the original problem.

> I wonder how much harm it would be to just add it to English (and maybe other languages) as well. Here are the term prefixes that would be removed:

I don't think this is a great idea. We could pick some low-hanging fruit by adding, say, some French rules to English-language processing, but should we also add something for German, or Spanish, or Ukrainian? Do we only add them to English, or also to Welsh?

Which "foreign" rules will interact poorly with which native languages? Hard to say without a lot of work. It would also require unpacking the monolithic analyzers we have to enable it, and some of the third-party analyzers can't be unpacked, so they would be treated differently, etc., etc.

However, the current language analysis for English splits on apostrophes, so the original problem of searching for Arc-en-Ciel and finding l'Arc-en-Ciel is working correctly now. (Splitting on apostrophes is actually not always great: searching for don't can match Don T. and Donal T. and donning T-shirts, though instances of "don't" are generally ranked higher unless you do a lot of work.)
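
To make the effect of splitting on apostrophes concrete, here is a small Python sketch. It assumes a simplified tokenizer that lowercases and splits on every non-word character; the real English analysis chain does more than this (stemming, for instance), and `tokenize` is a name invented for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and split on runs of non-word characters.

    Simplified stand-in for an analyzer that splits on apostrophes
    (and, in this sketch, on hyphens and other punctuation as well).
    """
    return [t for t in re.split(r"\W+", text.lower()) if t]


print(tokenize("Arc-en-Ciel"))    # ['arc', 'en', 'ciel']
print(tokenize("L'Arc-en-Ciel"))  # ['l', 'arc', 'en', 'ciel']
print(tokenize("don't"))          # ['don', 't']
print(tokenize("Don T."))         # ['don', 't']
```

Under this assumption every term of "Arc-en-Ciel" also appears in "L'Arc-en-Ciel", which is why the original search now succeeds, while "don't" and "Don T." become indistinguishable at the term level.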

However, because the English analyzer doesn't drop French stop words, searching for d'homme or l'homme will not find all instances of homme.
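
A short sketch of why that happens, under two assumptions that are not spelled out in this task: every query term must be present in a document for it to match, and no French stop words ("l", "d") are dropped. The helper names `tokenize` and `matches` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters, including apostrophes."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def matches(query, document):
    """Return True when every query term occurs in the document (assumed AND logic)."""
    doc_terms = set(tokenize(document))
    return all(term in doc_terms for term in tokenize(query))


print(matches("l'homme", "L'homme qui rit"))  # True: both "l" and "homme" occur
print(matches("l'homme", "un homme"))         # False: "l" is required but absent
print(matches("homme", "L'homme qui rit"))    # True: searching the bare word works
```

Because "l" survives as a required term, the query "l'homme" only finds documents that also contain a standalone "l", which is the gap described above.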

I'm going to close the ticket, because the main issue is working now and I don't think we're going to enable any more French-specific analysis on any non-French wikis.