
Wrong processing of the apostrophe by the search engine in Ukrainian
Closed, Resolved (Public)

Description

Author: yevhen

Description:
In the Ukrainian language, the apostrophe usually appears in the middle of a word to mark the specific pronunciation of certain sounds. The problem is that the apostrophe symbol («’», U+2019) is probably treated by the search engine as a quotation mark, so a word containing it is split into two separate words. For example, the word «xxxxx’yyyyyy» will be recognized as the two words xxxxx and yyyyyy. This makes such words impossible to find, and makes it impossible to give articles names that contain the apostrophe.

This bug was never reported before because the keyboard offers several other symbols that look like the apostrophe and are used in its place: «‘», «'», «`». According to the Ukrainian typographic standard, however, the only correct symbol is «’», U+2019.

This bug doesn’t show up if the ' mark is used instead of the U+2019 symbol, which is the temporary workaround currently in wide use on Ukrainian Wikipedia. But to keep Ukrainian Wikipedia in line with the rules of the language, the U+2019 apostrophe should be processed correctly as well.
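
The reported behavior can be reproduced with a small sketch. This is not the lucene-search tokenizer; the two regular expressions and the function names below are hypothetical and only model the symptom described above: a tokenizer that keeps the ASCII apostrophe inside a word but splits at U+2019, next to one that accepts both.

```python
import re

# Hypothetical patterns for illustration only (not the lucene-search code).
# Models the reported behavior: only the ASCII apostrophe may join word parts.
ASCII_ONLY_WORD = re.compile(r"\w+(?:'\w+)*")
# Models the requested behavior: U+2019 is accepted inside a word as well.
BOTH_APOSTROPHES_WORD = re.compile(r"\w+(?:['\u2019]\w+)*")


def tokens_ascii_only(text):
    """Split text into words, allowing only ' inside a word."""
    return ASCII_ONLY_WORD.findall(text)


def tokens_both(text):
    """Split text into words, allowing ' and U+2019 inside a word."""
    return BOTH_APOSTROPHES_WORD.findall(text)


workaround = "п'ять"        # ASCII apostrophe, the current workaround
typographic = "п\u2019ять"  # U+2019, the typographically correct form

print(tokens_ascii_only(workaround))   # one token
print(tokens_ascii_only(typographic))  # two tokens: the word is split
print(tokens_both(typographic))        # one token
```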


Version: unspecified
Severity: normal

Details

Reference
bz21002

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:48 PM
bzimport set Reference to bz21002.

Assigning to Robert and moving to lucene search component.

yevhen wrote:

A possible hint toward the solution: in Ukrainian the apostrophe is never used with a space before or after it, whereas a quotation mark does have a space nearby. This may help distinguish them.
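
A minimal sketch of this heuristic follows, under one stated assumption: instead of checking only for the absence of spaces, it requires a letter on both sides of the mark, which is a slightly stricter variant of the rule. The names `APOSTROPHE_CANDIDATES`, `is_word_internal`, and `word_internal_positions` are hypothetical.

```python
# Marks that could be either an apostrophe or a closing quotation mark.
# (Hypothetical set, chosen for illustration.)
APOSTROPHE_CANDIDATES = {"'", "\u2019"}


def is_word_internal(text, i):
    """Return True if text[i] is a candidate mark with a letter on both sides."""
    if text[i] not in APOSTROPHE_CANDIDATES:
        return False
    # Treat the start and end of the text like whitespace.
    before = text[i - 1] if i > 0 else " "
    after = text[i + 1] if i + 1 < len(text) else " "
    return before.isalpha() and after.isalpha()


def word_internal_positions(text):
    """Return the indexes of all marks classified as apostrophes."""
    return [i for i in range(len(text)) if is_word_internal(text, i)]


# A quoted word followed by another word; both contain U+2019 as an apostrophe,
# and the quoted word is closed with U+2019 used as a quotation mark.
sample = "\u2018сім\u2019я\u2019 і п\u2019ять"
print(word_internal_positions(sample))
```

The closing quotation mark after «сім’я» is followed by a space, so the sketch leaves it out, while the two word-internal marks are kept.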

rainman wrote:

This should be easy to do: we would need to add the extra characters as apostrophe characters and reindex uk.wiki. Do you want all 4 as possible apostrophes, or only the "proper" one?

yevhen wrote:

I guess we'd better quickly discuss that on the Ukrainian wiki. I'll let you know by tomorrow. Thanks!

yevhen wrote:

We decided that we need at least two symbols: «'» and «’» (U+2019), as the former is already used in many articles, and we will probably need a transition period during which both symbols are used equally. It would also be great if these symbols were interchangeable from the search engine's point of view, so that the query «xx’yy» would find both the «xx'yy» and «xx’yy» words. Thank you!
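
One way to picture the requested interchangeability is to fold both symbols into a single canonical form before comparing a query with an indexed word. The sketch below assumes that approach; it is not necessarily how the fix was implemented, and `APOSTROPHE_VARIANTS`, `CANONICAL_APOSTROPHE`, `normalize_apostrophes`, and `matches` are hypothetical names.

```python
# The two symbols requested in this comment; other look-alikes are left alone.
APOSTROPHE_VARIANTS = ("'", "\u2019")
# Assumption: U+2019 is chosen as the canonical form.
CANONICAL_APOSTROPHE = "\u2019"


def normalize_apostrophes(text):
    """Replace every accepted apostrophe variant with the canonical one."""
    for variant in APOSTROPHE_VARIANTS:
        text = text.replace(variant, CANONICAL_APOSTROPHE)
    return text


def matches(query, indexed_word):
    """Compare a query and an indexed word after apostrophe normalization."""
    return normalize_apostrophes(query) == normalize_apostrophes(indexed_word)


print(matches("xx\u2019yy", "xx'yy"))       # U+2019 query against ASCII form
print(matches("xx\u2019yy", "xx\u2019yy"))  # U+2019 query against U+2019 form
```

Applying the same normalization at index time and at query time is what makes the two spellings equivalent, which is consistent with the need to rebuild the index.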

rainman wrote:

Fixed in r57932; it needs an index rebuild to go live (should be done in the next couple of days).

yevhen wrote:

Thank you very much! I'll test it once the database has been reindexed and let you know whether everything works.

yevhen wrote:

Everything is fine, thank you!

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]