Improve processing of the apostrophe by the search engine in Ukrainian
Closed, Resolved · Public
Assigned To

Authored By

	Sasha1024
	Sep 22 2016, 9:56 AM

Description

There are three ways to represent Ukrainian apostrophe:

U+0027 ' APOSTROPHE — produces bad-looking apostrophe, but is ASCII-compatible.
U+2019 ’ RIGHT SINGLE QUOTATION MARK
U+02BC ʼ MODIFIER LETTER APOSTROPHE

The second and third look identical (according to the Unicode code charts); the difference is that the second is a punctuation mark, while the third is considered part of a word.
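
The punctuation versus part-of-word distinction is visible in each character's Unicode general category. A minimal sketch using Python's standard `unicodedata` module (the helper name `describe` is only illustrative):

```python
import unicodedata

# The three apostrophe candidates listed in the task description.
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]

def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    # Categories starting with "P" are punctuation; "Lm" is a modifier letter.
    print(*describe(ch), sep="\t")
```

U+0027 and U+2019 fall into punctuation categories (Po and Pf), while U+02BC is a modifier letter (Lm), which is why software tends to treat only the third as part of a word.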

Today's tendency is to use the third one. It was chosen as the apostrophe character for Ukrainian IDNs, and it is the main apostrophe in the Ukrainian Unicode layout (the default keyboard layout for X.Org), where U+02BC now sits on the key that formerly produced U+2019.

Ukrainian Wikipedia mostly uses the first in text and titles for compatibility. It is possible to use the second in search: a full-text search for "м<U+2019>ясо" shows "м<U+0027>ясо" in the results, and quick search goes there immediately (maybe this is a result of T23002). It isn't possible to use the third in search yet.

Please make U+02BC act like U+2019 in search. (Currently U+2019 and U+02BC are "competing" characters for the typographically correct apostrophe: U+02BC was chosen by ICANN, but usage of U+2019 started earlier.)

Details

	Subject	Repo	Branch	Lines +/-
	Improve processing of the apostrophe by the search engine in Ukrainian	mediawiki/extensions/CirrusSearch	master	+34 -16

Related Objects

Mentioned In: T160106: Test and analyze new Ukrainian language analyzers
T147505: [tracking] CirrusSearch: what is updated during re-indexing
Mentioned Here: T148051: Track progress of Ukrainian Analyzer in Lucene/Elastic
T147959: Generic language fallbacks in Mediawiki should not be used for Elasticsearch language analyzers
T124592: Cyrillic 'Е' and 'Ё' equivalence not found by search
T23002: Wrong processing of the apostrophe by the search engine in Ukrainian

Event Timeline

Sasha1024 created this task. (Sep 22 2016, 9:56 AM)

Restricted Application added subscribers: Base, Aklapper. (Sep 22 2016, 9:56 AM)

Sasha1024 updated the task description. (Sep 22 2016, 10:00 AM)

Sasha1024 updated the task description.

Aklapper added a project: CirrusSearch. (Sep 22 2016, 11:21 AM)

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. (Sep 22 2016, 11:21 AM)

debt triaged this task as Medium priority. (Sep 22 2016, 10:04 PM)

debt moved this task from needs triage to This Quarter on the Discovery-Search board.

debt added a subscriber: TJones.

debt moved this task from This Quarter to Current work on the Discovery-Search board. (Oct 4 2016, 5:33 PM)

debt edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones claimed this task. (Oct 4 2016, 5:36 PM)

TJones moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board. (Oct 6 2016, 5:00 PM)

TJones mentioned this in T147505: [tracking] CirrusSearch: what is updated during re-indexing. (Oct 6 2016, 6:04 PM)

I've discovered today that Ukrainian-language text is processed using the Russian-language analyzer. At the moment, that seems to mean that this change has to be made to the Russian-language analyzer, and that changes made to the Russian analyzer, such as T124592, will affect Ukrainian (and others). I've put together a list of the fallback languages specified in MediaWiki.

I'm going to need to think about this a bit more.

EDIT: It is possible to special-case bits of the analyzer by language, but it's moderately hacky.

@TJones, IMHO treating all three apostrophes (U+0027, U+2019, U+02BC) as one sign in search shouldn't harm Russian or any other languages using Russian-language analyzer.

U+2019 and U+02BC look identical according to the Unicode code charts, and even for English there are (not so popular) opinions that U+2019 is the wrong choice and that U+02BC should be used for the apostrophe instead. U+0027 is the common backward-compatible, low-quality substitute for both U+2019 and U+02BC.
The only remaining question is whether the apostrophe sign (U+0027, U+2019 or U+02BC) should be treated as a part-of-word character or as a punctuation character. My answer is: as a part-of-word character. Some Cyrillic-script languages (like Ukrainian) treat the apostrophe as an important part-of-word sign with a specific meaning; others (like Russian) don't use the apostrophe in native words but sometimes use it in loanwords (like Jeanne d'Arc). In both cases it seems to be a part-of-word character¹, not a punctuation character. Cyrillic-script languages usually don't use single quotes¹, only double quotes or special ones (e.g. «»).

¹ Actually, this should be checked: that all the Russian-analyzer-dependent languages listed here use the apostrophe only as part of a word, not as punctuation, and that none of them uses single quotes. But I'm 90% sure that it's true. And, if I understand correctly, the canonical single quote characters (U+0027, U+2019) were already set to be interpreted as part-of-word in the Russian analyzer by T23002 (maybe not always, maybe only when not next to a space). The fact is that characters that could at least sometimes mean a single quote are already treated as equivalent apostrophes, while the character that can never mean a single quote (U+02BC) isn't.

P.S.: Of course, there is no need to hurry. I'm not against any deep analysis and farsighted solutions. I just expressed my opinion.

@Sasha1024 , thanks for the feedback!

This particular case of the apostrophe-like characters isn't too bad, as you outlined, though I do worry about languages that might use the apostrophe as a proper letter; I know some do—though I don't know of any that also use Cyrillic.

I generally like to be conservative with changes, especially when affecting a number of languages that we haven't specifically investigated. I'm trying to work out a general framework for handling this kind of situation. I'm wondering, for example, whether T124592 should be special-cased just to Russian. I don't think it affects Ukrainian, but it may affect others.

So, taking it a little slow sounds good. But....

We have a bit of a deadline because this change requires the relevant wikis to be re-indexed. We don't do that often, though a re-index is coming up, related to the BM25 implementation (which is why I've recently been looking at changes that need a re-index). We'll be trying to do re-indexes more often in the future, but we have to see how this one goes before we have a good idea of how difficult it is. We can delay the re-indexing of some wikis while we work on patches that require re-indexing, but we can't delay forever.

@TJones

though I do worry about languages that might use the apostrophe as a proper letter

Sorry for the stupid question, but what is the difference between apostrophe-as-proper-letter and apostrophe-as-part-of-word-but-still-not-considered-a-letter? I understand the difference between apostrophe-as-part-of-word and apostrophe-as-punctuation-mark: in the first case "abc'def" gives one lexeme, and in the second it gives two lexemes ("abc" and "def"). But I don't understand what the practical difference for search is between something considered a letter and something considered part of a word (but still not a letter).

I'm wondering, for example, whether T124592 should be special-cased just to Russian.

IMO, you're right. All wikis say that Ё is used only in Russian, Belarusian and Rusyn, and enwiki specifies: "[unlike Russian] in Belarusian and Rusyn, the letters Е and Ё are separate and not interchangeable". I don't know the exact rules of Belarusian and Rusyn, though, so I can't confirm this personally. (Theoretically, even though the formal rules of Belarusian and Rusyn forbid interchanging Е and Ё, Belarusian and Rusyn wikis might still benefit from Е=Ё equivalence in search for specific reasons, such as informal, erroneous or fast typing. I think that is unlikely, but I can't fully exclude the possibility.)
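
To make "special-cased just to Russian" concrete, here is a minimal sketch of a per-language Е/Ё folding step. The names `YO_FOLDING_LANGS` and `fold_yo` are hypothetical and do not come from CirrusSearch; the set of languages is an assumption that would need checking per language.

```python
# Hypothetical: language codes for which Ё is folded to Е before matching.
YO_FOLDING_LANGS = {"ru"}

# Translation table covering both the lowercase and uppercase letter.
YO_MAP = str.maketrans({"ё": "е", "Ё": "Е"})

def fold_yo(text: str, lang: str) -> str:
    """Fold Ё to Е only for languages that treat them as interchangeable."""
    if lang in YO_FOLDING_LANGS:
        return text.translate(YO_MAP)
    return text

print(fold_yo("Несёт", "ru"))  # folded
print(fold_yo("Несёт", "be"))  # left unchanged
```

With this shape, adding or removing a language is a one-line change to the set, and languages that were never investigated keep their text untouched.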

Thanks for your answer.

I would like to clarify what is needed here. What we need is to make sure that a user typing any of these three apostrophes will be able to find the article they need.

The current state is the following:

apostrophe U+0027 ( ' ) is used by default in Ukrainian projects, both in text and in page titles
apostrophe U+2019 ( ’ ) is supported since T23002
apostrophe U+02BC ( ʼ ) is not supported yet. The request is to add support for it in the same way as was done in T23002.

For example, let's pick a random name, e.g. Бустарв'єхо

Search for Бустарв'єхо finds the article and other search results
Search for Бустарв’єхо still finds the article despite the difference in apostrophes, owing to support previously added
Search for Бустарвʼєхо does not find anything. This should be fixed
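
The desired behaviour can be sketched as a normalization step applied to both the indexed text and the query, so that all three spellings compare equal. This is only an illustration in plain Python, not the search engine's code; `normalize_apostrophes` is a hypothetical helper, and the choice of U+0027 as the target is an assumption based on it being the default on Ukrainian projects.

```python
# Map the two typographic apostrophes onto the ASCII one (assumed target).
APOSTROPHE_MAP = str.maketrans({"\u2019": "\u0027", "\u02BC": "\u0027"})

def normalize_apostrophes(text: str) -> str:
    """Replace U+2019 and U+02BC with U+0027."""
    return text.translate(APOSTROPHE_MAP)

# The same name typed with each of the three apostrophes.
variants = ["Бустарв\u0027єхо", "Бустарв\u2019єхо", "Бустарв\u02BCєхо"]
normalized = {normalize_apostrophes(v) for v in variants}
print(len(normalized))  # a single spelling remains after normalization
```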

@Sasha1024,

what is difference between apostrophe-as-proper-letter and apostrophe-as-part-of-word-but-still-not-considered-a-letter?

Sorry for the confusion. I was mostly thinking out loud and worrying about details while I didn't really understand the technical situation. The difference—if the relevant language analyzer made the distinction—might be between keeping the apostrophe in the word, dropping the apostrophe (but not splitting the word on it), and splitting words at apostrophes.
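
The three behaviours can be sketched with toy tokenizers. This uses plain Python regular expressions rather than any real analyzer, and the function names are only illustrative. Note that Python's `\w` already matches U+02BC (it is a letter), so the splitting variant has to exclude the apostrophes explicitly.

```python
import re

APOSTROPHES = "\u0027\u2019\u02BC"

def keep_apostrophe(text: str) -> list[str]:
    """Apostrophe is a word character: one token, apostrophe kept."""
    return re.findall(rf"[\w{APOSTROPHES}]+", text)

def drop_apostrophe(text: str) -> list[str]:
    """Apostrophe is deleted, but the word is not split on it."""
    return re.findall(r"\w+", re.sub(f"[{APOSTROPHES}]", "", text))

def split_on_apostrophe(text: str) -> list[str]:
    """Apostrophe is punctuation: the word is split into two tokens."""
    return re.findall(rf"[^\W{APOSTROPHES}]+", text)

word = "м\u02BCясо"
print(keep_apostrophe(word), drop_apostrophe(word), split_on_apostrophe(word))
```

For search, the first two differ mainly in whether a query without any apostrophe can match; the third can match unrelated words that happen to share a fragment.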

Given that the Russian analyzer is (or would be) used for Abkhazian, Avar, Bashkir, Buryat, Chechen, Crimean Tatar, Chuvash, Ingush, Komi-Permyak, Karachay-Balkar, Komi, Lak, Lezgian, Meadow Mari, Hill Mari, Erzya, Livvi, Ossetian, Sakha, Tatar, Tuvan, Udmurt, Ukrainian, and Kalmyk, I don't know off the top of my head whether any of those languages might react poorly to any particular method of dealing with apostrophes and apostrophe-like characters, so I worry.

It's going to take a while, but I plan to unravel the fallback mess, which I now understand was motivated by historical and geographical concerns, not linguistic ones. See T147959.

In the short term, I think I have a technical solution that will limit the Ukrainian changes to Ukrainian wikis. I need to run some tests and talk to other developers about the changes, but I hope to get it done soon, unless the result is a technical abomination.
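
One way such a language-limited change could look is sketched below. It assumes Elasticsearch's `mapping` character filter, whose rules are written as `from=>to`. The helper names and the filter name `ukrainian_apostrophes` are hypothetical, and this is not necessarily what the eventual patch does.

```python
import json

def apostrophe_char_filter() -> dict:
    """Settings for an Elasticsearch "mapping" char filter (rules are "from=>to")."""
    return {"type": "mapping", "mappings": ["\u2019=>\u0027", "\u02BC=>\u0027"]}

def analysis_overrides(lang: str) -> dict:
    """Hypothetical helper: extra analysis settings for one wiki language."""
    if lang != "uk":
        # Other wikis that fall back to the Russian analyzer are left untouched.
        return {}
    return {"char_filter": {"ukrainian_apostrophes": apostrophe_char_filter()}}

print(json.dumps(analysis_overrides("uk"), ensure_ascii=False, indent=2))
```

Keying the override on the wiki language rather than on the analyzer keeps the other languages that share the Russian analyzer out of scope.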

@NickK,

Thanks for the clarification. The problem, which I hope I've made clear now, is that there is an unfortunate entanglement among linguistically unrelated languages that makes any changes have potentially far-reaching effects on wikis in other languages. So, I know what needs to be done for Ukrainian, but I wasn't sure how to do it in a clean way while being conservative in its potential effects on other languages.

Doing it right also means determining a general framework for splitting out these kinds of distinctions. The long-term plan is T147959, but I want to make this change for Ukrainian sooner rather than later if technically possible. I think it is, but I have to run some more tests.

@Sasha1024 & @NickK,

Do either of you (or anyone else reading along) have any strong feelings about the use of the Russian language analyzer for Ukrainian? At least Ukrainian and Russian are somewhat linguistically similar (unlike other fallback language pairs, like Wolof and French). At a guess, is it a net gain to treat Ukrainian words like Russian words (i.e., for determining root words based on the morphology) rather than treating Ukrainian words as plain strings?

As an example, if you apply English analysis to Spanish, you do okay on plurals (the singular of gatos is in fact gato), but you miss the connection between masculine and feminine adjectives (rojo and roja are not treated as variants of the same word), and different forms of verbs (hablo, hablas, hablamos) are also not connected. English analysis doesn't seem to do anything ridiculous to Spanish words in general, though it misses a lot. It's probably not worth doing in the absence of a Spanish analyzer, but it isn't horrible.

(I'm also trying to figure out how best to ask this question, since we'll want to ask it a lot as we untangle the existing fallbacks. Avoiding linguistic terminology and examples in languages the person I'm asking likely doesn't know makes it very hard.)

Well, Russian, Ukrainian and Belorussian are quite close. (Polish is close too, and some claim that it's even closer to Ukrainian than Russian is, but as Polish uses a different alphabet, it seems to be out of scope.) (BTW, what about Belorussian? Does it use a different lexical analyzer? Theoretically, it may fit Ukrainian even better than Russian.)

I can't say whether it's good to use the Russian lexical analyzer for Ukrainian, because I don't know what exactly it does. Theoretically, if I were ever to write lexical analyzers for Ukrainian/Belorussian/Russian, I'd rather write one combined analyzer for all three (with the exact language specified as an option affecting some conditional switches in the program) than write each one from scratch. But if it's about just applying a purely Russian (not semi-universal) analyzer to Ukrainian... well, I don't know.

Ukrainian has almost the same grammatical categories as Russian. Russian nouns have 6 grammatical cases, Ukrainian has the same 6 + 1 more; Russian nouns belong to 3 declensions, Ukrainian has the same 3 + 1 more. Russian verbs have 3 tenses, Ukrainian has the same 3 + 1 more.
However, these same grammatical categories are often realized differently (in pronunciation and writing). Even when words sound similar (so that Russian and Ukrainian speakers can understand each other without a dictionary), they are often written differently. Additionally, Ukrainian seems to have alternating letters in the root of a word during inflection more often.

E. g.:

Case	Russian (cat)	Ukrainian (cat)	Russian (cats)	Ukrainian (cats)	Russian (book)	Ukrainian (book)
Nominative	Кот	Кіт	Коты	Коти	Книжка	Книжка
Genitive	Кота	Кота	Котов	Котів	Книжки	Книжки
Dative	Коту	Коту	Котам	Котам	Книжке	Книжці
Accusative	Кота	Кота	Котов	Котів	Книжку	Книжку
Instrumental	Котом	Котом	Котами	Котами	Книжкой (книжкою)	Книжкою
Locative	Коте	Коті	Котах	Котах	Книжке	Книжці

Tense/Person	Russian (to bring)	Ukrainian (to bring)
Infinitive	Нести	Нести
Present/I	Несу	Несу
Present/We	Несём	Несемо
Present/You	Несёшь	Несеш
Present/You (pl.)	Несёте	Несете
Present/He/She/It	Несёт	Несе
Present/They	Несут	Несуть
Past/He	Нёс	Ніс
Past/She	Несла	Несла
Past/It	Несло	Несло
Past/They	Несли	Несли
Future/I	Буду нести	Буду нести, нестиму
Future/We	Будем нести	Будемо нести, нестимемо
Future/You	Будешь нести	Будеш нести, нестимеш
Future/You (pl.)	Будете нести	Будете нести, нестимете
Future/He/She/It	Будет нести	Будете нести, нестиме
Future/They	Будут нести	Будуть нести, нестимуть
Imperative/You	Неси	Неси
Imperative/You (pl.)	Несите	Несіть

So, there will be a lot of misses. But in >80% of cases (I mean use cases, not grammatical cases) these misses won't wrongly merge different words (e.g... can't even find a sample). So, yes, it seems that Ukrainian would benefit from using the Russian analyzer (or Belorussian?). When I started to write this message I was unsure; now I'm sure. Some people may get confused («Why does searching for „коту“ find „кота“, but not „кіт“?»), but it's better than nothing.

In general, the "recommended" apostrophe for Ukrainian should probably be 02BC (because it is part of the word); 02BC is also the approved apostrophe character for Ukrainian in internationalized domain names. But the majority of Ukrainian texts out there use 0027 and (a bit less) 2019, and it will probably stay this way for a long time, as the majority of users have only ' on their keyboards (and some word processors may change it to 2019). I would say we do want to support 02BC the same way we do 0027 and 2019.

Also, although Ukrainian is close to Russian, it has quite a few specifics, as @Sasha1024 already pointed out: different inflections, and in fact more complicated inflections (with alternating letters in the stem), so using the Russian analyzer may produce poor results.
We have pretty good NLP for Ukrainian in LanguageTool (https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/uk), and we just added support for a lemmatizing stemmer for Ukrainian in Lucene (https://issues.apache.org/jira/browse/LUCENE-7287).
I am wondering whether we should look into creating a separate analyzer for Ukrainian.

In T146358#2711240, @Sasha1024 wrote:

I can't say whether it's good to use Russian lexical analyzer for Ukrainian, because I don't know what exactly it does.

The analyzers vary in their completeness and aggressiveness, but generally the idea is to reduce the word to a stem (sometimes a couple of different variant stems). Ideally, all related words would have the same stem, and all unrelated words would have different stems—though of course language is too messy for that.
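
As a rough illustration of the idea only: the toy stemmer below strips one ending from a small hand-picked list. It is not the algorithm used by the Russian analyzer, and the suffix list is an assumption chosen so that a few noun forms from the discussion collapse to one stem.

```python
# Toy ending list, hand-picked for this illustration; longest endings are tried first.
SUFFIXES = sorted(
    ["ами", "ов", "ам", "ом", "ах", "а", "у", "ы", "е", "и"],
    key=len,
    reverse=True,
)

def toy_stem(word: str, min_stem: int = 2) -> str:
    """Lowercase the word and strip at most one known ending."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for form in ["Кот", "Кота", "Коту", "Котами", "Котів"]:
    print(form, "->", toy_stem(form))
```

Forms whose endings are in the list share a stem and would match each other in search; a form with an ending the list does not know (such as the Ukrainian -ів) keeps its full spelling and is missed.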

What is about Belorussian? Does it use different lexical analyzer? Theoretically, it may fit Ukrainian even better than Russian.

Belarusian doesn't have a language-specific analyzer. It uses the Elasticsearch "default" analyzer, which is probably what most wikis should be using, rather than these fallbacks.

Elastic has analyzers for these languages: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai. All others should probably be using "default".

Thanks for the crash course in comparative East Slavic!

I ran the Russian and Ukrainian forms through the Russian analyzer to see what happens. Stemmed forms are below—and I numbered the stemmed forms of Нести for those whose Cyrillic pattern matching is not up to the task. I'm surprised the analyzer doesn't do better on the Russian verbs.

Case	Russian (cat)	Stemmed	Ukrainian (cat)	Stemmed	Russian (cats)	Stemmed	Ukrainian (cats)	Stemmed	Russian (book)	Stemmed	Ukrainian (book)	Stemmed
Nominative	Кот	кот	Кіт	кіт	Коты	кот	Коти	кот	Книжка	книжк	Книжка	книжк
Genitive	Кота	кот	Кота	кот	Котов	кот	Котів	котів	Книжки	книжк	Книжки	книжк
Dative	Коту	кот	Коту	кот	Котам	кот	Котам	кот	Книжке	книжк	Книжці	книжці
Accusative	Кота	кот	Кота	кот	Котов	кот	Котів	котів	Книжку	книжк	Книжку	книжк
Instrumental	Котом	кот	Котом	кот	Котами	кот	Котами	кот	Книжкой (книжкою)	книжк	Книжкою	книжк
Locative	Коте	кот	Коті	коті	Котах	кот	Котах	кот	Книжке	книжк	Книжці	книжці

Tense/Person	Russian (to bring)	Stemmed	Ukrainian (to bring)	Stemmed
Infinitive	Нести	(1) нест	Нести	(1) нест
Present/I	Несу	(2) нес	Несу	(2) нес
Present/We	Несём	(2) нес	Несемо	(7) несем
Present/You	Несёшь	(3) несеш	Несеш	(3) несеш
Present/You (pl.)	Несёте	(4) несет	Несете	(4) несет
Present/He/She/It	Несёт	(4) несет	Несе	(2) нес
Present/They	Несут	(5) несут	Несуть	(5) несут
Past/He	Нёс	(2) нес	Ніс	(8) ніс
Past/She	Несла	(6) несл	Несла	(6) несл
Past/It	Несло	(6) несл	Несло	(6) несл
Past/They	Несли	(6) несл	Несли	(6) несл
Future/I	Буду нести	(1) буд нест	Буду нести, нестиму	(1) буд нест, нестим
Future/We	Будем нести	(1) буд нест	Будемо нести, нестимемо	(1) будем нест, нестимем
Future/You	Будешь нести	(1) будеш нест	Будеш нести, нестимеш	(1) будеш нест, нестимеш
Future/You (pl.)	Будете нести	(1) будет нест	Будете нести, нестимете	(1) будет нест, нестимет
Future/He/She/It	Будет нести	(1) нест	Будете нести, нестиме	(1) будет нест, нестим
Future/They	Будут нести	(1) будут нест	Будуть нести, нестимуть	(1) будут нест, нестимут
Imperative/You	Неси	(2) нес	Неси	(2) нес
Imperative/You (pl.)	Несите	(2) нес	Несіть	(9) несіт
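
The numbering in the verb table can be cross-checked by grouping the forms by stem. The sketch below copies the single-word rows of the table (the two-word future forms are left out) and counts distinct stems per language; `group_by_stem` is just an illustrative helper.

```python
from collections import defaultdict

# (Russian form, stem, Ukrainian form, stem), copied from the table above.
ROWS = [
    ("Нести", "нест", "Нести", "нест"),
    ("Несу", "нес", "Несу", "нес"),
    ("Несём", "нес", "Несемо", "несем"),
    ("Несёшь", "несеш", "Несеш", "несеш"),
    ("Несёте", "несет", "Несете", "несет"),
    ("Несёт", "несет", "Несе", "нес"),
    ("Несут", "несут", "Несуть", "несут"),
    ("Нёс", "нес", "Ніс", "ніс"),
    ("Несла", "несл", "Несла", "несл"),
    ("Несло", "несл", "Несло", "несл"),
    ("Несли", "несл", "Несли", "несл"),
    ("Неси", "нес", "Неси", "нес"),
    ("Несите", "нес", "Несіть", "несіт"),
]

def group_by_stem(pairs):
    """Collect the surface forms that share each stem."""
    groups = defaultdict(list)
    for form, stem in pairs:
        groups[stem].append(form)
    return dict(groups)

ru = group_by_stem((r[0], r[1]) for r in ROWS)
uk = group_by_stem((r[2], r[3]) for r in ROWS)
print(len(ru), "Russian stems;", len(uk), "Ukrainian stems")
```

Fewer distinct stems means more forms of the same verb find each other. The Ukrainian forms end up in more groups than the Russian ones, but most of those groups are shared with Russian (stems 1 to 6), which matches the observation that the Russian analyzer is still of some use for Ukrainian.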

So, yes, it seems that Ukrainian would benefit from using Russian analyzer (or Belorussian?). When I started to write this message I was unsure, now I'm sure. Although some people may get confused: «Why searching for „коту“ will find „кота“, but won't find „кіт“?» — it's better than nothing.

I'm tending to agree with you, and since it's the status quo, we'll certainly leave it for now. Thanks for helping me get a much better understanding of the linguistic situation here!

In T146358#2712850, @dalekiy_obriy wrote:

In general, the "recommended" apostrophe for Ukrainian should probably be 02BC (because it is part of the word); 02BC is also the approved apostrophe character for Ukrainian in internationalized domain names. But the majority of Ukrainian texts out there use 0027 and (a bit less) 2019, and it will probably stay this way for a long time, as the majority of users have only ' on their keyboards (and some word processors may change it to 2019). I would say we do want to support 02BC the same way we do 0027 and 2019.

That's definitely the goal. And of course we can't dictate what people use on any of the wikis—that's up to the community to determine. My goal is to figure out what people actually do and make search do the best it can in that context.

Also although Ukrainian is close to Russian it has quite a bit of specifics, as @Sasha1024 already pointed out: different inflections, in fact inflections more complicated (with alternating letters in the stem) so using Russian analyzer may produce poor results.

Definitely—that's what I'd normally expect, even in closely related languages. What @Sasha1024 and I were discussing is whether or not the Russian analyzer could do anything useful at all. Surprisingly to both of us, it seems like it kind of does.

We have pretty good NLP for Ukrainian in LanguageTool (https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/uk), and we just added support for lemmatizing stemmer for Ukrainian in Lucene (https://issues.apache.org/jira/browse/LUCENE-7287).
I am wondering if we should take a look if we should create separate analyzer for Ukrainian.

Hey, that's cool! I'm not sure what our policy and procedure is for installing components from outside the Elasticsearch core, but I'll look into it.

In terms of immediate action, I'm going to stick to the scope of this task (apostrophe-like characters) and try to get that doing the right thing before the big re-index we have coming up, but I very much appreciate all the help and information. I have a much better grasp of our technical details (all those unexpected fallbacks!) and the linguistic details of Ukrainian.

As I understand it, once the next version of Lucene is released, Elasticsearch will have a Ukrainian analyzer available. Would we need to create another ticket here in Phabricator to switch to it for Ukrainian?

In T146358#2713159, @dalekiy_obriy wrote:

As I understand once the next version of Lucene is released the Elasticsearch will have Ukrainian analyzer accessible. Would we need to create another ticket here at phabricator to switch to it for Ukrainian?

That would be best. We'd have to update Elastic and Lucene to the relevant version, and then make sure the Ukrainian analyzer is available internally (some configs may need to be tweaked) and then enable it and re-index. That's not going to happen automatically—particularly the re-indexing—so a new ticket to enable the Ukrainian analyzer would be good.

(I'll be keeping an eye out for it, too—but a nudge never hurts.)

Yes, a ticket to track the status of https://github.com/elastic/elasticsearch/issues/19433 would be nice, I think.

Okay, I've opened the ticket to track the Ukrainian analyzer: T148051

Change 315837 had a related patch set uploaded (by Tjones):
Improve processing of the apostrophe by the search engine in Ukrainian

https://gerrit.wikimedia.org/r/315837

gerritbot added a project: Patch-For-Review. (Oct 13 2016, 9:49 PM)

TJones moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board. (Oct 14 2016, 3:51 PM)

Change 315837 merged by jenkins-bot:
Improve processing of the apostrophe by the search engine in Ukrainian

https://gerrit.wikimedia.org/r/315837

TJones moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board. (Oct 17 2016, 2:17 PM)

ReleaseTaggerBot added a project: MW-1.28-release (WMF-deploy-2016-10-25_(1.28.0-wmf.23)). (Oct 17 2016, 3:00 PM)

Deskana closed this task as Resolved. (Nov 17 2016, 9:46 PM)

TJones mentioned this in T160106: Test and analyze new Ukrainian language analyzers. (Mar 9 2017, 9:55 PM)
