
Improve processing of the apostrophe by the search engine in Ukrainian
Closed, ResolvedPublic

Description

There are three ways to represent Ukrainian apostrophe:

  1. U+0027 ' APOSTROPHE: looks poor typographically, but is ASCII-compatible.
  2. U+2019 ’ RIGHT SINGLE QUOTATION MARK
  3. U+02BC ʼ MODIFIER LETTER APOSTROPHE

The second and third look identical (according to the Unicode Code Charts); the difference is that the second is a punctuation mark, while the third is considered part of a word.

Today's tendency is to use the third one. It was chosen as the apostrophe character for Ukrainian IDNs, and it is the main apostrophe in the Ukrainian Unicode layout (the default keyboard layout for X.Org), where U+02BC now sits on the key that formerly produced U+2019.

Ukrainian Wikipedia mostly uses the first in text and titles for compatibility. It is possible to use the second in search (a full-text search for "м<U+2019>ясо" shows "м<U+0027>ясо" in the results, and quick search goes there immediately; maybe this is a result of T23002). It isn't possible to use the third in search yet.

Please make U+02BC act like U+2019 in search. (Currently U+2019 and U+02BC are "competing" characters for the typographically correct apostrophe: U+02BC was chosen by ICANN, but usage of U+2019 started earlier.)
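The request above amounts to folding the two typographic variants into the ASCII apostrophe before matching. A minimal Python sketch of that idea follows; the helper name `fold_apostrophes` is hypothetical and this is not how the search engine itself is implemented.

```python
# Hypothetical helper: fold the apostrophe variants named in the
# description into U+0027 so that all three spellings compare equal.
APOSTROPHE_VARIANTS = {
    "\u2019": "\u0027",  # RIGHT SINGLE QUOTATION MARK
    "\u02bc": "\u0027",  # MODIFIER LETTER APOSTROPHE
}
_FOLD_TABLE = str.maketrans(APOSTROPHE_VARIANTS)


def fold_apostrophes(text: str) -> str:
    """Return text with U+2019 and U+02BC replaced by U+0027."""
    return text.translate(_FOLD_TABLE)


if __name__ == "__main__":
    for query in ("м\u0027ясо", "м\u2019ясо", "м\u02bcясо"):
        print(repr(fold_apostrophes(query)))
```

With this folding applied to both the indexed text and the query, all three spellings of "м'ясо" would reduce to the same string.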

Event Timeline

debt triaged this task as Medium priority. Sep 22 2016, 10:04 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
debt added a subscriber: TJones.

I've discovered today that Ukrainian-language text is processed using the Russian-language analyzer. At the moment, that seems to mean that this change has to be made to the Russian-language analyzer, and that changes made to the Russian analyzer, such as T124592, will affect Ukrainian (and others). I've put together a list of the fallback languages specified in MediaWiki.

I'm going to need to think about this a bit more.

EDIT: It is possible to special-case bits of the analyzer by language, but it's moderately hacky.
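As a rough illustration of the situation described above, the sketch below resolves an analyzer through a fallback table and then layers a per-language special case on top. The table contents, the function names, and the `fold_apostrophes` label are all assumptions made for the example; they do not reflect the actual MediaWiki or CirrusSearch configuration.

```python
from typing import Dict, List, Set

# Assumed, illustrative data: only the Ukrainian -> Russian fallback
# discussed in this task is listed here.
FALLBACKS: Dict[str, str] = {"uk": "ru"}
LANGUAGE_ANALYZERS: Set[str] = {"ru", "en"}

# Hypothetical per-language special cases layered on the chosen analyzer.
PER_LANGUAGE_EXTRAS: Dict[str, List[str]] = {"uk": ["fold_apostrophes"]}


def pick_analyzer(lang: str) -> str:
    """Follow the fallback chain until a language with an analyzer is found."""
    seen = set()
    while lang not in LANGUAGE_ANALYZERS:
        if lang in seen or lang not in FALLBACKS:
            return "default"
        seen.add(lang)
        lang = FALLBACKS[lang]
    return lang


def analysis_plan(lang: str) -> dict:
    """Combine the resolved analyzer with any language-specific extras."""
    return {
        "analyzer": pick_analyzer(lang),
        "extras": PER_LANGUAGE_EXTRAS.get(lang, []),
    }


if __name__ == "__main__":
    print(analysis_plan("uk"))
    print(analysis_plan("ru"))
```

The point of the second table is that a change keyed on "uk" leaves every other language that resolves to the Russian analyzer untouched.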

@TJones, IMHO treating all three apostrophes (U+0027, U+2019, U+02BC) as one sign in search shouldn't harm Russian or any other language using the Russian-language analyzer.

  1. U+2019 and U+02BC look identical according to the Unicode code charts, and even for English there are (not so popular) opinions that U+2019 is the wrong choice and U+02BC should be used instead (for the apostrophe). U+0027 is a common backward-compatible, low-quality substitute for both U+2019 and U+02BC.
  2. The only remaining question is whether the apostrophe sign (U+0027, U+2019 or U+02BC) should be treated as a part-of-word character or as a punctuation character. My answer is: as a part-of-word character. Some Cyrillic-script languages (like Ukrainian) consider the apostrophe an important part-of-word sign with a specific meaning; some Cyrillic-script languages (like Russian) don't use the apostrophe in their own words, but sometimes use it in loanwords (like Jeanne d'Arc). In both cases it seems to be a part-of-word character¹, not a punctuation character. Cyrillic-script languages usually don't use single quotes¹, only double quotes or special ones (e.g. «»).

¹ Actually, this should be checked (that all the Russian-analyzer-dependent languages listed here use the apostrophe only as part of a word, not as punctuation, and that none of them uses single quotes). But I'm 90% sure that it's true. And, if I understand correctly, the canonical single-quote characters (U+0027, U+2019) were already set to be interpreted as part-of-word in the Russian analyzer by T23002 (maybe not always, maybe only when not next to a space). The fact is that characters that could at least sometimes mean a single quote are already treated as equivalent apostrophes, while the character that can never mean a single quote (U+02BC) isn't.

P.S.: Of course, there is no need to hurry. I'm not against any deep analysis and farsighted solutions. I just expressed my opinion.

@Sasha1024 , thanks for the feedback!

This particular case of the apostrophe-like characters isn't too bad, as you outlined, though I do worry about languages that might use the apostrophe as a proper letter; I know some do—though I don't know of any that also use Cyrillic.

I generally like to be conservative with changes, especially when affecting a number of languages that we haven't specifically investigated. I'm trying to work out a general framework for handling this kind of situation. I'm wondering, for example, whether T124592 should be special-cased just to Russian. I don't think it affects Ukrainian, but it may affect others.

So, taking it a little slow sounds good. But....

We have a bit of a deadline because this change requires the relevant wikis to be re-indexed. We don't do that often, though a re-index is coming up, related to the BM25 implementation (which is why I've recently been looking at changes that need a re-index). We'll be trying to do re-indexes more often in the future, but we have to see how this one goes before we have a good idea of how difficult it is. We can delay the re-indexing of some wikis while we work on patches that require re-indexing, but we can't delay forever.

@TJones

though I do worry about languages that might use the apostrophe as a proper letter

Sorry for the stupid question, but what is the difference between apostrophe-as-proper-letter and apostrophe-as-part-of-word-but-still-not-considered-a-letter? I understand the difference between apostrophe-as-part-of-word and apostrophe-as-punctuation-mark: in the first case "abc'def" gives one lexeme, and in the second it gives two lexemes ("abc" and "def"). But I don't understand what the practical difference is for search between something considered a letter and something considered part of a word (but still not a letter).

I'm wondering, for example, whether T124592 should be special-cased just to Russian.

IMO, you're right. All wikis say that Ё is used only in Russian, Belarusian and Rusyn, and enwiki specifies: "[unlike Russian] in Belarusian and Rusyn, the letters Е and Ё are separate and not interchangeable". I don't know the exact rules of Belarusian and Rusyn, though, so I can't confirm this personally. (Theoretically it may happen that, even though the formal rules of Belarusian and Rusyn forbid exchanging Е and Ё, the Belarusian and Rusyn wikis would still benefit from Е=Ё equivalence in search for some specific reasons, for example informal, erroneous or fast typing. I suppose that is unlikely, but I can't fully exclude the possibility.)

Thanks for your answer.

I would like to clarify what is needed here. What we need is to make sure that a user typing any of these three apostrophes will be able to find the article they need.

The current state is the following:

  • apostrophe U+0027 ( ' ) is used by default in Ukrainian projects, both in text and in page titles
  • apostrophe U+2019 ( ’ ) is supported since T23002
  • apostrophe U+02BC ( ʼ ) is not supported yet. What is requested is to add a support for it in the same way as it was done in T23002.

For example, let's pick a random name, e.g. Бустарв'єхо

  • Search for Бустарв'єхо finds the article and other search results
  • Search for Бустарв’єхо still finds the article despite the difference in apostrophes, owing to support previously added
  • Search for Бустарвʼєхо does not find anything. This should be fixed
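One way such folding is commonly expressed in Elasticsearch is a `mapping` character filter attached to a custom analyzer. The sketch below builds such a settings fragment as a Python dict. The names `apostrophe_folding` and `text_uk` are made up for illustration, the tokenizer and token filter choices are assumptions, and this is not the configuration that was eventually merged for this task.

```python
import json

# Illustrative settings fragment. "apostrophe_folding" and "text_uk" are
# hypothetical names; the mapping char filter rewrites U+2019 and U+02BC
# to U+0027 before tokenization.
ANALYSIS_SETTINGS = {
    "analysis": {
        "char_filter": {
            "apostrophe_folding": {
                "type": "mapping",
                "mappings": [
                    "\u2019 => \u0027",
                    "\u02bc => \u0027",
                ],
            }
        },
        "analyzer": {
            "text_uk": {
                "type": "custom",
                "char_filter": ["apostrophe_folding"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

if __name__ == "__main__":
    print(json.dumps(ANALYSIS_SETTINGS, ensure_ascii=True, indent=2))
```

Because the filter would run at both index and query time, Бустарв'єхо, Бустарв’єхо and Бустарвʼєхо would be expected to produce the same tokens under this sketch.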

@Sasha1024,

what is difference between apostrophe-as-proper-letter and apostrophe-as-part-of-word-but-still-not-considered-a-letter?

Sorry for the confusion. I was mostly thinking out loud and worrying about details while I didn't really understand the technical situation. The difference—if the relevant language analyzer made the distinction—might be between keeping the apostrophe in the word, dropping the apostrophe (but not splitting the word on it), and splitting words at apostrophes.
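The three behaviours just listed (keep the apostrophe, drop it without splitting, split on it) can be compared with a small Python sketch. The `tokenize` function and its `mode` parameter are hypothetical and only mimic what an analyzer might do.

```python
import re

APOSTROPHES = "'\u2019\u02bc"
# A letter that is not one of the apostrophe characters. U+02BC is itself
# classified as a letter by Unicode, so it has to be excluded explicitly.
_LETTER = rf"(?:(?![{APOSTROPHES}])[^\W\d_])"
_WORD_WITH_APOSTROPHES = re.compile(rf"{_LETTER}+(?:[{APOSTROPHES}]{_LETTER}+)*")
_WORD_WITHOUT_APOSTROPHES = re.compile(rf"{_LETTER}+")
_ANY_APOSTROPHE = re.compile(f"[{APOSTROPHES}]")


def tokenize(text: str, mode: str) -> list:
    """Tokenize text, treating apostrophes according to mode."""
    if mode == "keep":   # apostrophe stays inside the token
        return _WORD_WITH_APOSTROPHES.findall(text)
    if mode == "drop":   # apostrophe removed, word not split
        words = _WORD_WITH_APOSTROPHES.findall(text)
        return [_ANY_APOSTROPHE.sub("", w) for w in words]
    if mode == "split":  # apostrophe acts as a word boundary
        return _WORD_WITHOUT_APOSTROPHES.findall(text)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("keep", "drop", "split"):
        print(mode, tokenize("м'ясо", mode))
```

Under "keep" and "drop" the word stays a single lexeme; under "split" it becomes two, which is the case the earlier question described as apostrophe-as-punctuation.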

Given that the Russian analyzer is (or would be) used for Abkhazian, Avar, Bashkir, Buryat, Chechen, Crimean Tatar, Chuvash, Ingush, Komi-Permyak, Karachay-Balkar, Komi, Lak, Lezgian, Meadow Mari, Hill Mari, Erzya, Livvi, Ossetian, Sakha, Tatar, Tuvan, Udmurt, Ukrainian, and Kalmyk, I don't know off the top of my head whether any of those languages might react poorly to any particular method of dealing with apostrophes and apostrophe-like characters, so I worry.

It's going to take a while, but I plan to unravel the fallback mess, which I now understand was motivated by historical and geographical concerns, not linguistic ones. See T147959.

In the short term, I think I have a technical solution that will limit the Ukrainian changes to Ukrainian wikis. I need to run some tests and talk to other developers about the changes, but I hope to get it done soon, unless the result is a technical abomination.

@NickK,

Thanks for the clarification. The problem, which I hope I've made clear now, is that there is an unfortunate entanglement among linguistically unrelated languages that makes any changes have potentially far-reaching effects on wikis in other languages. So, I know what needs to be done for Ukrainian, but I wasn't sure how to do it in a clean way while being conservative in its potential effects on other languages.

Doing it right also means determining a general framework for splitting out these kinds of distinctions. The long-term plan is T147959, but I want to make this change for Ukrainian sooner rather than later if technically possible. I think it is, but I have to run some more tests.

@Sasha1024 & @NickK,

Do either of you (or anyone else reading along) have any strong feelings about the use of the Russian language analyzer for Ukrainian? At least Ukrainian and Russian are somewhat linguistically similar (unlike other fallback language pairs, like Wolof and French). At a guess, is it a net gain to treat Ukrainian words like Russian words (i.e., for determining root words based on the morphology) rather than treating Ukrainian words as plain strings?

As an example, if you apply English analysis to Spanish, you do okay on plurals (the singular of gatos is in fact gato), but you miss the connection between masculine and feminine adjectives (rojo and roja are not treated as variants of the same word), and different forms of verbs (hablo, hablas, hablamos) are also not connected. English analysis doesn't seem to do anything ridiculous to Spanish words in general, though it misses a lot. It's probably not worth doing in the absence of a Spanish analyzer, but it isn't horrible.
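To make the Spanish example concrete, here is a deliberately crude Python stand-in for "English analysis" that only strips a plural -s. It is a toy written for this illustration, not the real English analyzer, and the function name is hypothetical.

```python
def toy_english_stem(word: str) -> str:
    """Toy stand-in for English stemming: strip a single plural -s."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


if __name__ == "__main__":
    for group in (["gato", "gatos"], ["rojo", "roja"], ["hablo", "hablas", "hablamos"]):
        print({w: toy_english_stem(w) for w in group})
```

The plural pair collapses to one stem, while the gendered adjectives and the verb forms all stay distinct, which matches the "okay on plurals, misses the rest" pattern described above.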

(I'm also trying to figure out how best to ask this question, since we'll want to ask it a lot as we untangle the existing fallbacks. Avoiding linguistic terminology and examples in languages the person I'm asking likely doesn't know makes it very hard.)

Well, Russian, Ukrainian and Belarusian are quite close. (Polish is close too; some claim that it's even closer to Ukrainian than Russian is, but as Polish uses a different alphabet, it seems to be out of scope.) (BTW, what about Belarusian? Does it use a different lexical analyzer? Theoretically, it may fit Ukrainian even better than Russian does.)

I can't say whether it's good to use the Russian lexical analyzer for Ukrainian, because I don't know what exactly it does. Theoretically, if I ever wrote lexical analyzers for Ukrainian/Belarusian/Russian, I'd rather write one combined analyzer for all three (with the exact language specified as an option affecting some conditional switches in the program) than write each one from scratch. But if it's about just applying a purely Russian (not semi-universal) analyzer to Ukrainian... well, I don't know.

  1. Ukrainian has almost the same grammatical categories as Russian. I.e. Russian nouns have 6 grammatical cases, Ukrainian has the same 6 + 1 more; Russian nouns belong to 3 declensions, Ukrainian has the same 3 + 1 more. Russian verbs have 3 tenses, Ukrainian has the same 3 + 1 more.
  2. However, these same grammatical categories are often realized differently (in pronunciation and writing). Even when words sound similar (so that Russians and Ukrainians can understand each other without a dictionary), they are often written differently. Additionally, Ukrainian seems to have alternating letters in the root of a word during inflection more often.

    E.g.:

| Case | Russian (cat) | Ukrainian (cat) | Russian (cats) | Ukrainian (cats) | Russian (book) | Ukrainian (book) |
| --- | --- | --- | --- | --- | --- | --- |
| Nominative | Кот | Кіт | Коты | Коти | Книжка | Книжка |
| Genitive | Кота | Кота | Котов | Котів | Книжки | Книжки |
| Dative | Коту | Коту | Котам | Котам | Книжке | Книжці |
| Accusative | Кота | Кота | Котов | Котів | Книжку | Книжку |
| Instrumental | Котом | Котом | Котами | Котами | Книжкой (книжкою) | Книжкою |
| Locative | Коте | Коті | Котах | Котах | Книжке | Книжці |

| Tense/Preposition | Russian (to bring) | Ukrainian (to bring) |
| --- | --- | --- |
| Infinitive | Нести | Нести |
| Present/I | Несу | Несу |
| Present/We | Несём | Несемо |
| Present/You | Несёшь | Несеш |
| Present/You (pl.) | Несёте | Несете |
| Present/He/She/It | Несёт | Несе |
| Present/They | Несут | Несуть |
| Past/He | Нёс | Ніс |
| Past/She | Несла | Несла |
| Past/It | Несло | Несло |
| Past/They | Несли | Несли |
| Future/I | Буду нести | Буду нести, нестиму |
| Future/We | Будем нести | Будемо нести, нестимемо |
| Future/You | Будешь нести | Будеш нести, нестимеш |
| Future/You (pl.) | Будете нести | Будете нести, нестимете |
| Future/He/She/It | Будет нести | Будете нести, нестиме |
| Future/They | Будут нести | Будуть нести, нестимуть |
| Imperative/You | Неси | Неси |
| Imperative/You (pl.) | Несите | Несіть |

  3. So, there will be a lot of misses. But in >80% of cases (I mean use cases, not grammatical cases) these misses won't wrongly merge different words (I can't even find an example). So, yes, it seems that Ukrainian would benefit from using the Russian analyzer (or Belarusian?). When I started writing this message I was unsure; now I'm sure. Although some people may get confused («Why does searching for „коту“ find „кота“, but not „кіт“?»), it's better than nothing.

In general, the "recommended" apostrophe for Ukrainian should probably be 02BC (because it is part of the word); 02BC is also the approved apostrophe character for Ukrainian in internationalized domain names. But the majority of Ukrainian texts out there use 0027 and (a bit less often) 2019, and it will probably stay this way for a long time, as the majority of users have only ' on their keyboards (and some word processors may change it to 2019). I would say we do want to support 02BC the same way we do 0027 and 2019.

Also, although Ukrainian is close to Russian, it has quite a few specifics, as @Sasha1024 already pointed out: different inflections, and in fact more complicated inflections (with alternating letters in the stem), so using the Russian analyzer may produce poor results.
We have pretty good NLP for Ukrainian in LanguageTool (https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/uk), and we just added support for a lemmatizing stemmer for Ukrainian in Lucene (https://issues.apache.org/jira/browse/LUCENE-7287).
I am wondering whether we should look into creating a separate analyzer for Ukrainian.

I can't say whether it's good to use Russian lexical analyzer for Ukrainian, because I don't know what exactly it does.

The analyzers vary in their completeness and aggressiveness, but generally the idea is to reduce the word to a stem (sometimes a couple of different variant stems). Ideally, all related words would have the same stem, and all unrelated words would have different stems—though of course language is too messy for that.

What is about Belorussian? Does it use different lexical analyzer? Theoretically, it may fit Ukrainian even better than Russian.

Belarusian doesn't have a language-specific analyzer. It uses the Elasticsearch "default" analyzer, which is probably what most wikis should be using, rather than these fallbacks.

Elastic has analyzers for these languages: Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Turkish, Thai. All others should probably be using "default".

Thanks for the crash course in comparative East Slavic!

I ran the Russian and Ukrainian forms through the Russian analyzer to see what happens. Stemmed forms are below—and I numbered the stemmed forms of Нести for those whose Cyrillic pattern matching is not up to the task. I'm surprised the analyzer doesn't do better on the Russian verbs.

| Case | Russian (cat) | Stemmed | Ukrainian (cat) | Stemmed | Russian (cats) | Stemmed | Ukrainian (cats) | Stemmed | Russian (book) | Stemmed | Ukrainian (book) | Stemmed |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nominative | Кот | кот | Кіт | кіт | Коты | кот | Коти | кот | Книжка | книжк | Книжка | книжк |
| Genitive | Кота | кот | Кота | кот | Котов | кот | Котів | котів | Книжки | книжк | Книжки | книжк |
| Dative | Коту | кот | Коту | кот | Котам | кот | Котам | кот | Книжке | книжк | Книжці | книжці |
| Accusative | Кота | кот | Кота | кот | Котов | кот | Котів | котів | Книжку | книжк | Книжку | книжк |
| Instrumental | Котом | кот | Котом | кот | Котами | кот | Котами | кот | Книжкой (книжкою) | книжк | Книжкою | книжк |
| Locative | Коте | кот | Коті | коті | Котах | кот | Котах | кот | Книжке | книжк | Книжці | книжці |

| Tense/Preposition | Russian (to bring) | Stemmed | Ukrainian (to bring) | Stemmed |
| --- | --- | --- | --- | --- |
| Infinitive | Нести | (1) нест | Нести | (1) нест |
| Present/I | Несу | (2) нес | Несу | (2) нес |
| Present/We | Несём | (2) нес | Несемо | (7) несем |
| Present/You | Несёшь | (3) несеш | Несеш | (3) несеш |
| Present/You (pl.) | Несёте | (4) несет | Несете | (4) несет |
| Present/He/She/It | Несёт | (4) несет | Несе | (2) нес |
| Present/They | Несут | (5) несут | Несуть | (5) несут |
| Past/He | Нёс | (2) нес | Ніс | (8) ніс |
| Past/She | Несла | (6) несл | Несла | (6) несл |
| Past/It | Несло | (6) несл | Несло | (6) несл |
| Past/They | Несли | (6) несл | Несли | (6) несл |
| Future/I | Буду нести | (1) буд нест | Буду нести, нестиму | (1) буд нест, нестим |
| Future/We | Будем нести | (1) буд нест | Будемо нести, нестимемо | (1) будем нест, нестимем |
| Future/You | Будешь нести | (1) будеш нест | Будеш нести, нестимеш | (1) будеш нест, нестимеш |
| Future/You (pl.) | Будете нести | (1) будет нест | Будете нести, нестимете | (1) будет нест, нестимет |
| Future/He/She/It | Будет нести | (1) нест | Будете нести, нестиме | (1) будет нест, нестим |
| Future/They | Будут нести | (1) будут нест | Будуть нести, нестимуть | (1) будут нест, нестимут |
| Imperative/You | Неси | (2) нес | Неси | (2) нес |
| Imperative/You (pl.) | Несите | (2) нес | Несіть | (9) несіт |
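To see what these stems mean for search, the sketch below groups the Ukrainian "cat" forms by the stems reported in the table above. The stem values are copied from that table; the helper names are hypothetical, and the code does not run the analyzer itself.

```python
from collections import defaultdict

# Stems for the Ukrainian "cat" forms, copied from the table above.
UK_CAT_STEMS = {
    "кіт": "кіт", "кота": "кот", "коту": "кот", "котом": "кот", "коті": "коті",
    "коти": "кот", "котів": "котів", "котам": "кот", "котами": "кот", "котах": "кот",
}


def group_by_stem(stems: dict) -> dict:
    """Map each stem to the set of surface forms that share it."""
    groups = defaultdict(set)
    for form, stem in stems.items():
        groups[stem].add(form)
    return dict(groups)


def matches(query: str, form: str, stems: dict) -> bool:
    """Two forms match in search only if their stems are identical."""
    return stems[query] == stems[form]


if __name__ == "__main__":
    for stem, forms in group_by_stem(UK_CAT_STEMS).items():
        print(stem, sorted(forms))
    print(matches("коту", "кота", UK_CAT_STEMS))
    print(matches("коту", "кіт", UK_CAT_STEMS))
```

Most forms land in the "кот" group, while "кіт", "коті" and "котів" each end up alone, so a search for one of those would miss the rest.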

So, yes, it seems that Ukrainian would benefit from using Russian analyzer (or Belorussian?). When I started to write this message I was unsure, now I'm sure. Although some people may get confused: «Why searching for „коту“ will find „кота“, but won't find „кіт“?» — it's better than nothing.

I'm tending to agree with you, and since it's the status quo, we'll certainly leave it for now. Thanks for helping me get a much better understanding of the linguistic situation here!

In general "recommended" apostrophe for Ukrainian probably should be 02BC (due to it being part of the word), also 02BC is approved apostrophe character for Ukrainian in internationalized domain names. But majority of the Ukrainian texts out there are using 0027 and (a bit less) 2019, and it probably will stay this way for long time as majority of the users will have only ' on their keyboards (and some word processors may change it to 2019). I would say we do want to support 02BC same way we do for 0027 and 2019.

That's definitely the goal. And of course we can't dictate what people use on any of the wikis—that's up to the community to determine. My goal is to figure out what people actually do and make search do the best it can in that context.

Also although Ukrainian is close to Russian it has quite a bit of specifics, as @Sasha1024 already pointed out: different inflections, in fact inflections more complicated (with alternating letters in the stem) so using Russian analyzer may produce poor results.

Definitely—that's what I'd normally expect, even in closely related languages. What @Sasha1024 and I were discussing is whether or not the Russian analyzer could do anything useful at all. Surprisingly to both of us, it seems like it kind of does.

We have pretty good NLP for Ukrainian in LanguageTool (https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/uk), and we just added support for lemmatizing stemmer for Ukrainian in Lucene (https://issues.apache.org/jira/browse/LUCENE-7287).
I am wondering if we should take a look if we should create separate analyzer for Ukrainian.

Hey, that's cool! I'm not sure what our policy and procedure is for installing components from outside the Elasticsearch core, but I'll look into it.

In terms of immediate action, I'm going to stick to the scope of this task (apostrophe-like characters) and try to get that doing the right thing before the big re-index we have coming up, but I very much appreciate all the help and information. I have a much better grasp of our technical details (all those unexpected fallbacks!) and the linguistic details of Ukrainian.

As I understand it, once the next version of Lucene is released, Elasticsearch will have a Ukrainian analyzer available. Would we need to create another ticket here in Phabricator to switch to it for Ukrainian?

As I understand once the next version of Lucene is released the Elasticsearch will have Ukrainian analyzer accessible. Would we need to create another ticket here at phabricator to switch to it for Ukrainian?

That would be best. We'd have to update Elastic and Lucene to the relevant version, and then make sure the Ukrainian analyzer is available internally (some configs may need to be tweaked) and then enable it and re-index. That's not going to happen automatically—particularly the re-indexing—so a new ticket to enable the Ukrainian analyzer would be good.

(I'll be keeping an eye out for it, too—but a nudge never hurts.)

Okay, I've opened the ticket to track the Ukrainian analyzer: T148051

Change 315837 had a related patch set uploaded (by Tjones):
Improve processing of the apostrophe by the search engine in Ukrainian

https://gerrit.wikimedia.org/r/315837

Change 315837 merged by jenkins-bot:
Improve processing of the apostrophe by the search engine in Ukrainian

https://gerrit.wikimedia.org/r/315837