Test and analyze new Ukrainian language analyzers
Open, Normal, Public

Description

After the research in T160105 has found some analyzers for Ukrainian that are potentially better, we will test them, and analyze to see if they are better or not. If they are, we will file a task to deploy one of them.

If we do decide to deploy one, then back out whatever parts of the Ukrainian/Russian hacks in T146358 that aren't needed anymore.

TJones edited the task description. Mar 9 2017, 9:55 PM
TJones claimed this task. Mar 10 2017, 3:05 PM
TJones moved this task from Backlog to In progress on the Discovery-Search (Current work) board.

Should perhaps wait for ES5 in vagrant for proper testing. I can start the baseline/prod data gathering now, and I can probably use the pre-ES5 version on vagrant to test my code.

Hi, I am author and maintainer of the Ukrainian dictionary that is used in Lucene's Ukrainian analyzer, and I'd like to note two things:

  1. this analyzer is very close to the Polish one - both use a dictionary in morfologik format (and both of them are used for grammar checking in LanguageTool), so if Polish worked I have high hopes for the Ukrainian one as well
  2. I'd like to hear about any problems that may arise from using this analyzer; hopefully we can address most of them (though as I understand it, if we fix them in Lucene it may take a while to get the fixes here)

Hi @dalekiy_obriy—that's great to hear! I will definitely let you know of
any problems we run into. Thanks for the offer to help!

I'm asking @dalekiy_obriy, @Sasha1024 (thanks for help on T146358), @Smalyshev, and anyone else who is interested to please help with reviewing changes we're considering making to the way Ukrainian text is indexed for searching on Ukrainian-language wiki projects. The goal is for different forms of a word to be indexed together, so that searching for one will find the others, such as вертоліт, вертольота, and вертольотів. The process isn't perfect, so there are always errors, but the goal is for the good to largely outweigh the bad.

In the past we discovered that the Russian language processing was being applied to Ukrainian projects. After some analysis, we decided to leave it, because it helped some. Now that there is a proper Ukrainian analyzer available, we want to switch to it.

The most important thing to do is check that words are being grouped together correctly. A random sample of 100 groups has been posted here. The numbers before each word indicate how many times that exact word occurred in a corpus of over 1 million words. If you could review them and let me know of any that look bad, that would be great!
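As a rough illustration of what a group and its counts look like, here is a minimal Python sketch. The `TOY_LEMMAS` table and the `format_groups` helper are hypothetical stand-ins for illustration only; the real analyzer uses the Morfologik dictionary, not a hand-written table.

```python
from collections import Counter, defaultdict

# Hypothetical toy dictionary: lowercased surface form -> lemma.
TOY_LEMMAS = {
    "вертоліт": "вертоліт",
    "вертольота": "вертоліт",
    "вертольотів": "вертоліт",
}

def format_groups(tokens):
    """Return one string per lemma group, e.g. '[3 вертоліт][1 вертольота]'."""
    counts = Counter(tokens)  # how often each exact form occurs
    groups = defaultdict(list)
    for form in counts:
        key = form.lower()
        # Forms missing from the toy dictionary fall back to themselves.
        groups[TOY_LEMMAS.get(key, key)].append(form)
    lines = {}
    for lemma, forms in groups.items():
        # Most frequent form first, ties broken by code point order.
        forms.sort(key=lambda f: (-counts[f], f))
        lines[lemma] = "".join(f"[{counts[f]} {f}]" for f in forms)
    return lines

corpus = ["вертоліт", "вертольота", "вертоліт", "вертольотів", "вертоліт"]
lines = format_groups(corpus)
print(lines["вертоліт"])
```

Searching for any one of the three forms would then match documents containing the others, because all three are indexed under the same key.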

There are also a few groupings that include words that don't have any beginning or ending letters in common. This could be an error, or it could be something like English good/better/best or be/is/am/was where the normal patterns don't hold. There are 8 groups like this. I've checked them and they seem reasonable to me, but if you could review them, that would be very helpful, too! They are here.

Just below that is a section on "Large groupings", which only has one group: forms of мати, which I understand is both "mother" and "to have"—which leads to a big weird overlapping group. Could you take a look at that, too, and see if there's anything in there that isn't a form of мати?

Finally, least important but still helpful, there are some groups that have had words removed from them. I expect these to be words that are related to each other in Russian, but not in Ukrainian. That could be because they are actually Russian words, or because they are words in Ukrainian (or possibly other languages) that happen to look like they are related according to Russian grammar rules. As a similar kind of example, consider "chaos" and the name "Chao" in English—if you didn't know any better, "chaos" looks like it could be the plural of "chao", but it isn't. We want to make sure these are mostly Russian, and not some widespread error on Ukrainian words. There is a sample of 50 words and the group they were removed from, posted here.

Feel free to look over the rest of the analysis and ask any question or make any suggestions!

Piramidion added a comment. Edited Mar 31 2017, 1:22 AM

As for the first sample of 100 groups, it looks OK. Some of the groups seem to contain a mix of abbreviations and regular words (like [19 ПАР][2 Пара][1 Пари]... where the first one seems to be an abbreviation for the Republic of South Africa), but generally there are no errors.

The second group of irregular words is fine too, and the explanations attached to them are correct.

As for the large grouping, the word "мало" may represent two meanings: a form of the verb "to have" (it had) or a separate word meaning "few"; also "мала" may mean either "[she] had" or "[she is] small". The first has a stress on the first syllable ("ма́ла"), the second – on the second syllable ("мала́"), so "мала́" is definitely a false positive here. But that's not too relevant (usually we don't use stresses).

I don't know what to say about the last group. In most of the cases the words on the left contain some mistakes. There are a lot of proper nouns like Дубович (right column), with some mistakenly uncapitalized forms of them appearing on the left column (дубовича). I don't know if this is an intended behavior or this is an error of some kind, but there are a lot of these. There is also a suspicious case of [1 інвентаря] ← [1 інвентар][1 інвентарю] – both «інвентаря» and «інвентарю» are genitive forms of «інвентар», but represent a slightly different meaning (this might be a dictionary problem).

As for the first sample of 100 groups, it looks good, I would say it's almost perfect (if I say so myself :)). I agree with @Piramidion here that the only flaw is merging abbreviations with normal words. The reason for this is that the common approach in Lucene analyzers is to lowercase the text first and then do the stemming, so we can't use the case as a help. We actually experimented with lemmatizing first and then converting to lowercase, but this approach has lots of limitations and is not acceptable.
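To illustrate the trade-off, here is a toy Python sketch. The lemma table and both function names are hypothetical, and the "all caps means acronym" rule is just one naive case-sensitive alternative, not what was actually tried in Lucene.

```python
# Hypothetical toy dictionary: lowercased surface form -> lemma.
TOY_LEMMAS = {"пар": "пара", "пара": "пара", "пари": "пара"}

def lemma_lowercase_first(token):
    """Lowercase, then look up: the usual order in Lucene analyzers."""
    lowered = token.lower()
    return TOY_LEMMAS.get(lowered, lowered)

def lemma_case_sensitive(token):
    """Naive alternative: leave all-caps tokens alone as presumed acronyms."""
    if len(token) > 1 and token.isupper():
        return token
    return lemma_lowercase_first(token)

print(lemma_lowercase_first("ПАР"))   # acronym merges with the ordinary word
print(lemma_case_sensitive("ПАР"))    # acronym is kept apart
print(lemma_case_sensitive("ПАРА"))   # but an ordinary word in all caps is missed
```

The last call shows one of the limitations: an ordinary word written in capitals, as in a heading, would no longer be grouped with its other forms.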

The second group of irregular words is good and shows the best part of a dictionary-based lemmatizer - it can provide the correct lemma even if there are changes in the beginning or middle of the word (which are more common in Ukrainian than in Russian). The only non-perfect case here is "стели"/"шлем" - they "fold" into the same lemma-homonym "слати" (the first word from the meaning "to lay down", the second from "to send").

For the large group with "мати" - it's correct, though we could probably help this case by adding most of the verb forms of "мати" to our stopwords.

The last group is a mixed bag, here are main points:

  1. many common nouns were Russian (анализ, гостинец, пород, майоров...)
  2. some words are typos or mistakes (e.g. воротаре is used twice in uk.wikipedia.org: once as a typo for "воротарем", and once as a wrong vocative form that should have been "воротарю")
  3. words like "Гідрохлоротіазид" show the weakest side of the dictionary-based lemmatizer: if the word is not there, lemmatizing does not work; the dictionary is now getting close to 300K lemmas, so even though it can't contain all words it should be good enough for most cases
  4. "інвентаря" is an interesting case, theoretically the dictionary we found we can trust the most (http://www.mova.info/grmasl.aspx) says only "інвентарю" is correct, but some other dictionaries provide second meaning with "інвентаря"; our dictionary has a base of "grammatically correct words" but we add many "not-so-correct" words that are actually used out there so we can add this one too
  5. the last problem is related to the dictionary - we just found out that there is a problem lemmatizing some of the proper nouns with Ukrainian analyzer in Lucene, we've prepared the fix and I'm planning to create a merge request into Lucene early next week (both 7.x and 6.x branches); if it helps the testing I could provide new dictionary files

As for the stress character - people use stress marks very infrequently in real texts, so we can't rely on them in the analysis and thus we just ignore them.

As for your note about "zero-width non-breaking spaces (U+FEFF), soft hyphens (U+00AD), and left-to-right/right-to-left markers (U+200E / U+200F)" - I can definitely add them to ignored characters in Ukrainian analyzer in Lucene but I am not sure how easily you can pull changes from new versions (unless you can recompile lucene from github).
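For reference, ignoring those four characters amounts to something like the following Python sketch. The function name is hypothetical and this only illustrates the idea; it is not how the Lucene analyzer would implement it.

```python
# The four invisible characters mentioned above.
INVISIBLE = (
    "\ufeff"  # zero-width non-breaking space
    "\u00ad"  # soft hyphen
    "\u200e"  # left-to-right mark
    "\u200f"  # right-to-left mark
)

# Translation table that deletes every character listed in INVISIBLE.
STRIP_TABLE = str.maketrans("", "", INVISIBLE)

def strip_invisible(text):
    """Remove the invisible characters so they cannot split or alter tokens."""
    return text.translate(STRIP_TABLE)

cleaned = strip_invisible("вер\u00adто\u200eліт\ufeff")
print(cleaned == "вертоліт")
```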

As for configurability - I am not an expert on analyzers in Lucene but if exposing some internals can be done and you can point to other analyzers that expose things that help, I can adjust Ukrainian analyzer the same way.

P.S. if you ever need to check word inflections for Ukrainian I'd suggest http://lcorp.ulif.org.ua/dictua/ - it has some mistakes here and there but in general it's the largest free inflection dictionary (UI is kinda clunky and clicking on hyperlinks may not work correctly on some platforms, e.g. Firefox on Linux).

Not much for me to add here, agree with the above. Only one note about "unexpected" part - in Ukrainian, it is common to have о/в or у/в switches in the forms of the same word (happens in Russian too but more common in Ukrainian AFAIK) so do not be alarmed by those, they are normal :)

Thanks, @Piramidion!

Some of the groups seem to contain a mix of abbreviations

Acronyms that look like words are always an issue. Most analyzers lowercase everything anyway, and of course words can be in all caps even when they aren't acronyms, so I expect that kind of thing to happen. (Oh, I see that @dalekiy_obriy said pretty much the same thing.)

so "мала́" is definitely a false positive here

Like lowercasing, the stress marks are stripped before analysis, so even if they are useful, they are ignored. I've read they are generally only used as a pronunciation guide, and we went out of our way to ignore them in T146358. (That also got accidentally reverted by the recent upgrade to Elasticsearch 5, though enabling this analyzer would fix it.) Is it rare to have words that differ only by stress?
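A minimal Python sketch of what stripping the stress mark means, assuming the stress is written as the combining acute accent (U+0301) after the vowel; `strip_stress` is a hypothetical name, not the code from T146358.

```python
COMBINING_ACUTE = "\u0301"  # assumed encoding of the Ukrainian stress mark

def strip_stress(text):
    """Drop only the combining acute accent, leaving all letters intact."""
    return text.replace(COMBINING_ACUTE, "")

first = strip_stress("ма\u0301ла")   # stress on the first syllable
second = strip_stress("мала\u0301")  # stress on the second syllable
print(first == second == "мала")
```

The sketch removes just that one code point on purpose. Decomposing the text and dropping every combining mark would also change letters such as й and ї, which decompose into a base letter plus a combining mark.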

There are a lot of proper nouns like Дубович (right column), with some mistakenly uncapitalized forms of them appearing on the left column (дубовича).

Okay, that's my fault. Looking for these kinds of splits is a new element of my analysis, and I must have lowercased it. Since most words are not names and I don't read Ukrainian, I didn't notice the names had been lowercased. I checked, and the original text was Дубовича, not дубовича.

Still sounds like a problem with forms of names not being stemmed together.

інвентаря

Yeah, that looks like it might be an error, but of course there are going to be errors here and there. Overall, it seems like a net improvement!

Thanks for the insider view, @dalekiy_obriy! Most of the points you bring up are the kind of thing I expect. It's not perfect, but language is so incredibly messy in general that a little mess around the edges is expected.

words like "Гідрохлоротіазид" show the weakest side of the dictionary-based lemmatizer

On the other hand, statistical stemmers/lemmatizers have their own problems. You mentioned the Polish analyzer before—it has worse problems I think, because the statistics can go crazy. Personally, I prefer obvious false negatives to horrible false positives. In some cases a rule-based fallback can help, but it depends on how ambiguous things are and how hard it is to tell what category something is. (And when it doesn't work, you get chaos/Chao.)

if it helps the testing I could provide new dictionary files

Overall, things are looking very good, so I don't think it's a show stopper to have a few mistakes here and there (others should feel free to jump in and disagree about the severity of any problem, though!). I don't want to get into a situation where we make weird and unexpected maintenance problems for ourselves. Having a custom compiled version could lead to such problems.

So far the current version seems much better than what we have and future improvements would only make it better.

I am not an expert on analyzers in Lucene but if exposing some internals can be done and you can point to other analyzers that expose things that help, I can adjust Ukrainian analyzer the same way.

I'm not an expert on developing Elasticsearch plugins either. But it seems that the Polish analyzer is a decent model, since it uses Morfologik, too. If you look at the GitHub repo for the Stempel Polish plugin, plugin/analysis/stempel/AnalysisStempelPlugin.java defines "polish_stem", which is the core of what it's doing. This allows users to "unpack" the analyzer and use the core functionality in custom ways. (Elastic gives the config to unpack each of their core analyzers here. I haven't tested in ES5; in earlier versions there were some minor differences after unpacking, but they could be fixed.)

The Japanese Kuromoji plugin also exposes its stopword list, if that model helps.

So, as a user of the plugin, being able to unpack and customize the analyzer is awesome.

BTW, I feel like all tokenizers should be smarter about zero-width non-breaking spaces and the left-to-right marks, but it's not usually a common problem, and there may be use cases where stripping them could be bad. That's why unpacking the analyzer and customizing it is nice. I get all the benefit of the language-specific lemmatizer and stop word list, but I can customize character filters and other stuff to solve my specific problems.
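To make "unpacking" concrete, here is a sketch of what such index settings could look like, written as a Python dict that serializes to the JSON Elasticsearch expects. The names `strip_invisible` and `polish_unpacked` are hypothetical, `polish_stem` is the filter mentioned above, and the whole thing is an untested assumption rather than a config verified against ES5 or an exact equivalent of the built-in Polish analyzer.

```python
import json

# Sketch of custom analysis settings (assumed layout, not a tested config).
settings = {
    "analysis": {
        "char_filter": {
            # Hypothetical name; a "mapping" char filter that maps each
            # invisible character to nothing before tokenization.
            "strip_invisible": {
                "type": "mapping",
                "mappings": ["\\uFEFF=>", "\\u00AD=>", "\\u200E=>", "\\u200F=>"],
            }
        },
        "analyzer": {
            # Hypothetical name for the unpacked, customized analyzer.
            "polish_unpacked": {
                "type": "custom",
                "char_filter": ["strip_invisible"],
                "tokenizer": "standard",
                "filter": ["lowercase", "polish_stem"],
            }
        },
    }
}

print(json.dumps(settings, indent=2))
```

If the Ukrainian plugin exposed its lemmatizer as a token filter in the same way, that filter could take the place of `polish_stem` here.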

Thanks, @Smalyshev! Good to know about the regular alternations. The bar for "unexpected" is very, very low. It's just one way to find potential weird stuff. Polish had a bunch, but Ukrainian doesn't have any!

Is it rare to have words that differ only by stress?

I'd say it is for Ukrainian. I didn't find too many pairs. Besides, we seldom use stresses: the meaning of such words is understood from the context, and without the stress marks you won't be able to programmatically tell the difference.

Ata added a subscriber: Ata. Fri, Mar 31, 7:02 PM
Ata added a comment. Sat, Apr 1, 3:56 PM

Just to add a few trifles:
Random sample of groupings: Живите, живила, живить are forms of the verb "to nourish, to feed for living", while живим is a form of the adjective "alive" (btw, живимо would be a verb form as well). These are just different parts of speech, so this is what was expected, right?
Unexpected groupings: Стели, стели, стелю vs шлем -- I don't get, why шлем is in this group.
Stemming splits: (beyond my understanding right now, sorry).

Sasha1024 added a comment. Edited Sat, Apr 1, 4:36 PM

@Ata

  • стели is: (1) singular genitive of сте́ла ("stele"), (2) imperative mood of стели́ти ("to lay down");
  • стелю is: (1) first person singular of стели́ти ("to lay down"), (2) singular dative of сте́ля ("ceiling");
  • слати is: (1) "to send", (2) alternative (archaic?) form for стели́ти ("to lay down");
  • шлем is: (1) alternative of шлемо́, which is first person plural of сла́ти ("to send"); (2) colloquial for шоло́м ("helmet").

That's why all forms of сте́ла ("stele"), стели́ти ("to lay down"), сте́ля ("ceiling"), шлю ("to send") and шоло́м ("helmet") may be grouped together. It looks weird (at first impression) — but AFAIK we have no other way of doing this at this technology level (without context-dependent analysis).
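The merging effect can be sketched in a few lines of Python. The `ANALYSES` table is a simplified, hypothetical reading of the breakdown above (in particular, listing "слати" as a lemma for стели and стелю is an assumption based on the homonym mentioned earlier in this thread), and the union-find grouping is only an illustration, not how Lucene builds its index.

```python
from collections import defaultdict

# Hypothetical, simplified table: surface form -> lemmas it may belong to.
ANALYSES = {
    "стели": {"стела", "слати"},
    "стелю": {"слати", "стеля"},
    "шлем": {"слати", "шолом"},
    "вертоліт": {"вертоліт"},
    "вертольота": {"вертоліт"},
}

def merge_groups(analyses):
    """Group forms that are connected through any shared lemma."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every form to each of its candidate lemmas.
    for form, lemmas in analyses.items():
        for lemma in lemmas:
            union(("form", form), ("lemma", lemma))

    groups = defaultdict(set)
    for form in analyses:
        groups[find(("form", form))].add(form)
    return list(groups.values())

for group in merge_groups(ANALYSES):
    print(sorted(group))
```

Because стели, стелю and шлем each have at least one reading that leads to the same lemma, they end up in one group even though no single meaning covers all three.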

Piramidion added a comment. Edited Sat, Apr 1, 4:46 PM

I don't get, why шлем is in this group.

It's a (poetic) form of "шлемо", can be found here under "слати". The same is generally true for all or nearly all the other words ending in -мо: пам'ятаємо/пам'ятаєм, пронизуємо/пронизуєм etc. (took too long for me to write this)

TJones added a comment. Mon, Apr 3, 3:04 PM

Thanks for the comments, everyone, especially @Sasha1024's nice breakdown of the стели group.

AFAIK we have no other way of doing this at this technology level (without context-dependent analysis).

Exactly: there's no way to deal with overlaps between forms of words without parsing or some other context-dependent analysis. It happens in every language. Among the worst in English are does as the plural of doe ("a female deer") and can as both a verb and a noun.

Unless it causes major problems, you just live with it. The only other straightforward option is to split the grouping and give the ambiguous form to one group or the other. In English, can and does are usually treated as verbs (and stop words), while cans and doe are clearly nouns.

Thanks for all the feedback, everyone. This has been a very productive discussion!

More comments still welcome, but I think we can move ahead with deployment!

Change 346168 had a related patch set uploaded (by Tjones):
[mediawiki/extensions/CirrusSearch@master] Enable Ukrainian Elastic/Morfologik Language Analyzer

https://gerrit.wikimedia.org/r/346168

Change 346168 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Enable Ukrainian Elastic/Morfologik Language Analyzer

https://gerrit.wikimedia.org/r/346168