
Crimean Tatar transliteration has trouble with ё, ь, э, ю
Closed, Resolved (Public)

Description

The transliteration script does not work correctly. It often ignores Cyrillic letters ё, ь, э, ю. You can see a sample here https://crh.wikipedia.org/wiki/Qullan%C4%B1c%C4%B1:Don_Alessandro/Translit/Sample

Event Timeline

@Framawiki: Removing too broad MediaWiki-General as this likely is MediaWiki-Language-converter. If it is not (e.g. other custom local method used that we are not aware of) it might be MediaWiki-Internationalization.

Now transliteration script does not work correctly.

@DonAlessandro: Hi, can you please provide steps that allow someone else to reproduce the problem? See https://mediawiki.org/wiki/How_to_report_a_bug - thanks!

I do understand what @DonAlessandro is getting at. When you use the automatic Latin-to-Cyrillic transliteration on this page (Qırımtatar sürgünligi), you get the transliteration under "What we have now", while the correct transliteration is under "What we are to have", with the diffs/errors bolded.

It would be possible to group the words that get errors into a more focused list, but you can see what's going on.

I noticed this pattern before, in my initial analysis:

Patterns of Errors: Of the 704 types that had transliteration errors, most of them involve ю/у, ё/о, э/е, or ь. A smaller number involve ц/тс, щ/шч, and ъ/ь. Clearly those are the hard letters to transliterate to.

The most frequent error substitutions were:

Freq    Transliterated    Parallel
1109    ю                 у
159     ё                 о
146     ь                 Yüz
115     тюь               ту
67      юшю               ушу
55      э                 е
28      ьдю               ду
24      ютю               уту
15      юрлю              урлу
12      ёзю               озу
10      юкю               уку

We corrected the most common individual words, but we didn’t come up with any new patterns indicating where the transliteration was predictably wrong.

  • ю: Looking at the code I see that, other than specific exceptions, only yu and yü are mapped to ю, so words like sürgünligi get mapped to сургунлиги, not сюргюнлиги.
  • ё: Similarly, only yo and yö are mapped to ё, so prodrazvörstka gets mapped to продразворстка, not продразвёрстка.
  • э: The only rule that maps e to э applies when e follows another vowel, so the many words where e is the first letter, like etti, get mapped to етти, not этти.
  • ь is harder, since it seems that it is generally not represented in the Latin script. There are a lot of rules that insert it: ONьC (where O is ö or ü; N is one of ç, n, r, s, t, z; and C is any consonant) or VlьC? (where V is one of the front vowels e, i, ö, ü, and C? is either a consonant or the end of the word). There are more details, but that’s the basic gist of it. So words like faal get mapped to фаал, not фааль, because a is not a matching vowel.

    Other cases like etilgen, which gets mapped to етилген rather than этильген (focus on the ь, not the е/э, which is as above), seem to be in error. The i before the l should trigger adding the ь, but the rules are ordered and it’s possible the i has already been mapped to и, so the pattern no longer matches. I did encounter at least one other case like that.
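To make the ordering hypothesis concrete, here is a minimal JavaScript sketch. The rule and the letter map are simplified stand-ins invented for illustration (they are not the converter's actual tables), and the initial е/э question is deliberately left out.

```javascript
// Simplified stand-in for the "VlьC?" rule: insert ь after l when l follows
// a front vowel and is followed by a consonant or the end of the word.
// (Assumption: lower case only; the real rule has more conditions.)
const SOFT_L = /([eiöü])l(?=[bcçdfgğhjklmnñpqrsştvyz]|$)/g;

// Tiny Latin-to-Cyrillic letter map covering only the letters used below.
const LETTERS = { e: 'е', t: 'т', i: 'и', l: 'л', g: 'г', n: 'н' };

function mapLetters(word) {
  return Array.from(word, (ch) => LETTERS[ch] ?? ch).join('');
}

// Order A: letters first. The rule looks for Latin "il" but only finds "ил".
function lettersThenRule(word) {
  return mapLetters(word).replace(SOFT_L, '$1lь');
}

// Order B: rule first, while the word is still in Latin script.
function ruleThenLetters(word) {
  return mapLetters(word.replace(SOFT_L, '$1lь'));
}

console.log(lettersThenRule('etilgen')); // етилген
console.log(ruleThenLetters('etilgen')); // етильген
```

If the production code really does run the plain letter mapping before the context rules, order A would explain the missing ь; that is still only a guess until the code is checked.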

So, other than the ль case, which I can investigate, the transliteration seems to be at least mostly behaving according to the original patterns. Our options are to (i) modify the existing patterns or add new ones that capture the generalities we missed, or (ii) add a bunch of exceptions, which will never cover everything, but will guarantee correct results for the most common errors.

I worry that with too many exceptions, performance will suffer, so I'd prefer patterns, but I don't know what they would be.

A - one of the letters b, c, g, k, p, ş
B - one of the letters ç, n, r, s, t, z
С - one of the letters b, c, ç, d, f, g, ğ, h, j, k, l, m, n, ñ, p, q, r, s, ş, t, v, y, z
D - one of the letters a, â, e, ı, i, o, ö, u, ü, а, е, ё, и, о, у, ы, э, ю, я
E - one of the letters e, i, ö, ü
_ - beginning of the word

The patterns are as follows (order is important!):

ElC => ElьC

_yüB => _юBь

_öB => _оBь
_üB => _уBь

_ö => _о
_ü => _у

_AöB => AоBь
_AüB => AуBь

_Aö => Aо
_Aü => Aу

Cö => Cё /this regexp is definitely missing now/
Cü => Cю /this regexp is definitely missing now/

yo => ё /this regexp is definitely missing now/
yö => ё /this regexp is definitely missing now/
yu => ю /this regexp is definitely missing now/
yü => ю /this regexp is definitely missing now/

_e => _э
De => Dэ
Cye => Cье
ye => е
е => е

кьк => кк
льл => лл
ньн => нн
рьр => рр
сьс => сс
тьт => тт
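As a rough illustration of how such an ordered list can be applied, the JavaScript sketch below runs a small subset of the patterns above and then a plain letter map. It is an assumption-laden sketch, not the actual script: it is lower case only, it omits the ь-inserting B variants, the y rules and the ь clean-up, and the names RULES, LETTERS and toCyrillic are made up for the example.

```javascript
// Letter classes from the list above (lower case only, for brevity).
const A = 'bcgkpş';
const C = 'bcçdfgğhjklmnñpqrsştvyz';
const D = 'aâeıioöuüаеёиоуыэюя';
const START = '(?<!\\p{L})'; // "_": not preceded by a letter

// An illustrative subset of the ordered patterns, NOT the full rule set.
const RULES = [
  [new RegExp(`${START}([${A}]?)ö`, 'gu'), '$1о'], // _ö => _о, _Aö => Aо
  [new RegExp(`${START}([${A}]?)ü`, 'gu'), '$1у'], // _ü => _у, _Aü => Aу
  [new RegExp(`([${C}])ö`, 'gu'), '$1ё'],          // Cö => Cё
  [new RegExp(`([${C}])ü`, 'gu'), '$1ю'],          // Cü => Cю
  [new RegExp(`${START}e`, 'gu'), 'э'],            // _e => _э
  [new RegExp(`([${D}])e`, 'gu'), '$1э'],          // De => Dэ
];

// "Ordinary" replacements, applied last; only a few letters are listed.
const LETTERS = { a: 'а', b: 'б', d: 'д', e: 'е', g: 'г', i: 'и', k: 'к',
  l: 'л', n: 'н', o: 'о', p: 'п', r: 'р', s: 'с', t: 'т', v: 'в', z: 'з' };

function toCyrillic(text) {
  let out = text;
  for (const [pattern, replacement] of RULES) {
    out = out.replace(pattern, replacement); // order matters
  }
  return Array.from(out, (ch) => LETTERS[ch] ?? ch).join('');
}

console.log(toCyrillic('sürgünligi'));     // сюргюнлиги
console.log(toCyrillic('prodrazvörstka')); // продразвёрстка
console.log(toCyrillic('etti'));           // этти
```

The word-start rules have to run before the general Cö/Cü rules, otherwise a first-syllable ö or ü after one of the A consonants would already have become ё or ю.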

I can also provide the JavaScript, which contains all the necessary regular expressions and the full list of exceptions (or you can obtain it yourself by saving this page: https://medeniye.org/en/node/531).

TJones renamed this task from Crimean Tatar transliteration script to Crimean Tatar transliteration has trouble with ё, ь, э, ю. Feb 8 2018, 3:22 PM
TJones updated the task description.

@DonAlessandro, thanks for all the explicit detail. I know that you know what's supposed to be happening!

For full details of the current implementation, you can check out the code on GitHub:

I may have introduced some bugs, especially in ordering or the regexes, when I refactored the original PHP code to get it running in the current version of the Language Converter framework. I'll investigate the current known bugs and compare the JavaScript to the current PHP implementation and see if I can improve it. Please point out any problems you find in the code linked above.

@TJones, as far as I can see, "public $mCyrillicToLatin" and "public $mLatinToCyrillic" come BEFORE all regexes and exceptions, but they should come AFTER. First we transliterate all exceptions, then all these sophisticated regexes, and only then do the "ordinary" replacements (a => а, b => б, etc.).
Could you please test how it works with the right order?

Thanks, @DonAlessandro—I'll take a look at that. I also noticed that one specific exception case wasn't firing correctly.

I've found the problem, but haven't figured out how to fix it yet.

The exception list is not being loaded, so lots of exceptions (which tend to feature ё, ь, э, and ю) are not being converted correctly. Interestingly, the exception list is being loaded during unit testing, so that doesn't catch it.

There's some complicated processing going on to make sure the exceptions and other tables aren't loaded repeatedly, but somehow it's blocking them from getting loaded the first time.

This may also explain @DonAlessandro's weird experience on Meta where an English banner was transliterated "correctly" when his interface was set to CRH—if no other transliteration tables had been loaded, the CRH tables could have loaded cleanly. (Just a guess there.)

I didn't catch this during development because there's some other caching that happens, and I temporarily disabled as much caching as possible so that all my transliteration config changes came through without having to re-edit a page. And, of course, the unit tests work! Argh.

I'll see if I can figure out what's blocking the exception tables from loading properly and get a patch up as soon as possible.

The exception list is not being loaded

You mean exceptions AND regexes? If the regexes are applied, 99% of all texts will be transliterated correctly.
Actually, most of the words you added to the exception list (e.g. 'гонъюлли' => 'göñülli', 'дёрдю' => 'dördü', 'этюв' => 'etüv') should be transliterated properly without being there, because they fit one of the patterns.

I wouldn't be surprised if the regexes aren't loading correctly either, since all the tables are loaded at the same time—the exception list was the first obvious problem I found while debugging.

Okay, using the example @DonAlessandro provided, I've been able to fix most of the errors. I also found two major bugs.

First, the bugs. When converting @DonAlessandro's original code, I made an editing error and did not map Öö and Üü correctly—which explains why there are so many Ёё and Юю errors.

I've fixed those mapping bugs and I fixed the table loading problem by copying more carefully from the Kazakh converter.

There are two types of disagreement left in the sample, and there are questions for @DonAlessandro, especially in (1).

  1. Partial word matches for exceptions: This is the most common cause of differences. As an example, my PHP code transliterates "rayonlarında" as раёнларында while your JS code gives районларында.

    The difference is caused by the way I implemented some items in the original code, which has an uncalled table of exceptions which included things like "rayon"=>"район". I interpreted these as whole words, while your JS code treats them as patterns (/rayon/g —> район).

    This applies to rayon/район, işğal/ишгъаль, lager/лагерь, faal/фааль, ial/иаль, and thousands of others. I can change the way these are handled, though it will be much slower, since I do a hash lookup now, rather than running low thousands of regular expressions.

    I was originally worried about some of the words that would match some of the shorter patterns. For ial crhwiki matches: Sotsialistik, territorial, Vaqialar, Ekvatorial, materiallar, Official, Bialıstok, iddiaları, Rialı, filialı, Imperial, Tercimeial. Do those all look right?

    Should there be any limitations, such as matching only at the beginning or end of a word? Or do they match everywhere?

    (This also affected "ŞSCBnen", which my code transliterated as ШСДЖБнен, rather than ШСДжБнен because the ŞSCB/ШСДжБ mapping currently only applies to the whole word. I originally thought this was an acronym/all-caps problem, but then I found ŞSCB in the exception list.)
  2. Complex regexes: There are some really complex regexes like `/([\s"'\(\)\-.,:;!?>\]\/])([yY])ü([çnrstzÇNRSTZ])([\s"'.,:;!?\)\-\[<aAuUbcçdfgğhjklmnñpqrstvyzBCÇDFGĞFHJKLMNÑPQRSTVYZ])/g` which I tried to adapt to deal with the fact that I'm only sending single words to be transliterated most of the time. I'll have to look at these more carefully and figure out what should be happening and where I messed up. Right now I only have one example (yüz -> юз/юзь) but I'll figure it out and then look for others.

If we really should do all partial matches for (1), then I may commit what I've got before working on it more. It is much less important than the Öö / Üü -> Ёё / Юю errors, and it may require some additional work to make it sufficiently performant.

Crimean Tatar (like other Turkic languages) has plenty of affixes that can be added to the end of a word. So one root can produce hundreds, or maybe thousands, of forms, e.g. rayon, rayonda, rayonnıñ, rayonnıñki, rayonımız, rayonlar, rayonlarımız, rayonlarımızda, rayonlarımızdaki, rayonlarımızdakiler, rayonlarımızdakilerden, rayonlarımızdakilerdensiñ, rayonlarımızdakilerdensiñmi, rayonlarımızdakilerdensizimi, rayonlarımızdakilerdenlermi, etc. (but all of them begin with rayon-, as Turkic languages have virtually no prefixes). So it is impossible to include all the forms produced by a single root in the exception list. That is why words from the exception list are to be treated as patterns matching only at the beginning of a word.
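One way to get word-initial matching without running thousands of regexes is to keep the hash lookup and probe every prefix of the word, longest first. The JavaScript below is only a sketch of that idea, not what either the PHP converter or the original script does; the three stems come from examples in this thread, and the `regular` helper is a hypothetical stand-in for the normal rule-based path.

```javascript
// Hypothetical exception stems (Latin => Cyrillic) taken from examples in
// this thread; the real list has thousands of entries.
const EXCEPTIONS = new Map([
  ['rayon', 'район'],
  ['lager', 'лагерь'],
  ['faal', 'фааль'],
]);
const MAX_STEM = Math.max(...Array.from(EXCEPTIONS.keys(), (k) => k.length));

// Stand-in for the regular rule-based path (assumption: tiny letter map).
const LETTERS = { a: 'а', d: 'д', 'ı': 'ы', l: 'л', n: 'н', r: 'р' };
const regular = (s) => Array.from(s, (ch) => LETTERS[ch] ?? ch).join('');

function transliterateWord(word) {
  // Longest stem first; one hash lookup per candidate prefix length.
  for (let len = Math.min(MAX_STEM, word.length); len > 0; len--) {
    const stem = EXCEPTIONS.get(word.slice(0, len));
    if (stem !== undefined) return stem + regular(word.slice(len));
  }
  return regular(word);
}

console.log(transliterateWord('rayonlarında')); // районларында
```

The cost per word is bounded by the length of the longest stem rather than by the number of exceptions, which may help with the performance concern raised above.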

For ial crhwiki matches: Sotsialistik, territorial, Vaqialar, Ekvatorial, materiallar, Official, Bialıstok,
iddiaları, Rialı, filialı, Imperial, Tercimeial. Do those all look right?

Yes, because after ial => иаль is applied, unnecessary Ь's will be removed by the other regex, which removes Ь before vowels and between two л, н, р, с, т. E.g. "vaqialar" is transliterated as follows: vaqialar > vaqиальар > вакъиальар > вакъиалар
By the way, tercimeial => терджимеиал should be added to the exception list, as it has no ь.
But if we limit this particular regex to the end of the word, it will be OK, too.
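The clean-up step described here can be sketched in JavaScript as follows. This is an illustration under assumptions, not the actual regexes: it works on lower-case Cyrillic only, the function name is made up, and a real implementation would have to protect legitimate ь + vowel sequences such as those produced by the Cye => Cье pattern.

```javascript
const VOWELS = 'аеёиоуыэюя';

// ь directly before a vowel is dropped.
const DROP_BEFORE_VOWEL = new RegExp(`ь(?=[${VOWELS}])`, 'g');
// ь between two identical consonants is dropped (кьк => кк, льл => лл, ...).
const DROP_BETWEEN_DOUBLES = /([клнрст])ь(?=\1)/g;

function cleanSoftSigns(cyrillic) {
  return cyrillic
    .replace(DROP_BEFORE_VOWEL, '')
    .replace(DROP_BETWEEN_DOUBLES, '$1');
}

console.log(cleanSoftSigns('вакъиальар')); // вакъиалар
console.log(cleanSoftSigns('эльли'));      // элли
```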

I originally thought this was an acronym/all-caps problem

C, Ğ, Q, Ñ in abbreviations are to be transliterated as Дж, Гъ, Къ, Нъ (e.g. QMC => КъМДж). So I added some popular abbreviations (like QMC = Qırım Muhtar Cumhuriyeti = Autonomous Republic of Crimea) to the list. The problem is that affixes can be added to abbreviations, too.
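For abbreviations with affixes, one hypothetical alternative to listing every abbreviation is to split a word into a run of capitals and a lower-case affix. The JavaScript sketch below assumes that heuristic (two or more capitals followed by lower-case letters), which is not part of the existing script; the tables list only the letters needed for the two examples from this thread.

```javascript
// Capital letters inside abbreviations; digraphs keep a lower-case second
// letter (Дж, not ДЖ). Only the letters needed for the examples are listed.
const ABBR = { B: 'Б', C: 'Дж', 'Ğ': 'Гъ', M: 'М', 'Ñ': 'Нъ', Q: 'Къ',
  S: 'С', 'Ş': 'Ш' };
// Stand-in letter map for a lower-case affix (assumption: tiny subset).
const AFFIX = { e: 'е', n: 'н' };

const mapWith = (table, s) =>
  Array.from(s, (ch) => table[ch] ?? ch).join('');

// Assumed heuristic: two or more capitals, then an optional lower-case affix.
const ABBR_WORD = /^(\p{Lu}{2,})(\p{Ll}*)$/u;

function transliterateAbbreviation(word) {
  const m = ABBR_WORD.exec(word);
  if (m === null) return null; // not an abbreviation under this heuristic
  return mapWith(ABBR, m[1]) + mapWith(AFFIX, m[2]);
}

console.log(transliterateAbbreviation('QMC'));     // КъМДж
console.log(transliterateAbbreviation('ŞSCBnen')); // ШСДжБнен
```

Text written entirely in capitals would also match this heuristic, so it would need an extra guard before being used on real pages.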

There are some really complex regexes like

This one is actually for three words: yün, yür, yüz and all of their forms like yüzler, yünlü, yürgenleri, etc. So we can add these three to the exception list. I have looked through the regexes, and this is the only one applied to so few words.
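For comparison, the same rule can be written for single words by replacing the consumed boundary characters with lookarounds. The JavaScript below is a simplified, hedged adaptation of the regex quoted earlier: the lookbehind and lookahead are my own substitutions, the follower set is reduced to lower case, and the small letter map is a stand-in for the ordinary replacements.

```javascript
const C = 'bcçdfgğhjklmnñpqrsştvyz';

// Word-initial yü + one of ç, n, r, s, t, z, followed by the end of the
// input, punctuation/space, a, u, or a consonant (as in the quoted regex).
const YU_WORD = new RegExp(
  `(?<!\\p{L})([yY])ü([çnrstz])(?=$|[\\s"'.,:;!?)\\[<au${C}-])`, 'gu');

// Stand-in for the ordinary letter map (assumption: tiny subset).
const LETTERS = { e: 'е', l: 'л', n: 'н', r: 'р', z: 'з' };

function transliterateYuWord(word) {
  const marked = word.replace(YU_WORD,
    (match, y, b) => (y === 'Y' ? 'Ю' : 'ю') + b + 'ь'); // _yüB => _юBь
  return Array.from(marked, (ch) => LETTERS[ch] ?? ch).join('');
}

console.log(transliterateYuWord('yüz')); // юзь
```

Because the lookahead also accepts the end of the input, the rule fires when a single word is passed in, which is the situation the adapted PHP code has to handle.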

Change 414728 had a related patch set uploaded (by Tjones; owner: Tjones):
[mediawiki/core@master] Fix table loading bug for CRH transliteration

https://gerrit.wikimedia.org/r/414728

Based on my time constraints and the severity of the bugs I found and fixed, I've submitted the patch above, which fixes the incorrect Ö/ö -> Ё/ё and Ü/ü -> Ю/ю mappings and which should have the regex and exception tables loading properly in production.

I'll open a new ticket for treating the exceptions as patterns rather than whole words and for investigating the complex yün, yür, yüz regex—the solution to both of those may be related.

Change 414728 merged by jenkins-bot:
[mediawiki/core@master] Fix table loading bug for CRH transliteration

https://gerrit.wikimedia.org/r/414728

I believe this should go out in next week's deployment, and should be live next Thursday if all goes well.

"Thursday" in California may be Friday across the Atlantic, but it looks like the patch has been deployed (crhwiki is in "Group 2" and the patch is in "1.31.0-wmf.24" which is now live on Group 2), and there are many fewer errors in @DonAlessandro's example.

However, I'm seeing errors in some easy mappings of ö→ё and ü→ю, particularly in sürgünligi and prodrazvörstka which, along with others, are on my example page. I've purged the cache for the page and edited it, both of which should make sure it gets freshly transliterated and not using a cached version.

The errors don't make sense because a lot of the transliteration work is being done by the basic Cyrillic-to-Latin mapping, and I double checked that the ö→ё and ü→ю fixes are in the patch, and clearly the exception table is working because the faal→фааль exception is happening correctly, so the patch really has been deployed. And of course it works fine on my development test wiki on my laptop.

There could be some code caching that I'm not aware of that's causing a cached version of the Cyrillic-to-Latin mapping to be used, but that seems unlikely. I'll ask some people who may know more than me about it, and see if they have any insights. If it is a code caching problem, it could resolve itself in a few hours, a day, or a week—I'm not expecting that to be the case, but I'll check first thing tomorrow anyway.

I'll keep this ticket open until I get to the bottom of this. (Bummer—I was hoping to close it today after the deployment.)

Woo hoo! It looks like either waiting was the answer, or some helpful WikiGnome purged a cache somewhere.

I'll be working on T188321: CRH Transliteration pattern matching fixes next.

Though this ticket is closed, I wanted to document what happened. There is a 12-hour cache for some of the components in the transliteration. It's possible to purge the cache, but waiting 12 hours also works. In this case, none of the changed pieces were incompatible, so waiting was a reasonable answer, even if it was a bit less satisfying.