Implement Latin->Cyrillic transliterator for Serbo-Croatian
Closed, Resolved (Public)

Description

Implement a transliterator from Latin to Cyrillic and vice versa for Serbo-Croatian, in a similar fashion to what is done for Serbian. Internally, MediaWiki calls this feature the language converter (LanguageConverter).

Link to the relevant discussion on the Serbo-Croatian Wikipedia:
https://sh.wikipedia.org/wiki/Wikipedia:Pijaca-%D0%9F%D0%B8%D1%98%D0%B0%D1%86%D0%B0/%D0%9E%D0%BC%D0%BE%D0%B3%D1%83%D1%9B%D0%B0%D0%B2%D0%B0%D1%9A%D0%B5_%D1%9B%D0%B8%D1%80%D0%B8%D0%BB%D0%B8%D1%86%D0%B5

I tried to resolve this myself but did not have much success. I'm not giving up, so I'm assigning the task to myself, but any help would be much appreciated since I'm not sure I will be able to do it.

Event Timeline

Is it intended to have sh-Cyrl (Serbo-Croatian in Cyrillic script) and sh-Latn (Serbo-Croatian in Latin script) both as content languages with automatic transliteration and as user interface languages? Should the existing messages in sh be moved to sh-Latn? Should sh-Latn be the default fallback for sh? Is the mapping for sh the same as the one for sr (code quoted here)?

/**
 * @var string[]
 */
public $mToLatin = [
        'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd',
        'ђ' => 'đ', 'е' => 'e', 'ж' => 'ž', 'з' => 'z', 'и' => 'i',
        'ј' => 'j', 'к' => 'k', 'л' => 'l', 'љ' => 'lj', 'м' => 'm',
        'н' => 'n', 'њ' => 'nj', 'о' => 'o', 'п' => 'p', 'р' => 'r',
        'с' => 's', 'т' => 't', 'ћ' => 'ć', 'у' => 'u', 'ф' => 'f',
        'х' => 'h', 'ц' => 'c', 'ч' => 'č', 'џ' => 'dž', 'ш' => 'š',

        'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D',
        'Ђ' => 'Đ', 'Е' => 'E', 'Ж' => 'Ž', 'З' => 'Z', 'И' => 'I',
        'Ј' => 'J', 'К' => 'K', 'Л' => 'L', 'Љ' => 'Lj', 'М' => 'M',
        'Н' => 'N', 'Њ' => 'Nj', 'О' => 'O', 'П' => 'P', 'Р' => 'R',
        'С' => 'S', 'Т' => 'T', 'Ћ' => 'Ć', 'У' => 'U', 'Ф' => 'F',
        'Х' => 'H', 'Ц' => 'C', 'Ч' => 'Č', 'Џ' => 'Dž', 'Ш' => 'Š',
];

/**
 * @var string[]
 */
public $mToCyrillics = [
        'a' => 'а', 'b' => 'б', 'c' => 'ц', 'č' => 'ч', 'ć' => 'ћ',
        'd' => 'д', 'dž' => 'џ', 'đ' => 'ђ', 'e' => 'е', 'f' => 'ф',
        'g' => 'г', 'h' => 'х', 'i' => 'и', 'j' => 'ј', 'k' => 'к',
        'l' => 'л', 'lj' => 'љ', 'm' => 'м', 'n' => 'н', 'nj' => 'њ',
        'o' => 'о', 'p' => 'п', 'r' => 'р', 's' => 'с', 'š' => 'ш',
        't' => 'т', 'u' => 'у', 'v' => 'в', 'z' => 'з', 'ž' => 'ж',

        'A' => 'А', 'B' => 'Б', 'C' => 'Ц', 'Č' => 'Ч', 'Ć' => 'Ћ',
        'D' => 'Д', 'Dž' => 'Џ', 'Đ' => 'Ђ', 'E' => 'Е', 'F' => 'Ф',
        'G' => 'Г', 'H' => 'Х', 'I' => 'И', 'J' => 'Ј', 'K' => 'К',
        'L' => 'Л', 'LJ' => 'Љ', 'M' => 'М', 'N' => 'Н', 'NJ' => 'Њ',
        'O' => 'О', 'P' => 'П', 'R' => 'Р', 'S' => 'С', 'Š' => 'Ш',
        'T' => 'Т', 'U' => 'У', 'V' => 'В', 'Z' => 'З', 'Ž' => 'Ж',

        'DŽ' => 'Џ', 'd!ž' => 'дж', 'D!ž' => 'Дж', 'D!Ž' => 'ДЖ',
        'Lj' => 'Љ', 'l!j' => 'лј', 'L!j' => 'Лј', 'L!J' => 'ЛЈ',
        'Nj' => 'Њ', 'n!j' => 'нј', 'N!j' => 'Нј', 'N!J' => 'НЈ'
];
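
For readers following along, here is a minimal sketch of how a table like $mToCyrillics could be applied so that the digraphs (lj, nj, dž) win over their single letters. It is only an illustration, not the actual LanguageConverter code: toCyrillicSketch() is a hypothetical helper, the table is copied from the snippet above without the '!' entries, and using strtr() is an assumption made for the sketch.

```php
<?php
// Latin -> Cyrillic table copied from the snippet above (the '!' entries are left out).
$toCyrillic = [
    'a' => 'а', 'b' => 'б', 'c' => 'ц', 'č' => 'ч', 'ć' => 'ћ',
    'd' => 'д', 'dž' => 'џ', 'đ' => 'ђ', 'e' => 'е', 'f' => 'ф',
    'g' => 'г', 'h' => 'х', 'i' => 'и', 'j' => 'ј', 'k' => 'к',
    'l' => 'л', 'lj' => 'љ', 'm' => 'м', 'n' => 'н', 'nj' => 'њ',
    'o' => 'о', 'p' => 'п', 'r' => 'р', 's' => 'с', 'š' => 'ш',
    't' => 'т', 'u' => 'у', 'v' => 'в', 'z' => 'з', 'ž' => 'ж',
    'A' => 'А', 'B' => 'Б', 'C' => 'Ц', 'Č' => 'Ч', 'Ć' => 'Ћ',
    'D' => 'Д', 'Dž' => 'Џ', 'Đ' => 'Ђ', 'E' => 'Е', 'F' => 'Ф',
    'G' => 'Г', 'H' => 'Х', 'I' => 'И', 'J' => 'Ј', 'K' => 'К',
    'L' => 'Л', 'LJ' => 'Љ', 'M' => 'М', 'N' => 'Н', 'NJ' => 'Њ',
    'O' => 'О', 'P' => 'П', 'R' => 'Р', 'S' => 'С', 'Š' => 'Ш',
    'T' => 'Т', 'U' => 'У', 'V' => 'В', 'Z' => 'З', 'Ž' => 'Ж',
    'DŽ' => 'Џ', 'Lj' => 'Љ', 'Nj' => 'Њ',
];

// Hypothetical helper, not a MediaWiki function.
function toCyrillicSketch(string $text, array $table): string {
    // With an array argument, strtr() tries the longest keys first and
    // never re-scans text it has already replaced, so 'lj' becomes 'љ'
    // rather than 'л' followed by 'ј'.
    return strtr($text, $table);
}

echo toCyrillicSketch('ljudi', $toCyrillic), "\n";  // људи
echo toCyrillicSketch('Njegoš', $toCyrillic), "\n"; // Његош
```

Characters that have no entry in the table (digits, punctuation, and also q, w, x, y) pass through unchanged in this sketch.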

Is it intended to have sh-Cyrl (Serbocroatian in Cyrillic script) and sh-Latn (Serbocroatian in Latin script) both as content language with automatic transliteration and as user interface language?

Well, if that's the usual terminology for the scripts in which the content is written (sh-cyrl and sh-latn being the two scripts of Serbo-Croatian), then yes, those are to be the "content languages" for sh.wikipedia. My original plan was to implement the same solution as for sr (the same mapping, i.e. almost the whole code) and to have a similar two-way automatic transliteration (apart from the guessVariant function). After a bit of persuading from @Aca, we concluded that a modified, one-way automatic transliteration similar to the one used for Tajik would suit better: the idea is to minimize the use of two scripts and in the long run write in one script (Latin), while keeping the option to read in either. The second script (Cyrillic) would still be allowed for content, but content entered in it would gradually be replaced by Latin. Since Aca is doing all the necessary footwork for the converter, I should formally unassign this task from myself so that he can take over.

Change 726595 had a related patch set uploaded (by Acamicamacaraca; author: Acamicamacaraca):

[mediawiki/core@master] Implement LanguageConverter for sh.wiki

https://gerrit.wikimedia.org/r/726595

Aca renamed this task from Implement Latin/Cyrillic transliterator for Serbocroatian to Implement Latin->Cyrillic transliterator for Serbo-Croatian. Oct 5 2021, 11:17 PM

What's the status of this as of now? I have a concern that I think hasn't been raised yet. The Serbian Wikipedia and its standard use a fully phonemic orthography, i.e. every foreign word is transliterated and written exactly as it is pronounced in the language (effectively minimizing the use of the letters X, Y, Q and W). The Serbo-Croatian Wikipedia and the other standards of the language, by contrast, use these letters and write foreign words exactly as they appear in the foreign language (unless the word has become part of the language itself, in which case it is adapted).

I see a potential issue there: how can we adapt these four letters into Cyrillic?

  • W = V ----> В
  • Q = mostly KV ----> КВ in words, but on its own KJU ----> КЈУ ?
  • X = mostly KS ----> КС in words, but on its own IKS or EKS ----> ИКС or ЕКС
  • Y = either I or J, depending on the word, e.g. Yoga = Joga / Rudy = Rudi

Apart from the problem with Y, problems could arise when somebody reads, in Cyrillic, words that have not been adapted to the script or to the standard they use (Serbian, certainly). Foreign words in e.g. album names are one case: rendering "Baby" as "Babi"/"Баби" or "Babj"/"Бабј" instead of "Bejbi"/"Бејби" would be a terrible result. All of this concerns only foreign letters and words, because Cyrillic lacks these four letters.

One partial solution is to manually exclude such words from conversion with -{ }-, as the Serbian Wikipedia does. So e.g. "Baby je sedmi studijski album benda Yello" ----> "Baby је седми студијски албум бенда Yello". Another partial solution is not to convert these letters at all, though Latin Yy and Xx look too similar to Cyrillic Уу and Хх (Uu and Hh). I've seen "Star Wars" written as "Стар Wарс", for example. A third option would be to sacrifice the other standards' preferences and write everything using the Serbian transliteration standard, which would effectively defeat the purpose of a shared Wikipedia where all standards are welcome.
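
To illustrate the -{ }- idea, here is a rough sketch of a converter that leaves marked spans alone and transliterates everything else. It is not how MediaWiki implements it: convertOutsideMarkersSketch() is a hypothetical helper, the regular expression only handles the plain -{text}- form and ignores any other options the real syntax may have, and the table is copied from the sr mapping quoted earlier in this task.

```php
<?php
// Latin -> Cyrillic table copied from the sr mapping quoted earlier (without the '!' entries).
$toCyrillic = [
    'a' => 'а', 'b' => 'б', 'c' => 'ц', 'č' => 'ч', 'ć' => 'ћ',
    'd' => 'д', 'dž' => 'џ', 'đ' => 'ђ', 'e' => 'е', 'f' => 'ф',
    'g' => 'г', 'h' => 'х', 'i' => 'и', 'j' => 'ј', 'k' => 'к',
    'l' => 'л', 'lj' => 'љ', 'm' => 'м', 'n' => 'н', 'nj' => 'њ',
    'o' => 'о', 'p' => 'п', 'r' => 'р', 's' => 'с', 'š' => 'ш',
    't' => 'т', 'u' => 'у', 'v' => 'в', 'z' => 'з', 'ž' => 'ж',
    'A' => 'А', 'B' => 'Б', 'C' => 'Ц', 'Č' => 'Ч', 'Ć' => 'Ћ',
    'D' => 'Д', 'Dž' => 'Џ', 'Đ' => 'Ђ', 'E' => 'Е', 'F' => 'Ф',
    'G' => 'Г', 'H' => 'Х', 'I' => 'И', 'J' => 'Ј', 'K' => 'К',
    'L' => 'Л', 'LJ' => 'Љ', 'M' => 'М', 'N' => 'Н', 'NJ' => 'Њ',
    'O' => 'О', 'P' => 'П', 'R' => 'Р', 'S' => 'С', 'Š' => 'Ш',
    'T' => 'Т', 'U' => 'У', 'V' => 'В', 'Z' => 'З', 'Ž' => 'Ж',
    'DŽ' => 'Џ', 'Lj' => 'Љ', 'Nj' => 'Њ',
];

// Hypothetical helper: transliterate everything except spans wrapped in -{ ... }-.
function convertOutsideMarkersSketch(string $text, array $table): string {
    // PREG_SPLIT_DELIM_CAPTURE interleaves the pieces: even indexes hold
    // ordinary text, odd indexes hold the captured contents of a marker.
    $parts = preg_split('/-\{(.*?)\}-/su', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    $out = '';
    foreach ($parts as $i => $part) {
        // Protected spans are emitted as written, without the markers.
        $out .= ($i % 2 === 1) ? $part : strtr($part, $table);
    }
    return $out;
}

echo convertOutsideMarkersSketch(
    '-{Baby}- je sedmi studijski album benda -{Yello}-',
    $toCyrillic
), "\n"; // Baby је седми студијски албум бенда Yello
```

The cost of this approach is the one described above: every foreign name has to be wrapped by hand in the wikitext.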

Album names (and, in fact, most English-language proper nouns) are not transcribed even on the Serbian Wikipedia. For example, see this article.

No doubt, using -{ }- tags is the best solution here.
As for the patch, I'll rebase it. It currently has a merge conflict.

Oh yeah, I should have used personal names as an example instead. Take "George Washington": do we write every instance of it in -{Latin}-, adapt it to the Serbian standard "Džordž Vašington" so that the Cyrillic is Џорџ Вашингтон, or let it auto-transliterate into Георге Wасхингтон? Or a harder example, "New York": do we let it transliterate into Неw Yорк (leaving W and Y untouched), Нев Јорк (always transliterating W and Y into V and J), Њујорк (adapting every instance of New York to Njujork in Latin script), or exclude every instance from transliteration?
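
The outcomes listed above are easy to reproduce with a plain table lookup. The sketch below is only an illustration under stated assumptions: $base is the sr table quoted earlier in this task (without the '!' entries), $withForeign is a hypothetical variant that always maps W to В and Y to Ј, and strtr() stands in for whatever the real converter does.

```php
<?php
// Latin -> Cyrillic table copied from the sr mapping quoted earlier (without the '!' entries).
$base = [
    'a' => 'а', 'b' => 'б', 'c' => 'ц', 'č' => 'ч', 'ć' => 'ћ',
    'd' => 'д', 'dž' => 'џ', 'đ' => 'ђ', 'e' => 'е', 'f' => 'ф',
    'g' => 'г', 'h' => 'х', 'i' => 'и', 'j' => 'ј', 'k' => 'к',
    'l' => 'л', 'lj' => 'љ', 'm' => 'м', 'n' => 'н', 'nj' => 'њ',
    'o' => 'о', 'p' => 'п', 'r' => 'р', 's' => 'с', 'š' => 'ш',
    't' => 'т', 'u' => 'у', 'v' => 'в', 'z' => 'з', 'ž' => 'ж',
    'A' => 'А', 'B' => 'Б', 'C' => 'Ц', 'Č' => 'Ч', 'Ć' => 'Ћ',
    'D' => 'Д', 'Dž' => 'Џ', 'Đ' => 'Ђ', 'E' => 'Е', 'F' => 'Ф',
    'G' => 'Г', 'H' => 'Х', 'I' => 'И', 'J' => 'Ј', 'K' => 'К',
    'L' => 'Л', 'LJ' => 'Љ', 'M' => 'М', 'N' => 'Н', 'NJ' => 'Њ',
    'O' => 'О', 'P' => 'П', 'R' => 'Р', 'S' => 'С', 'Š' => 'Ш',
    'T' => 'Т', 'U' => 'У', 'V' => 'В', 'Z' => 'З', 'Ž' => 'Ж',
    'DŽ' => 'Џ', 'Lj' => 'Љ', 'Nj' => 'Њ',
];

// Hypothetical policy: always map W to В and Y to Ј.
// The array union (+) keeps every $base entry and adds the four new keys.
$withForeign = $base + ['w' => 'в', 'W' => 'В', 'y' => 'ј', 'Y' => 'Ј'];

// W and Y have no entry in $base, so they stay Latin in the output.
echo strtr('George Washington', $base), "\n"; // Георге Wасхингтон
echo strtr('New York', $base), "\n";          // Неw Yорк

// With the extra entries every letter is converted, but the result
// is still not the adapted form Њујорк.
echo strtr('New York', $withForeign), "\n";   // Нев Јорк
```

No letter-by-letter table can produce Џорџ Вашингтон or Њујорк from the English spelling; those forms require the Latin text itself to be adapted first.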

Anyway, it is desirable that SH Wikipedia finally gets this transliterator; issues like those above can be dealt with later.

Change 851020 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] [POC] Implement LanguageConverter for sh, sh-latn

https://gerrit.wikimedia.org/r/851020

Winston_Sung changed the task status from Open to In Progress. Nov 3 2022, 8:28 AM

Change 853442 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] Unbind main language variant code from language code with converter for LanguageConverterFactoryTest.php

https://gerrit.wikimedia.org/r/853442

Change 854114 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] Split "main variant language code" from "language code with converter" for LanguageConverter

https://gerrit.wikimedia.org/r/854114

Change 853442 merged by jenkins-bot:

[mediawiki/core@master] Split "main variant language code" from "language code with converter" for LanguageConverterFactoryTest.php

https://gerrit.wikimedia.org/r/853442

Change 854114 merged by jenkins-bot:

[mediawiki/core@master] Split "static default variant" language code from "language code with converter" for LanguageConverter

https://gerrit.wikimedia.org/r/854114

Change 726595 merged by jenkins-bot:

[mediawiki/core@master] Implement LanguageConverter for sh.wiki

https://gerrit.wikimedia.org/r/726595

Change 851020 abandoned by Winston Sung:

[mediawiki/core@master] [POC] Implement LanguageConverter for sh

Reason:

I6f3e7efe3630e9960584dca3a5ee55cb92ea722c merged

https://gerrit.wikimedia.org/r/851020

(Mark as resolved after deployed to Wikipedia-sh.)

Change 861398 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/services/function-schemata@master] definitions: Add Z1653/sh-cyrl and Z1669/sh-latn natural languages

https://gerrit.wikimedia.org/r/861398

Change 861398 merged by jenkins-bot:

[mediawiki/services/function-schemata@master] definitions: Add Z1653/sh-cyrl and Z1669/sh-latn natural languages

https://gerrit.wikimedia.org/r/861398

Change 861416 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/services/function-orchestrator@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861416

Change 861418 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/tools/wikilambda-cli@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861418

Change 861417 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/services/function-evaluator@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861417

Change 861419 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861419

Change 861419 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861419

Change 861418 merged by jenkins-bot:

[mediawiki/tools/wikilambda-cli@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861418

Change 861417 merged by jenkins-bot:

[mediawiki/services/function-evaluator@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861417

Change 861416 merged by jenkins-bot:

[mediawiki/services/function-orchestrator@master] Update function-schemata sub-module to HEAD (a19a2e3)

https://gerrit.wikimedia.org/r/861416

Change 1006877 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[translatewiki@master] Disable sh, use sh-cyrl, sh-latn instead

https://gerrit.wikimedia.org/r/1006877