Roman-Cyrillic converter for Erzya wikis
Open, Needs Triage, Public, Feature

Description

Request Roman-Cyrillic converter for Erzya wikis (what you would like to be able to do and where):
We would like to be able to read and write in Erzya regardless of whether it is written using a Roman or Cyrillic script.

Write and read Erzya in both a Cyrillic and a Roman script
Background
Two writing systems, Cyrillic and Roman, are now in use for Erzya on social media; this is a new dichotomy in the writing of the language. Erzya-language literature has been written in Cyrillic script since the publication of the Gospel in 1821 and the remainder of the New Testament in 1827. In addition, extensive fieldwork collections, along with some lexicography and printed readers, have been produced in a Roman script by Finnish (Paasonen, E. Itkonen), Hungarian (Keresztes, Meszáros), Estonian (Päll, Aasmäe) and even Russian (Šaxmatov) scholars. Although there was minimal development of a Roman script in the 1930s, work on establishing a Romanization of Erzya began outside of Mordovia in the 1990s; central parties to this discussion were the Estonian Tomas Help, Niina Aasmäe (an Erzya living in Estonia, who has now published readers for Erzya at the University of Tartu), and the Erzya writer/activist Boris Erushov (Erüš Vežaj).

Active Erzya writers on Wikimedia number only a couple of handfuls. One group, consisting of native speakers and language learners, writes in the traditional Cyrillic script, while the other, which also consists of native speakers and language learners, writes in the Roman script adopted in the 1990s (I'll need some exact dates here, of course).

There are Kazakh, Tajik, and Uzbek Wikipedias that have bidirectional latn-cyrl language converter systems.

I am aware there are different resolutions for Wikipedia development that must be borne in mind when discussing this feature.

The Hungarian option was to make two separate projects when there arose a controversy over familiarity versus politeness, but this will not work for Erzya, because there just are not that many Erzyas contributing.

The Tatar option is to write in one Wikipedia using two different writing systems (Roman and Cyrillic), where the Cyrillic script is predominant, but the same article, with varying content, may be written in both Roman and Cyrillic scripts. Where two articles exist for the same subject, they are interlinked. This is problematic for the scientific community, because it is difficult to develop meaningful tools when the dump materials are mingled. For the contributors, it means that duplicate work is being done, which is not desirable given how few Erzya Wikipedians there are. It also presents questionable variation for the readership, as articles on the same subject do not necessarily contain the same information, e.g.
(Cyrillic) Фәүзия Бәйрәмова (with official Russian name)
(Roman) Fäwziä Bäyräm (with Tatar-language name)

The Kazakh option is to have a converter where the texts can be written in both Roman and Cyrillic scripts in addition to Arabic script. I am not certain, however, as to whether all three scripts can be used for editing Wiki articles at this time.

To cope with articles written in the Roman script, I have written two different converters: one a simple Python script and the other just several lines of Perl. I run these on my own command line so that we can retain a relatively clean single-script system for the articles; i.e., we have tried to convert all Erzya articles written in Roman script into Cyrillic script, which is actually quite straightforward. A conversion in the reverse direction can be developed for the same purpose.
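To give an idea of what such a converter might look like, here is a minimal sketch in Python. It is not the script mentioned above. The mapping table is a small, hypothetical subset chosen only for illustration and does not reproduce any agreed Erzya Romanization (palatalization, for one, is not handled). The names `ROMAN_TO_CYRILLIC`, `CYRILLIC_TO_ROMAN` and `transliterate` are likewise assumptions. The sketch uses longest-match-first replacement, so multi-letter sequences such as "ja" are tried before single letters.

```python
# Illustrative, partial mapping only (an assumption, not an official table).
ROMAN_TO_CYRILLIC = {
    "ja": "я", "ju": "ю",
    "a": "а", "o": "о", "u": "у", "i": "и",
    "b": "б", "v": "в", "g": "г", "d": "д", "z": "з",
    "j": "й", "k": "к", "l": "л", "m": "м", "n": "н",
    "p": "п", "r": "р", "s": "с", "t": "т", "c": "ц",
    "š": "ш", "ž": "ж", "č": "ч",
}

# Reverse table built by inverting the forward one (works here because
# every value in the illustrative table is unique).
CYRILLIC_TO_ROMAN = {cyr: rom for rom, cyr in ROMAN_TO_CYRILLIC.items()}


def transliterate(text, table):
    """Replace each longest matching key in `text` with its table value.

    Characters with no entry in `table` (spaces, punctuation, unmapped
    letters) are copied through unchanged. An initial capital in the
    source sequence is carried over to the replacement.
    """
    keys = sorted(table, key=len, reverse=True)  # longest keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            chunk = text[i:i + len(key)]
            if chunk.lower() == key:
                repl = table[key]
                if chunk[0].isupper():
                    repl = repl[0].upper() + repl[1:]
                out.append(repl)
                i += len(key)
                break
        else:
            # No key matched at this position: keep the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Jalga, kudo!", ROMAN_TO_CYRILLIC))  # Ялга, кудо!
    print(transliterate("Ялга", CYRILLIC_TO_ROMAN))          # Jalga
```

Note that simply inverting the table is not a true inverse in general: a Cyrillic sequence such as "йа" would come out as "ja" and convert back to "я", so a production-quality reverse converter would need explicit rules for such cases.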


Benefits (why should this be implemented?):

The conceivable benefits of this undertaking are that (1) we will be able to gain additional contributors to Erzya Wikipedia; (2) people not versed in Roman scripts will be given more opportunities to familiarize themselves with this medium and thence have better access to fieldwork materials rendered in the Finno-Ugric transcription (aka UPA); (3) those not fluent in the Cyrillic system will have access to the same reading experience of the abundant materials from the 1920s and 1930s now being made available in Wikisource; (4) with the increase of Erzyas active in the Wikipedia community, there will be more contributions to quality and discussion; and (5) the materials will be stored in a single script, which will, on the one hand, make them easier to use in open-source development of writing tools and, on the other, give no reason for complaints about "illegal" use of non-Cyrillic writing systems from "Wiki monitors" with no working knowledge of the language.

All in all, I see this as a workable solution for Erzya language development.