
A generalized language conversion engine
Open, Low, Public


Author: millosh

(I am writing this here because, AFAIK, it is necessary to make changes inside Edit.php to implement this idea. Some DB changes are also needed. However, feel free to move it wherever you think it should go.)

The present situation of the conversion engine, designed by Zhengzhu, may be described as follows:

  • There is one form stored in the DB.
  • Contributors have to know both (or all) scripts if they want to edit pages.
  • Scripts have to map roughly 1:1 when substituting elements.

Such an approach works in the classical cases, like the Chinese and Serbian engines. Every educated Serbian knows both the Cyrillic and Latin alphabets (Cyrillic is taught from the 1st year of primary school, Latin from the 2nd). AFAIK, it is also not hard for a Chinese reader to find the meaning of a character from the non-native script. In both examples, the scripts correspond almost 100% 1:1 (there are some exceptions, but it is not hard to add them inside the markup for exceptions: -{ ... }-). (There are perhaps up to 10 implementations of this principle across the MediaWiki languages.)
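As a rough illustration of this 1:1 substitution scheme with `-{ ... }-` exceptions (the character table and function names below are invented for the sketch, not taken from the actual MediaWiki implementation):

```javascript
// Illustrative partial Serbian Cyrillic -> Latin mapping (a few letters only).
// Note "љ" -> "lj" already breaks strict 1:1 at the character level.
const CYR_TO_LAT = { "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
                     "љ": "lj", "њ": "nj" };

// Text inside -{ ... }- is an exception and passes through unconverted.
function convert(text) {
  return text.split(/(-\{.*?\}-)/).map(chunk => {
    if (chunk.startsWith("-{")) {
      return chunk.slice(2, -2);  // strip the markers, keep the content as-is
    }
    return [...chunk].map(ch => CYR_TO_LAT[ch] ?? ch).join("");
  }).join("");
}
```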

However, there are a number of very different situations in the world. Some scripts differ from each other a lot, and education issues may be significant. For example, while Tajik and Persian are structurally the same language, it is not common to find a Tajik who can read the Perso-Arabic script, or a Persian who can read the Cyrillic script. There are also complex issues related to the "interpunctional behavior" of letters: the rules for the use of small and capital letters differ somewhat between the Cyrillic (Latin, Greek) and Arabic scripts.

So, the goals of the generalized conversion engine for MediaWiki are:

  • Allow contributors to see and edit pages in their preferred script.
  • Make an open set of rules which may be applied easily to different cases.
  • Solve the different kinds of "interpunctional behavior" problems in a generalized manner.
  • Introduce dictionary-based conversion. (This was initially introduced into the Serbian engine for the Ekavian-Iyekavian paradigm, but it was abandoned because no work was done on it after the initial implementation.)
  • A future goal, entirely possible if this engine is implemented: turn the conversion engine into a user-side feature. When script differences are great, it may be easier for some users to read the content in their preferred script (for example, it is easier for a European to read Chinese transcribed into Latin).

I have been thinking about some approaches to this issue, and I can guarantee that there are better ones :) However, I'll list some of them:

  • There should be fields in the database for different versions of the article. Or it should be possible to separate the different versions inside one field. Here is an example of the second idea:
    • There are a lot of situations where forms are exceptional. A classic example comes from the relation between the Latin and Arabic scripts: the Arabic script has no capital letters (or has different rules for them).
    • So, if a sentence begins with "Llll" in Latin, transcribed as "aaaa" in Arabic, the form in the database should be something like -{ Latin: Llll; Arabic: aaaa }-. However, such markup shouldn't be visible to the editor. An editor of the Latin text should see just "Llll" and an editor of the Arabic text should see just "aaaa".
    • In this case, if an editor of the Latin text changes it, general rules should be applied. If an editor of the Arabic text changes it, some specific rules should be applied (e.g., if the previous word ends with a full stop, the letter should be capital in Latin; if not, it should stay lowercase). But if that result is not correct in Latin (for example, the word is a personal name in the middle of a sentence), then when an editor of the Latin text fixes it from "llll" (which corresponds to "aaaa") to "Llll" (which also corresponds to "aaaa"), it should be stored as -{ Latin: Llll; Arabic: aaaa }-.
    • Of course, both editors should be able to go into a "meta mode" which would show them all of the markup and allow fine tuning.
    • When everything is changed (a major edit), some general and specific rules should be followed, but editors should also be allowed to fix errors.
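The "each editor sees only their own variant" idea could be sketched roughly like this (the markup format follows the -{ Latin: Llll; Arabic: aaaa }- example above; `viewFor` is a hypothetical name, not an existing MediaWiki function):

```javascript
// The stored "meta" form interleaves plain text with variant markup like
// -{ Latin: Llll; Arabic: aaaa }-.  Each editor sees only their own variant.
function viewFor(variant, storedText) {
  return storedText.replace(/-\{(.*?)\}-/g, (_, body) => {
    for (const part of body.split(";")) {
      const [name, form] = part.split(":").map(s => s.trim());
      if (name === variant) return form;  // found this editor's form
    }
    return body.trim();  // no form recorded for this variant: show raw body
  });
}
```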

The main reason I am writing this as a bug is that I am not a PHP programmer (although I am able to program in PHP :) ), which means that I am not able to solve all of the complex programming issues needed for MediaWiki. However, as a [formal] linguist, I am willing to participate actively in working on this issue. I am willing to cover all of the linguistic work needed for this (including finding the relevant people for problems related to different scripts).

Version: unspecified
Severity: enhancement



Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 10:20 PM
bzimport set Reference to bz15161.

I'm not sure what you mean by dictionary-based conversion, since the current converter already has dictionary-like tables. However there is a need for word segmentation, which the current converter does not have.

I'd like to know if there's a pre-existing domain-specific language we can use for rules, which users might be able to understand and edit, similar to the snowball language which is used for stemming.

I'm interested in transformations which are particularly challenging from a computational perspective, such as Arabic vowel marking. Life without challenges can be very dull. References would be appreciated.

millosh wrote:

Dictionary-based conversion: the simplest example is the difference between British and American English (and many languages or language variants have the same need): kilometer-kilometre. I haven't seen the engine for more than a year, and I am not sure how hard it would be to implement such an engine efficiently (though Robert told me that it is possible). For example, Serbian has a class of ~10,000 lexemes * ~10 forms, which gives ~100,000 words to replace (Ekavian/Iyekavian differences), with probably two out of ten words in a text needing replacement.

I would use regular expressions as a domain-specific language, plus some simple syntax for variables: Perl-like, Python-like, or even XML-like. It is part of the contemporary education of linguists, and there are a lot of programmers on wiki who could help.

I would start with two or three "simple" tasks, so we may see where we are going:

  • MediaWiki tasks:
    • Editing Edit.php so that the wiki text in the database doesn't need to correspond to the wiki text inside the edit box.
    • Making a simple but extensible syntax for adding conversion rules, with on-wiki pages for conversion.
  • Language-specific tasks:
    • The problem of Arabic-Latin conversion: capital letters, vowel marking (where possible).
    • Two paradigms problem in Serbian: Cyrillic/Latin and Ekavian/Iyekavian.

I'll start by finding references. The problem is that this field is relatively obscure: while regular expressions and some interpreted (AWK, Perl, Python) and markup languages (SGML and XML) have become part of the education of linguists, it is still too early for papers in this field. Also, syntax is now much more interesting because of translation engines, and translation engines work on completely different models.

I suppose the best we may find is:

  • Papers on Japanese script usage, which might cover a lot of the problems. However, I am not sure how many of those papers are available in English.
  • Some (probably very) general theories.
  • Classical philological papers which descriptively list the rules. Such papers are usually written in the native language. For example, I can find good references for Ekavian/Iyekavian conversion in Serbian.
  • Blogspot/Google has a Latin-Devanagari conversion engine, so some papers on Hindi or similar languages may probably be found.

So, we should make a step-by-step plan. Maybe the best idea is to articulate a project and find interested linguists...

The current converter has a conversion table, which has entries of any length, they are not required to be single characters. Where multiple table entries match at a given location in the source string, the longest one is chosen. This works well for Chinese where groups of 2 or 3 distinctive characters (occasionally up to 6) need to be treated as a unit. But it's rather awkward for languages like English, since a rule "color -> colour" would cause colorado to be converted to colourado. A better algorithm for languages like English would be to split the string into words, and then to do a hashtable lookup for each word.
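The two lookup strategies described above can be sketched side by side (toy tables, not the real conversion data; function names are invented for illustration):

```javascript
// Longest-match scan, as used for Chinese: at each position, try the longest
// table entry first, falling back to shorter ones, then to a literal copy.
function convertLongestMatch(text, table, maxKeyLen) {
  let out = "";
  let i = 0;
  while (i < text.length) {
    let matched = false;
    for (let len = Math.min(maxKeyLen, text.length - i); len > 0; len--) {
      const key = text.slice(i, i + len);
      if (key in table) { out += table[key]; i += len; matched = true; break; }
    }
    if (!matched) { out += text[i]; i += 1; }  // no entry: copy the character
  }
  return out;
}

// Word-split lookup, better for space-delimited languages like English:
// "color -> colour" must not turn "colorado" into "colourado".
function convertByWord(text, dict) {
  return text.replace(/[A-Za-z]+/g, word => dict[word] ?? word);
}
```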

Regarding the number of table entries: Chinese has 6500 for simplified and 9600 for traditional. As long as the table fits in memory, lookups are fast enough, but the per-request initialisation speed is already quite slow for Chinese and would be much worse if the table was 10 times bigger. Some optimisation work is needed. With the initialisation overhead removed, say by better caching, then we could do a table with millions of entries.
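The caching idea amounts to building the table once and reusing it. A process-local sketch (in the real PHP setup this would be APC/memcached rather than a variable; `buildTable` is a stand-in for the expensive parse of the full entry list):

```javascript
// Counts how often the expensive table construction actually runs.
let buildCount = 0;
function buildTable() {
  buildCount++;
  return { "color": "colour" };  // toy table standing in for thousands of entries
}

let cachedTable = null;
function getTable() {
  if (cachedTable === null) cachedTable = buildTable();  // pay the cost once
  return cachedTable;
}
```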

millosh wrote:

Ah, as a linguist, I make a distinction between a grapheme-level dictionary and a word-level dictionary. Only the word-level one is a "dictionary" to me :)

Yes, words should be extracted from text.

Also, I really think that the wiki text inside the database should be some kind of meta-wiki text. It consumes much less processor power to do the conversion once, when the text is submitted, than to apply a bunch of different rules every time someone reads the page.

And once we have extracted words, and conversion happens only when text is submitted, we may be able to do even more language conversions and markup.

Visual Editor already contains code to detect changed regions of pages and reserialize only the changed regions, in order to minimize dirty diffs. This seems like a good foundation for implementing a more intelligent language converter: the entire article can be language-converted in the editor, but only the changed regions will get resaved in the translated variant.

It would probably be worth marking the variant used in each changed region as well. That might be easier to represent in the DOM than in wikitext.

I am interested in working on this problem.

See also: bug 26121, bug 31015, and bug 52661.

Some discussion from IRC. Pig Latin would be a good english variant to explore some of the non-reversible language variant pairs (like Arabic/Latin).

(12:03:51 PM) cscott: James_F: i've been talking to liangent about language converter. it would be nice if VE could present the text to be edited in the proper language variant. the way that VE/parsoid selser works makes this feasible I think.
(12:04:40 PM) cscott: that is, we convert all the article text, but we only re-save the edited bits (in the converted variant). needs some thought wrt how diffs appear, etc.
(12:04:48 PM) James_F: cscott: That sounds totally feasible - your talking about VE requesting zh-hans or zh-hant (or whatever) from Parsoid and showing that?
(12:05:23 PM) cscott: James_F: something like that. not sure where in the stack language conversion will live exactly. gwicke_away is talking about it as a post-processing pass.
(12:05:42 PM) cscott: this would also allow language converter to work on portuguese and even en-gb/en-us.
(12:06:32 PM) cscott: ie, you always see 'color' in VE even if the source text was 'colour', but it doesn't get re-saved as 'color' unless you edit the sentence containing the word. (or paragraph? or word?)
(12:06:34 PM) James_F: Like link target hinting.
(12:07:04 PM) James_F: Selser is paragraph-level right now, I think?
(12:07:39 PM) cscott: i'm not sure, but i think so. html element-level.
(12:08:39 PM) cscott: it might be that we want to be more precise for better variant support -- or maybe not. maybe element-level marking of lang= is right (it avoids adding spurious <span> tags just to record the language variant) and we just want to be smarter about how we present diffs.
(12:09:18 PM) cscott: ie, color->colour shouldn't appear as a diff. (or for serbian, the change from latin to cyrillic alphabet shouldn't be treated as a diff, if the underlying content is the same)
(12:10:36 PM) cscott there are some tricky issues -- for some language pairs one encoding has strictly more information than the other. ie, in languages with arabic and latin orthographies, uppercase letters are specific to the latin script. so if the user writes the text natively in arabic, we won't necessarily know the correct capitalization (and the capitalization of the rest of the paragraph might be lost).
(12:11:05 PM) cscott: so lots of details. but we should be able to handle the 'easy' cases (where the languages convert w/o information loss) first.
(12:11:22 PM) ***cscott wonders if pig latin is a reversible transformation
(12:17:06 PM) MatmaRex: cscott: it's not, i'm afraid
(12:17:28 PM) MatmaRex: unless you rely on a dictionary
(12:17:39 PM) MatmaRex: as appleway might come from apple or wapple, i think
(12:17:41 PM) cscott: MatmaRex: well, i guess that makes it a great stand-in for the 'tricky' languages.
(12:18:20 PM) cscott: so much the better. ;)
(12:18:29 PM) MatmaRex: it's only the words starting with a vowel that are troublesome, though
(12:20:59 PM) cscott: i think the idea is that, if i edit in en-pig and type 'appleway' it should get saved as appleway and probably a default translation into en-us should be made? (ie, in the latin/arabic pairs, assume lowercase). There should be a specific UX affordance in VE to specify both sides of the variant, which serializes into -{en-pig:appleway,en-us:apple,en-gb:apple}-.
(12:24:45 PM) cscott: i guess when you edit text which was originally in en-us, it needs to be converted to -{en-us:apple,en-pig:appleway}- by the language converter so that information isn't lost when the edited en-pig text is saved back.
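As an aside, the ambiguity MatmaRex points out can be demonstrated with a naive Pig Latin rule (illustrative only, not anything in MediaWiki): both "apple" and "wapple" come out as "appleway", so the transformation cannot be reversed without extra information.

```javascript
// Naive Pig Latin: move a leading consonant cluster to the end and add "ay";
// vowel-initial words just get "-way" appended.
function toPigLatin(word) {
  const m = word.match(/^([^aeiou]+)(.*)$/i);
  return m && m[2] ? m[2] + m[1] + "ay"   // consonant onset moves to the end
                   : word + "way";        // vowel-initial case
}
```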

Variant conversion is not bijective, so we can't generally save automatically converted variants without information loss. Even manual conversion of entire sections is considered vandalism in the Chinese Wikipedia. Saving just edited text (down to the word level) would promote more mixed-variant text within the same section, which might not be desirable for wikitext editors.

So this is not easy, and a lot of issues need to be considered. IMO we should first make sure to have solid Parsoid and VE support for unconverted editing of variant-enabled wiki content.

Please see the worked example at the end of comment 6 for how variant conversion can be accomplished without information loss.

(In reply to comment #8)

Please see the worked example at the end of comment 6 for how variant
conversion can be accomplished without information loss.

Right, by storing the original text. Which was my point.

He7d3r renamed this task from A generalized conversion engine to A generalized language conversion engine. Nov 24 2014, 1:23 PM
He7d3r set Security to None.

Resurrecting this task, slightly.

Above when I wrote:

Please see the worked example at the end of comment 6 for how variant conversion can be accomplished without information loss.

I was referring to:

(12:20:59 PM) cscott: i think the idea is that, if i edit in en-pig and type 'appleway' it should get saved as appleway and probably a default translation into en-us should be made? (ie, in the latin/arabic pairs, assume lowercase). There should be a specific UX affordance in VE to specify both sides of the variant, which serializes into -{en-pig:appleway,en-us:apple,en-gb:apple}-.
(12:24:45 PM) cscott: i guess when you edit text which was originally in en-us, it needs to be converted to -{en-us:apple,en-pig:appleway}- by the language converter so that information isn't lost when the edited en-pig text is saved back.

To be more precise, *just before you edit the variant B text* in variant A, you do a reversible transformation from variant B to variant A, which entails adding markup like the apple/appleway example above to ensure that reconverting from A back to B yields *exactly* the original variant B text.

That is, assume the original text, in the en-us variant, was:

John was here. I ate an apple.

When I edit in en-pig the source text is first converted to:

<span id=X>Ohnjay -{en-us:was,en-pig:asway}- erehay.</span>
<span id=Y>-{en-us:I,en-pig:Iway}- -{en-us:ate,en-pig:ateway}- -{en-us:an,en-pig:anway}- -{en-us:apple,en-pig:appleway}-.</span>

And displayed in VE as:

Ohnjay asway erehay. Iway ateway anway appleway.

I'm using <span> tags to represent selser boundaries for the purposes of illustration; there would not necessarily be actual <span> tags present in the DOM. (The idea would be to use a language-dependent sentence segmentation algorithm; one implementation could indeed use synthetic <span> tags to indicate breaks.) Similarly, I'm using LanguageConverter syntax, but you can mentally substitute an appropriate DOM representation if you prefer. Note that "extra" information only needs to be stored for non-reversible constructs -- in Pig Latin, those are just the words ending in "-way".
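The pre-edit markup step described above might be sketched as follows (not actual VE/Parsoid code; `toPigLatin` is the naive rule from the IRC discussion, and `prepareForEdit` is a hypothetical name). Only words whose converted form ends in "-way" are ambiguous on the way back ("appleway" could be "apple" or "wapple"), so only those get two-sided markup, matching the worked example:

```javascript
// Naive Pig Latin rule (consonant onset moves to the end; vowels get "-way").
function toPigLatin(word) {
  const m = word.match(/^([^aeiou]+)(.*)$/i);
  return m && m[2] ? m[2] + m[1] + "ay" : word + "way";
}

// Produce the editable en-pig form of one en-us word, recording both sides
// whenever the naive reverse conversion would be ambiguous.
function prepareForEdit(word) {
  const pig = toPigLatin(word);
  return pig.endsWith("way") ? `-{en-us:${word},en-pig:${pig}}-` : pig;
}
```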

If I then changed "appleway" to "orangeway", VE would have an internal document like this:

<span id=X>Ohnjay -{en-us:was,en-pig:asway}- erehay.</span>
<span id=Y>-{en-us:I,en-pig:Iway}- -{en-us:ate,en-pig:ateway}- -{en-us:an,en-pig:anway}- orangeway.</span>

And when this was serialized to wikitext using the fine-grained selser we'd have the following en-us wikitext, which preserves the original en-us text for span X:

John was here. I ate an worange.

Note that this reads correctly when converted to en-pig, but we've introduced an error in the en-us variant: it should be "orange" not "worange". This is part of the inherent tradeoff of LanguageConverter, and occurs frequently in zhwiki during edits. The community prefers errors such as these to be immediately visible so that an en-us speaker can see the problem and fix it quickly. This is different from the delayed model of editing supported by Content Translation.

VE should, of course, provide good visibility of both variants during editing (client-side conversion?) and an explicit UX affordance to edit the "nondefault variant" so that a user fluent in both en-us and en-pig could author -{en-us:orange,en-pig:orangeway}- directly and avoid the variant conversion error.

(One wrinkle: currently LanguageConverter uses no explicit marking of variant in the article text. Stuff which "looks like" variant A is converted to variant B and vice-versa. Articles contain multiple variants interleaved. This seems to mostly work fine in practice, especially for script conversions where character set can be used to identify the variant being used. We probably want to separate the variant sniffing out as an orthogonal issue and introduce some synthetic elements for explicit variant markup -- maybe just <span lang=en-us> -- so that it is clear during html2wt which variant we expect to generate for every region of the article.)

Adding a cross reference here to T43716 (Parsing language converter syntax in parsoid) and T113002/T87652 (dev summit discussion sessions).

Such an interesting idea, but it'll be hard for languages that don't even have spaces in their writing. Javanese has several writing systems, two of them being the Javanese script and the Latin script. The Javanese script doesn't distinguish between capital and lowercase letters, and it doesn't have spaces between words. The Latin script, on the other hand, works like English: it distinguishes capital letters and also uses spaces.


  1. British<>American English and Portuguese variant conversion
  2. Different options to enable/disable the system in various ways, with additional user settings allowing custom rules and separate sets of rules
  3. Some words on the language converter in editing mode
  4. JavaScript conversion tool
  5. Sentence-based conversion tool
  6. Classical Chinese Kanbun conversion
  7. Multiple parallel conversions

  1. Would it be a good idea to ask for an implementation of the language converter that works between British English and American English? Benefits include that there would no longer be any need to force editors to use a fixed variant in a given article, as that would be handled by the language converter, and people could read Wikipedia in the spellings they are accustomed to. It would probably also be easier for English-based developers to work on the language converter, and it would attract the interest of English-speaking developers in the module.
    1. And what is currently blocking T28121 at this stage? Someone mentioned that Portuguese being a character-based space-delimited language that would require more than word-to-word conversion might make it difficult to implement the language converter on the Portuguese Wikipedia, and T17161 is listed as a subtask of it; but an English <> Pig Latin conversion is already available, and that is also word-based?
  2. Currently, enabling/disabling the language converter is done across an entire wiki, but that might not suit every need. For instance, as shown in the recent request for a Cyrillic converter on the Romanian Wikipedia (T169453), a small number of users would be interested in reading the wiki in the Cyrillic alphabet, but most Romanian users reject the idea. The situation is not unique to the Romanian Wikipedia: when a JavaScript-based language conversion tool was implemented on the Cantonese Wikipedia, there were similar debates. Therefore, I think it would be a good idea that:
    1. An option should be available to users within individual wikis that implement the language converter. Users should be able to select in their personal preferences whether they want the Language Converter button to appear.
    2. A wiki-wide setting should be made available so that the admins of a wiki can determine, according to community consensus, whether such a tool should be enabled for all users, enabled by default but switchable off by users, disabled by default but switchable on by individual users, or disabled entirely even when the language conversion has been installed.
    3. Additionally, a setting allowing individual users to configure their own conversion rules would probably be a nice idea too. Currently there are only general rules per region/script/variant of a language, but in cases where there is no formal orthography, individual users might write things slightly differently from how others write them. As such, a personalized user option for what to convert could also be a good idea.
    4. And then, on the Chinese Wikipedia, the conversion options currently available include, for example, unconverted, hans, hant, CN, SG, TW, HK, and MO. I think it would make more sense, instead of separating the simple hans/hant character conversion from the other regional settings, for the system to separate regional variant conversion from script conversion and provide checkboxes, both in the article reading interface and in the personal preferences, for users to select which level of conversion they want the tool to perform. The conversion database would need to be separated accordingly in that case.
  3. Currently, the language converter does not support converting text during editing in the source mode (I'm not sure about the visual mode). I have encountered a few users who are literate in only one of the multiple scripts in use and find it difficult to edit in such a situation. (I have just read a complaint about this related to the Chinese Wikipedia; on the Cantonese Wikipedia there is a JavaScript tool that partially helps with it; and on the Korean Wikipedia this has been one of the points against a potential conversion implementation for the Hanja script.) As such, the language converter should probably support converting languages in the source mode.
    1. However, it is important that the language converter should not save the result of converting the wikicode. It should compare what the user has edited after the conversion, apply only those parts back to the original wikicode, and leave the code for anything the user has not edited unaffected by the conversion tool.
  4. Was there any previous discussion of why the Cantonese Wikipedia uses its own JavaScript tool to convert between different scripts instead of the language converter?
  5. Inner Mongolia University has developed a conversion tool between Cyrillic Mongolian and the traditional Mongolian script, known as the "Conversion System between Traditional Mongolian and Cyrillic Mongolian" (, and the technical background is apparently discussed in this paper: . It claims that, with a language-model-based conversion approach, a correctness rate of up to 87% can be achieved. However, this approach means it must analyse the entire sentence as a whole to figure out the context of each word before a likely conversion candidate can be selected. That means a language converter that works on an entire sentence as a unit, instead of an individual word, would be required. The situation for other languages that use both a phonetic writing system and an ideographic writing system would be similar. It would probably be a good idea for a generalized conversion system to be able to work based on the context of the sentence in which individual words are located.
  6. For Classical Chinese, there is a dedicated system, developed using some very specific rules, for applying reading marks to Classical Chinese text, with effects including adding extra characters, ignoring unneeded characters, rearranging the reading order of characters in a sentence, and so on, so that the text becomes understandable to anyone who can read Old Japanese. If the rules are provided, is it possible for the language converter to:
    1. Automatically add Japanese reading marks to Chinese text when desired, and
    2. Automatically convert Classical Chinese text into Japanese reading order?
    3. Additionally, it seems that similar systems have also been developed for other non-Chinese, non-Japanese readers in history. Is there any further information about that?
  7. Is there a way to run multiple language conversion schemes at the same time? For example, the Chinese Wikipedia currently runs conversion between Traditional and Simplified Chinese characters with different regional variants. However, in various cases (though rarely), pinyin and zhuyin romanization or phonetic writing schemes are also used in article text, either because the thing being described is only noted phonetically or because the article is discussing phonetics itself. Would it be possible to additionally develop a pinyin<>zhuyin conversion system and deploy it on the Chinese Wikipedia, side by side with the Chinese character conversion system?
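The "apply only the edited parts back" idea from point 3.1 could be sketched roughly as a three-way paragraph merge (a simplification: it assumes paragraphs are neither added nor removed, and `mergeEdit` is a hypothetical name, not an existing API):

```javascript
// After a user edits machine-converted source text, splice only the
// paragraphs they actually changed back into the original wikicode.
//   originalParas:  paragraphs as stored (unconverted)
//   convertedParas: the converted text the user was shown
//   editedParas:    what the user submitted
function mergeEdit(originalParas, convertedParas, editedParas) {
  return editedParas.map((para, i) =>
    para === convertedParas[i]
      ? originalParas[i]  // untouched by the user: keep the unconverted original
      : para              // edited: save the user's (converted-variant) text
  );
}
```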

Currently the enabling/disabling of the availability of language converter is made across the entire wiki, however it might not always suit every needs.

In support of this. LC can always be provided safely as long as it is not the default behavior (as triggered by Accept-Language).

setting that allow individual users to configure their own conversion rule would probably be a nice idea too

It would be, but that might be customizing things a bit too much... Using fixed tables allows for caching, while customized rules make it sound like a JS version of LC should be used for them. Something along the lines of:

function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function strtr(str, pairs) {
  let regex = new RegExp(Object.keys(pairs).map(escapeRegExp).join("|"), "g");
  return str.replace(regex, (match) => pairs[match]);
}

function execute(text) {
  let table = getDefault();  // assumed to fetch the user's conversion table
  // some regex match for -{ }- and some calls to strtr
  return strtr(text, table);
}

(Oh, since -{ }- was mentioned: something still needs to strip off those marks, and to not strip them when the JS is on.)

Currently, the language converter does not support converting text in the process of editing in the source mode

That's a great point. One way to work around it is just previewing the page or pasting stuff into the sandbox... but I feel ya. A JS-based LC gadget might help in this case too, since mass replacement is obviously bad and the server should not encourage it.

Was there any discussion being made before regarding why Cantonese Wikipedia uses its own Javascript tool to convert between different scripts instead of language converter?

It's terrible! I agree! But I don't speak Cantonese so I don't get a say.

Larger-context conversions (sentence-based, etc.)

This would not be possible with the current strtr() replace scheme, but there is no reason every converter has to use replacement either, I think. Some abstraction at the interface level may do.

Is there a way to run multiple language conversion scheme at the same time?

Yes. Just join the two tables together. Like on zhwp we add IT tables to the basic table, we can also add the pinyin-zhuyin tables.

The main issue is that joining the tables together does not restrict where the tables are applied, so we cannot do it per letter or it will screw EVERYTHING up; we have to enumerate every possible syllable. If LC allowed removing entries (or having a stack of tables), we would have an easier time, since we could just add the entries before a pinyin passage and remove them afterwards.
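The stack-of-tables idea might look roughly like this (a sketch, not the actual LC interface): later layers shadow earlier ones, and a layer can be pushed before converting a pinyin passage and popped afterwards.

```javascript
// A stack of conversion tables: lookup walks from the top layer down,
// so a temporarily pushed table shadows the base table without altering it.
class TableStack {
  constructor() { this.layers = []; }
  push(table)   { this.layers.push(table); }
  pop()         { return this.layers.pop(); }
  lookup(key) {
    for (let i = this.layers.length - 1; i >= 0; i--) {
      if (key in this.layers[i]) return this.layers[i][key];
    }
    return undefined;  // no layer defines this entry
  }
}
```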

join the two tables together.

What I mean in terms of "multiple language converters" is a script-level conversion separate from vocabulary-level conversion. Currently the language converter on the Chinese Wikipedia actually merges two different things to handle this: differences in script and differences in vocabulary. It is quite possible for a resident of Mainland China to want Traditional Chinese with Mainland China vocabulary, but the current conversion scheme cannot cater to that, and it wouldn't make sense to add a separate conversion table just for the minority of users who might have a complicated family and language-learning background and thus prefer a different script from the one they currently use.