
Unidirectional Hanyu Pinyin output for Chinese LanguageConverter, as a proof-of-concept/testing system
Open, Needs Triage, Public


The LanguageConverter system, after being deployed to languages other than Chinese, has shown that it can convert across scripts. One thing apparently left to do is converting Chinese into other scripts, such as Hanyu Pinyin (zh-latn-pinyin), much like what is done for English with the test variant Pig Latin (en-x-piglatin). The conversion will necessarily be unidirectional, as going the other way would introduce many ambiguities and break a lot of things.

Hanyu Pinyin itself can be fun and slightly challenging to implement. As with other languages, LC can start with a character-by-character pronunciation table, combined with a word-based table for characters with multiple pronunciations. The orthography will necessitate punctuation conversion and automatic capitalization, which can be trickier within the current LC framework. Whitespace handling, something not usually seen in zh, will also be heavily tested.

Event Timeline

This task needs an actual software project tag which allows someone to find this task, so I am adding MediaWiki-Language-converter

I'm not sure whether this is indeed needed, as there is a Pinyin test Wikipedia available on Incubator.

Vvjjkkii renamed this task from Unidirectional Hanyu Pinyin output for Chinese LanguageConverter, as a proof-of-concept/testing system to 1zdaaaaaaa. (Jul 1 2018, 1:13 AM)
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
1339861mzb renamed this task from 1zdaaaaaaa to Unidirectional Hanyu Pinyin output for Chinese LanguageConverter, as a proof-of-concept/testing system. (Jul 1 2018, 5:39 PM)
1339861mzb updated the task description.
1339861mzb updated the task description.

A better justification might be demonstrating such a feature for... (coughs) min-nan Wikipedia, which IMO has a pretty terrible situation with Romanization and scripts last time I checked. They seem to be using the article pages to hold the romanized article and the talk pages for the Chinese characters, which is obviously suboptimal and often not synced.

I'm working on LanguageConverter these days, so this would indeed be an interesting project. I could advise, but I don't have the linguistic expertise to actually develop the converter.

There are some precompiled zh-pinyin databases at (MIT license, although the Unihan derivatives may be taken otherwise), as well as for phrases (

LC has a single-character table for Hans/Hant conversion; I suggest that we model the single character Han-to-Latn table similarly, in the form of a bunch of "中" => " zhōng". For characters with multiple readings, we take the first entry assuming it would be the most common.
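
As a rough illustration of the single-character table, here is a minimal Python sketch (LanguageConverter itself is PHP; Python is used only to keep the example short). The `READINGS` layout, the sample entries, and the function names are hypothetical; the only ideas taken from the comment above are "take the first reading" and "prefix each value with a space".

```python
# Hypothetical input: each character mapped to its readings, with the most
# common reading assumed to come first (an assumption about the dataset).
READINGS = {
    "中": ["zhōng", "zhòng"],
    "文": ["wén"],
    "行": ["xíng", "háng"],
}


def build_char_table(readings):
    """Keep only the first reading and prepend the word-separating space."""
    return {char: " " + options[0] for char, options in readings.items() if options}


CHAR_TABLE = build_char_table(READINGS)


def convert_chars(text, table=CHAR_TABLE):
    """Convert character by character; characters not in the table pass through."""
    return "".join(table.get(ch, ch) for ch in text)


print(repr(convert_chars("中文")))  # ' zhōng wén'
```

Picking the first reading is only as good as the ordering of the source data, which is why the word-level table is still needed for characters such as 行.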

The phrase/word table will need a few changes for whitespace. Instead of doing "一叶扁舟" => " yí yè piān zhōu", we delete most (until we can review it, all) whitespace to conform to the orthography, so it becomes "一叶扁舟" => " yíyèpiānzhōu". There is also plenty of processing required for the -儿 suffix; basically every " er" suffix should be replaced with "r" in that dataset. The current approach, which just takes the longest match with strtr, is sufficient.
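
The two steps described here (normalizing the phrase readings, then replacing by longest match) could look roughly like the following Python sketch. It is not LanguageConverter code: `normalize_phrase`, `convert`, and the sample tables are hypothetical, and the matching loop only imitates the longest-key-first behaviour of PHP's `strtr()` when given an array. The sketch also ignores the apostrophe that pinyin orthography requires before some syllables.

```python
def normalize_phrase(pinyin):
    """Turn a spaced reading such as 'yí yè piān zhōu' into ' yíyèpiānzhōu'."""
    spaced = " ".join(pinyin.split())
    # Assumption: a trailing toneless " er" in the dataset is the -儿 suffix.
    if spaced.endswith(" er"):
        spaced = spaced[:-3] + "r"
    # Drop all remaining spaces, then add the leading word separator.
    return " " + spaced.replace(" ", "")


def convert(text, table):
    """Greedy replacement that tries the longest key first at each position."""
    max_len = max(map(len, table), default=0)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in table:
                out.append(table[piece])
                i += size
                break
        else:
            # No key matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical sample data, not taken from any real dataset.
PHRASES = {"一叶扁舟": "yí yè piān zhōu", "花儿": "huā er"}
CHARS = {"一": " yī", "叶": " yè", "舟": " zhōu", "花": " huā"}
TABLE = {**CHARS, **{k: normalize_phrase(v) for k, v in PHRASES.items()}}

print(repr(convert("一叶扁舟", TABLE)))  # ' yíyèpiānzhōu'
print(repr(convert("花儿舟", TABLE)))    # ' huār zhōu'
```

Merging both tables into one dictionary lets the phrase entries win over single characters simply because they are longer, so no separate priority logic is needed.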

Note the leading space in both examples: it serves to separate words implicitly. The pinyin orthography is ironically better described on en.wp than on zh.wp, so allow me to just point you to,_capitalization,_initialisms_and_punctuation.

Note on punctuation: the official standard only specifies replacements from —|。|……|、 to -|.|…|,, since those are the only marks that differ in the *form* of the characters. But different characters are nevertheless used for the CJK versions of many Western punctuation marks. A more complete conversion would replace 《|》|〈|〉|~|——|—|「|」|:|;|?|!|。|,|、|…… with «|»|‹|›|~|—|–|“|”|:|;|?|!|.|,|,|…. Just use your "pretty Unicode" English common sense. (I am slightly saddened by the fact that this conversion will get rid of the extra layer of listing provided by the ideographic comma. May I suggest replacing 、 with ⹁ (U+2E41) instead?)
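
A minimal Python sketch of the "more complete" punctuation mapping follows. The pairs mirror the list in the comment above; reading the source-side marks as their full-width forms is an assumption, and the names `PUNCTUATION` and `convert_punctuation` are hypothetical. Multi-character marks (—— and ……) are tried before their single-character prefixes so they are not split.

```python
import re

# Source mark -> replacement, following the "more complete" list above.
# The full-width source forms are an assumption about what was meant.
PUNCTUATION = {
    "《": "«", "》": "»", "〈": "‹", "〉": "›",
    "~": "~", "——": "—", "—": "–",
    "「": "“", "」": "”",
    ":": ":", ";": ";", "?": "?", "!": "!",
    "。": ".", ",": ",", "、": ",", "……": "…",
}

# Longest keys first, so "——" and "……" match before "—" or a single "…".
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(PUNCTUATION, key=len, reverse=True))
)


def convert_punctuation(text):
    """Replace CJK punctuation in a single pass over the text."""
    return _PATTERN.sub(lambda match: PUNCTUATION[match.group(0)], text)


print(convert_punctuation("《书》……好!"))  # «书»…好!
print(convert_punctuation("甲——乙—丙"))    # 甲—乙–丙
```

Doing the substitution in a single pass matters here: the output of one rule (— from ——) is itself the input of another (— to –), so applying the rules one after another would convert some marks twice. Swapping the value for 、 to ⹁ (U+2E41) would implement the suggestion about keeping the ideographic comma distinct.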