Maniphest T193366

Unidirectional Hanyu Pinyin output for Chinese LanguageConverter, as a proof-of-concept/testing system
Open, Needs Triage, Public

Assigned To

None

Authored By

	Arthur2e5
	Apr 30 2018, 2:16 AM

Description

The LanguageConverter system, since being deployed to languages other than Chinese, has shown that it can convert between scripts. One thing apparently left to do is converting Chinese into other scripts, such as Hanyu Pinyin (zh-latn-pinyin), much like what is done for English with the test variant Pig Latin (en-x-piglatin). The conversion will necessarily be unidirectional, because going the other way is highly ambiguous and would break a lot of things.

Hanyu Pinyin itself can be fun and slightly challenging to implement. As with other languages, LC can start with a character-by-character pronunciation table, combined with a word-based table for characters with multiple pronunciations. The orthography will necessitate punctuation conversion and automatic capitalization, which can be trickier within the current LC framework. Whitespace handling, something not usually seen in zh, will also be heavily tested.

Related Objects

Mentioned In: T165882: New namespace for zh-min-nan Wikipedia
Mentioned Here: T165882: New namespace for zh-min-nan Wikipedia

Event Timeline

Arthur2e5 created this task. Apr 30 2018, 2:16 AM

Restricted Application added a subscriber: Aklapper. Apr 30 2018, 2:16 AM

Arthur2e5 updated the task description. Apr 30 2018, 2:16 AM

Liuxinyu970226 subscribed. Apr 30 2018, 4:34 AM

This task needs an actual software project tag which allows someone to find this task, so I am adding MediaWiki-Language-converter

Shizhao moved this task from Backlog to MediaWiki core on the Chinese-Sites board. May 2 2018, 3:13 AM

Shizhao subscribed. May 8 2018, 3:01 AM

Shizhao unsubscribed.

I'm not sure whether this is needed, as there is a Pinyin test Wikipedia available on Incubator.

• Vvjjkkii renamed this task from Unidirectional Hanyu Pinyin output for Chinese LanguageConverter, as a proof-of-concept/testing system to 1zdaaaaaaa. Jul 1 2018, 1:13 AM

• Vvjjkkii triaged this task as High priority.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed a subscriber: Aklapper.

1339861mzb renamed this task from 1zdaaaaaaa to Unidirectional Hanyu Pinyin output for Chinese LanguageConverter, as a proof-of-concept/testing system. Jul 1 2018, 5:39 PM

1339861mzb updated the task description.

JJMC89 raised the priority of this task from High to Needs Triage. Jul 1 2018, 5:40 PM

JJMC89 removed projects: TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

JJMC89 added a subscriber: Aklapper.

A better justification might be demonstrating such a feature for... (coughs) min-nan Wikipedia, which IMO has a pretty terrible situation with Romanization and scripts last time I checked. They seem to be using the article pages to hold the romanized article and the talk pages for the Chinese characters, which is obviously suboptimal and often not synced.

@Arthur2e5 isn't your comment related to T165882 instead?

BJ6123C7BTD subscribed. Jul 12 2018, 10:02 AM

cscott subscribed. Aug 28 2018, 4:52 PM

I'm working on LanguageConverter these days, so this would indeed be an interesting project. I could advise, but I don't have the linguistic expertise to actually develop the converter.

There are some precompiled zh-pinyin databases at https://github.com/mozillazg/pinyin-data (MIT license, although the Unihan-derived parts may be licensed differently), as well as one for phrases (https://github.com/mozillazg/phrase-pinyin-data).

LC has a single-character table for Hans/Hant conversion; I suggest that we model the single-character Han-to-Latn table similarly, as a list of entries like "中" => " zhōng". For characters with multiple readings, we take the first entry, assuming it is the most common.
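A rough sketch of how such a table could be derived, in Python rather than LC's own PHP purely for illustration. The input line format (`U+XXXX: reading,reading  # comment`) is my assumption about the pinyin-data file and should be checked against the actual dataset; `SAMPLE`, `build_char_table` and `CHAR_TABLE` are hypothetical names.

```python
# Assumed input format (not verified against the real dataset):
#   U+4E2D: zhōng,zhòng  # 中
SAMPLE = """\
# hypothetical excerpt
U+4E2D: zhōng,zhòng  # 中
U+6587: wén  # 文
U+884C: xíng,háng,hàng,héng  # 行
"""


def build_char_table(text):
    """Map each character to ' ' + its first listed reading."""
    table = {}
    for line in text.splitlines():
        # Drop trailing comments and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, readings = line.split(":", 1)
        char = chr(int(code.strip()[2:], 16))  # "U+4E2D" -> "中"
        first = readings.split(",")[0].strip()
        if first:
            # The leading space separates words implicitly in the output.
            table[char] = " " + first
    return table


CHAR_TABLE = build_char_table(SAMPLE)
```

Taking the first reading is only as good as the ordering of the source data, so the result would still need review for common characters such as 行.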

The phrase/word table will need a few changes for whitespace. Instead of "一叶扁舟" => " yí yè piān zhōu", we delete most whitespace (all of it, until we can review the data) to conform to the orthography, so the entry becomes "一叶扁舟" => " yíyèpiānzhōu". The -儿 suffix also needs a fair amount of processing; basically every " er" suffix in that dataset should be replaced with "r". The current approach, which just takes the longest match with strtr, is sufficient.
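To make the phrase handling concrete, here is a minimal Python sketch (again only an illustration, not LC code). `normalize_phrase` and `convert` are hypothetical helpers; treating only a toneless trailing "er" syllable as the -儿 suffix is my assumption about the dataset, and `convert` imitates the longest-match behaviour of PHP's strtr with an array argument.

```python
def normalize_phrase(reading):
    """Join space-separated syllables into one word with a leading space."""
    syllables = reading.split()
    # Assumption: the -儿 suffix appears as a toneless final "er" syllable.
    if len(syllables) > 1 and syllables[-1] == "er":
        syllables[-2:] = [syllables[-2] + "r"]
    return " " + "".join(syllables)


def convert(text, table):
    """Replace the longest matching key at each position, left to right."""
    max_len = max(map(len, table), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so phrases beat single characters.
        for size in range(min(max_len, len(text) - i), 0, -1):
            hit = table.get(text[i:i + size])
            if hit is not None:
                out.append(hit)
                i += size
                break
        else:
            # No entry starts here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


TABLE = {
    "一": " yī", "叶": " yè", "扁": " biǎn", "舟": " zhōu",
    "一叶扁舟": normalize_phrase("yí yè piān zhōu"),
}
```

With this table, `convert("一叶扁舟", TABLE)` uses the phrase entry rather than the four single-character entries, while a lone 舟 still falls back to the character table.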

Note the leading space in both examples: they serve to separate words implicitly. The pinyin orthography is ironically better described on en.wp than on zh.wp, so allow me to just point you to https://en.wikipedia.org/wiki/Pinyin#Words,_capitalization,_initialisms_and_punctuation.

Note on punctuation: the official standard only specifies replacements from —|。|……|、 to -|.|…|,, since they are the only ones that differ in the *form* of the characters. But different characters are nevertheless used for the CJK versions of many Western punctuation marks. A more complete conversion would replace 《|》|〈|〉|～|——|—|「|」|：|；|？|！|。|，|、|…… with «|»|‹|›|~|—|–|“|”|:|;|?|!|.|,|,|…. Just use your "pretty Unicode" English common sense. (I am slightly saddened by the fact that this conversion will get rid of the extra layer of listing provided by the ideographic comma. May I suggest replacing 、 with ⹁ (U+2E41) instead?)
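The more complete mapping above could be sketched like this in Python (illustrative only; `convert_punctuation` and its `ideographic_comma` parameter are hypothetical names). Longer sequences are matched first so that —— and …… are not split into their single-character forms.

```python
import re

# The two pipe-separated lists from the comment above, paired by position.
SOURCE = "《|》|〈|〉|～|——|—|「|」|：|；|？|！|。|，|、|……".split("|")
TARGET = "«|»|‹|›|~|—|–|“|”|:|;|?|!|.|,|,|…".split("|")
PUNCT = dict(zip(SOURCE, TARGET))

# Longest keys first, so "——" wins over "—" at the same position.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PUNCT, key=len, reverse=True))
)


def convert_punctuation(text, ideographic_comma=","):
    """Replace CJK punctuation; the target for 、 can be overridden."""
    table = {**PUNCT, "、": ideographic_comma}
    return PATTERN.sub(lambda m: table[m.group(0)], text)
```

Passing `ideographic_comma="\u2E41"` gives the ⹁ variant suggested above. Spacing after the converted punctuation is not handled here.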

Angrydog001 subscribed. Jun 2 2019, 11:51 AM

Winston_Sung subscribed. Jul 7 2021, 7:16 PM

Winston_Sung moved this task from Backlog to Converter-specific on the MediaWiki-Language-converter board. Mar 18 2023, 4:08 AM

Restricted Application added a subscriber: Ericliu1912. Mar 18 2023, 4:09 AM

Winston_Sung moved this task from Converter-specific to New converter / variant on the MediaWiki-Language-converter board. Aug 10 2023, 8:40 AM
