[Spike] Investigate how Chinese writing variants are stored
Closed, Resolved (Public)

Description

We're trying to work out how zh characters are stored.

Event Timeline

ToAruShiroiNeko claimed this task.
ToAruShiroiNeko raised the priority of this task from to Medium.
ToAruShiroiNeko updated the task description.
ToAruShiroiNeko subscribed.
Halfak added a project: Language-Team.
Halfak set Security to None.

Chinese Wikipedia has five variants: zh-cn, zh-sg, zh-hk, zh-tw, and zh-mo. The first two use Simplified Chinese and the latter three use Traditional Chinese. Zhaofeng_Li tells me on IRC that each of the five has some unique phrases, but that it would still be difficult to distinguish between them. It is possible to distinguish Simplified Chinese from Traditional Chinese, since they use different character sets. Note that changing wikitext from one variant of Chinese to another is considered vandalism.
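As a rough illustration of the Simplified vs. Traditional distinction, here is a minimal sketch that counts characters belonging to only one of the two scripts. The character pairs and the name `guess_script` are hypothetical and chosen just for this example; they are not taken from MediaWiki. This can at best tell the script apart, not the regional variant (zh-cn vs. zh-sg, and so on).

```python
# Illustrative (simplified, traditional) pairs only. This is an assumption for
# the sketch, NOT the real conversion table, which is far larger.
PAIRS = [
    ("号", "號"), ("铁", "鐵"), ("达", "達"), ("湾", "灣"),
    ("会", "會"), ("独", "獨"), ("汇", "匯"), ("国", "國"),
]
SIMPLIFIED_ONLY = {simplified for simplified, _ in PAIRS}
TRADITIONAL_ONLY = {traditional for _, traditional in PAIRS}


def guess_script(text):
    """Return 'simplified', 'traditional', 'mixed', or 'unknown' for text."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified == traditional:
        # No script-specific characters at all, or an exact tie.
        return "unknown" if simplified == 0 else "mixed"
    return "simplified" if simplified > traditional else "traditional"
```

Many characters are shared by both scripts, so short strings will often come back as "unknown" with a table this small.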

The wiki stores the user's input as typed, so if a user writes in Simplified Chinese, the diff shows Simplified Chinese.

Below is an example diff in Simplified Chinese (zh-cn/zh-sg).

https://zh.wikipedia.org/w/index.php?title=%E7%8B%AC%E7%89%B9%E7%A4%BE%E4%BC%9A_%28%E5%8A%A0%E6%8B%BF%E5%A4%A7%E6%94%BF%E6%B2%BB%29&curid=5185622&diff=38614646&oldid=38614626

Below is an example diff in Traditional Chinese (zh-hk/zh-tw/zh-mo).

https://zh.wikipedia.org/w/index.php?title=%E5%8F%B0%E6%B9%BE%E5%90%8D%E5%98%B4%E6%B1%87&curid=4034839&diff=38614612&oldid=34971394

The notation below is used in wikitext when automatic conversion cannot be performed.

https://zh.wikipedia.org/w/index.php?title=%E6%B3%B0%E5%9D%A6%E5%B0%BC%E5%85%8B%E5%8F%B7&action=edit
-{zh-cn:'''泰坦尼克號'''; zh-tw:'''鐵達尼號''';zh-sg:'''鐵達尼'''; zh-hk:'''鐵達尼號''';}-
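To show how this notation could be pulled out of wikitext for training data, here is a minimal sketch that extracts the per-variant text from each `-{ ... }-` block. The function name `parse_variant_rules` is hypothetical, and the sketch assumes only the simple `variant:text;` form shown above; any other forms the real markup allows, or text that itself contains `;`, are not handled.

```python
import re

# Matches one -{ ... }- block, non-greedily, across line breaks.
RULE_BLOCK = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def parse_variant_rules(wikitext):
    """Return one {variant: text} dict per -{...}- block found in wikitext."""
    blocks = []
    for match in RULE_BLOCK.finditer(wikitext):
        rules = {}
        for part in match.group(1).split(";"):
            # Split on the first colon only, so colons in the text survive.
            variant, sep, text = part.partition(":")
            if sep and variant.strip().startswith("zh"):
                rules[variant.strip()] = text.strip()
        if rules:
            blocks.append(rules)
    return blocks
```

Called on the example line above, this yields a single dict with the keys zh-cn, zh-tw, zh-sg, and zh-hk, each mapped to its bold-quoted title.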

Pinging @liangent and @Chiefwei for more information about that template.

The idea here is to identify user input in each revision even when the template is not used, but we would of course benefit from using this template to train our models to distinguish between variants.

FYI, the hardcoded conversion table is in includes/ZhConversion.php, which may help determine the language variant of arbitrary text.

ToAruShiroiNeko renamed this task from Investigate how Chinese writing variants are stored to [Spike] Investigate how Chinese writing variants are stored. Jan 1 2016, 4:33 PM

There is a function LanguageConverter::guessVariant().

LanguageZh does not appear to implement guessVariant, and please do not implement it; see T191571: LanguageConverter::guessVariant should go away.