[Spike] Investigate how Chinese writing variants are stored
Closed, Resolved · Public

Description

We're trying to work out how zh characters are stored.

Event Timeline

ToAruShiroiNeko updated the task description.
ToAruShiroiNeko raised the priority of this task to Normal.
ToAruShiroiNeko claimed this task.
ToAruShiroiNeko added a subscriber: ToAruShiroiNeko.
Restricted Application added a subscriber: Aklapper. · Nov 26 2015, 1:04 AM
Halfak updated the task description. · Dec 4 2015, 6:27 PM
Halfak added a project: Language-Team.
Halfak set Security to None.
ToAruShiroiNeko added a comment. · Edited · Dec 31 2015, 9:37 AM

Chinese Wikipedia supports five variants: zh-cn, zh-sg, zh-hk, zh-tw, and zh-mo. The first two use Simplified Chinese and the latter three use Traditional Chinese. Zhaofeng_Li told me on IRC that all five have some unique phrases, but that it would be difficult to distinguish between them. It is possible to distinguish between Simplified Chinese and Traditional Chinese because they use different character sets. Note that changing one variant of Chinese to another in wikitext is considered vandalism.
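As a rough illustration of the Simplified/Traditional distinction, here is a minimal Python sketch. The two character sets are tiny hand-picked samples and `guess_script` is a hypothetical helper, not anything that exists in MediaWiki; a usable classifier would presumably need a much larger table.

```python
# Hypothetical sample sets: each Simplified character below has the
# Traditional counterpart at the same position in the second string.
SIMPLIFIED_ONLY = set("国会这说时为发见东车")
TRADITIONAL_ONLY = set("國會這說時為發見東車")


def guess_script(text: str) -> str:
    """Guess the script by counting characters unique to each sample set."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    # No distinguishing characters, or an even mix.
    return "unknown"
```

This only separates the two scripts. It says nothing about which of the five variants a text belongs to, which matches the point above that the regional variants are hard to tell apart.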

The wiki saves the user's input as typed, so if the user writes in Simplified Chinese, the diff will show Simplified Chinese.

Below is an example diff in Simplified Chinese (zh-cn/zh-sg):

https://zh.wikipedia.org/w/index.php?title=%E7%8B%AC%E7%89%B9%E7%A4%BE%E4%BC%9A_%28%E5%8A%A0%E6%8B%BF%E5%A4%A7%E6%94%BF%E6%B2%BB%29&curid=5185622&diff=38614646&oldid=38614626

Below is an example diff in Traditional Chinese (zh-hk/zh-tw/zh-mo):

https://zh.wikipedia.org/w/index.php?title=%E5%8F%B0%E6%B9%BE%E5%90%8D%E5%98%B4%E6%B1%87&curid=4034839&diff=38614612&oldid=34971394

The notation below is used in wikitext when an automatic conversion could not be performed:

https://zh.wikipedia.org/w/index.php?title=%E6%B3%B0%E5%9D%A6%E5%B0%BC%E5%85%8B%E5%8F%B7&action=edit
-{zh-cn:'''泰坦尼克號'''; zh-tw:'''鐵達尼號''';zh-sg:'''鐵達尼'''; zh-hk:'''鐵達尼號''';}-
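For anyone wanting to pull these manual rules out of a revision's wikitext, a minimal Python sketch follows. `parse_manual_rules` is a hypothetical helper, and it assumes the simple `variant:text;` form shown above. Rules carrying flags (such as `-{A|...}-`) or values that themselves contain `;` or `:` are not handled.

```python
import re

# Matches the -{ ... }- manual conversion markup, non-greedily.
MANUAL_RULE = re.compile(r"-\{(.*?)\}-", re.DOTALL)


def parse_manual_rules(wikitext: str) -> list[dict[str, str]]:
    """Return one {variant: text} mapping per -{...}- block found."""
    rules = []
    for match in MANUAL_RULE.finditer(wikitext):
        mapping = {}
        for part in match.group(1).split(";"):
            variant, sep, value = part.partition(":")
            variant = variant.strip()
            # Keep only entries that look like "zh-xx:text".
            if sep and variant.startswith("zh"):
                mapping[variant] = value.strip()
        if mapping:
            rules.append(mapping)
    return rules
```

Run on the line above, this yields a single mapping with the keys zh-cn, zh-tw, zh-sg and zh-hk, each holding the bolded title for that variant.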

@ToAruShiroiNeko: We have a much easier way: use template {{noteTA}}.

Pinging @liangent and @Chiefwei for more information about that template.

The idea here is to identify user input from each revision even when the template is not used, but we would of course benefit from training our models on uses of this template to distinguish between variants.

FYI, the hardcoded conversion table is includes/ZhConversion.php, which may help determine the language variant of arbitrary text.

ToAruShiroiNeko renamed this task from Investigate how Chinese writing variants are stored to [Spike] Investigate how Chinese writing variants are stored.

There is a function LanguageConverter::guessVariant().

Halfak closed this task as Resolved. · Jan 21 2016, 3:43 PM
cscott added a subscriber: cscott. · Apr 5 2018, 9:50 PM

LanguageZh does not appear to implement guessVariant, and please do not implement it; see T191571: LanguageConverter::guessVariant should go away.