We're trying to work out how zh characters are stored.
Description
Related Objects
- Mentioned Here
- T191571: LanguageConverter::guessVariant should go away
Event Timeline
See the Language engineering team at WMF: https://www.mediawiki.org/wiki/Wikimedia_Language_engineering
Chinese Wikipedia holds five variants: zh-cn, zh-sg, zh-hk, zh-tw, zh-mo. The first two use Simplified Chinese and the latter three use Traditional Chinese. I am told by Zhaofeng_Li on IRC that all five have unique phrases to some degree, but it would be difficult to distinguish between them. It is possible to distinguish between Simplified Chinese and Traditional Chinese, as they use different character sets. Mind that changing one variant of Chinese to another in wikitext is considered vandalism.
The wiki saves the user's input as entered, so if the user types in Simplified Chinese, the diff will show Simplified Chinese.
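To illustrate the Simplified/Traditional distinction, here is a minimal Python sketch that counts characters belonging to only one of the two scripts. The two character sets are tiny hand-picked samples chosen for illustration, not a real table, and the function name `classify_script` is hypothetical. A real classifier would need a much larger mapping.

```python
# Tiny hand-picked samples (an assumption for illustration only):
# each simplified character below has the traditional counterpart
# at the same position in the second string.
SIMPLIFIED_ONLY = set("号铁达国语这个")
TRADITIONAL_ONLY = set("號鐵達國語這個")


def classify_script(text):
    """Return 'simplified', 'traditional', 'mixed' or 'unknown'."""
    simplified = sum(ch in SIMPLIFIED_ONLY for ch in text)
    traditional = sum(ch in TRADITIONAL_ONLY for ch in text)
    if simplified == 0 and traditional == 0:
        # No script-specific character seen (e.g. only shared characters).
        return "unknown"
    if simplified > traditional:
        return "simplified"
    if traditional > simplified:
        return "traditional"
    return "mixed"


if __name__ == "__main__":
    print(classify_script("泰坦尼克号"))
    print(classify_script("鐵達尼號"))
```

Note that this only separates the two scripts. It says nothing about which of the five variants a text belongs to.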
Below is an example diff of zh-cn/zh-sg or Simplified Chinese.
Below is an example diff of zh-hk/zh-tw/zh-mo or Traditional Chinese
The notation below is used in wikitext when automatic conversion cannot be performed.
https://zh.wikipedia.org/w/index.php?title=%E6%B3%B0%E5%9D%A6%E5%B0%BC%E5%85%8B%E5%8F%B7&action=edit
-{zh-cn:'''泰坦尼克號'''; zh-tw:'''鐵達尼號''';zh-sg:'''鐵達尼'''; zh-hk:'''鐵達尼號''';}-
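As a rough sketch of how this notation could be pulled out of a revision's wikitext, the Python below extracts each `-{ ... }-` block into a variant-to-text mapping. The function name `parse_variant_markup` and both regular expressions are assumptions made for illustration. The sketch only handles the simple `variant:text;` form shown above and ignores flags, one-way rules, nested markup, and semicolons inside the text.

```python
import re

# Matches one -{ ... }- block (non-greedy, so it stops at the first "}-").
MARKUP_RE = re.compile(r"-\{(.*?)\}-", re.DOTALL)
# Matches one "zh-xx: text" rule inside a block.
RULE_RE = re.compile(r"\s*(zh(?:-[a-z]+)?)\s*:\s*(.*)", re.DOTALL)


def parse_variant_markup(wikitext):
    """Return a list of {variant: text} dicts, one per -{ ... }- block."""
    results = []
    for match in MARKUP_RE.finditer(wikitext):
        rules = {}
        for part in match.group(1).split(";"):
            rule = RULE_RE.fullmatch(part)
            if rule:
                rules[rule.group(1)] = rule.group(2).strip()
        if rules:
            results.append(rules)
    return results


if __name__ == "__main__":
    sample = "-{zh-cn:'''泰坦尼克號'''; zh-tw:'''鐵達尼號''';zh-sg:'''鐵達尼'''; zh-hk:'''鐵達尼號''';}-"
    for block in parse_variant_markup(sample):
        for variant, text in block.items():
            print(variant, text)
```

Blocks parsed this way give phrases that editors have already labelled per variant, which is the kind of signal that could be used as training data.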
The idea here is to identify the variant of the user's input in each revision, even where this notation is not used. We would of course still benefit from training our models on text that uses the notation, since it explicitly distinguishes between variants.
FYI, the hardcoded conversion table is in includes/ZhConversion.php, which may help determine the language variant of arbitrary text.
...and there are also these local tables:
https://zh.wikipedia.org/wiki/Special:PrefixIndex/MediaWiki:Conversiontable/zh
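One way such tables might help is by counting how many variant-specific phrases appear in a piece of text. The Python sketch below shows the idea with small hand-written phrase lists. These lists are stand-ins for illustration and were not extracted from ZhConversion.php or from the local tables, and the names `score_variants` and `most_likely_variant` are hypothetical.

```python
# Hand-written stand-in phrase lists (an assumption, not real table data).
VARIANT_PHRASES = {
    "zh-cn": {"软件", "网络", "泰坦尼克号"},
    "zh-tw": {"軟體", "網路", "鐵達尼號"},
}


def score_variants(text, variant_phrases=VARIANT_PHRASES):
    """Count occurrences of each variant's phrases in the text."""
    return {
        variant: sum(text.count(phrase) for phrase in phrases)
        for variant, phrases in variant_phrases.items()
    }


def most_likely_variant(text, variant_phrases=VARIANT_PHRASES):
    """Return the single best-scoring variant, or None if there is no clear winner."""
    scores = score_variants(text, variant_phrases)
    best = max(scores.values())
    if best == 0:
        return None
    winners = [variant for variant, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


if __name__ == "__main__":
    print(most_likely_variant("這個軟體需要網路"))
    print(most_likely_variant("这个软件需要网络"))
```

This is an offline heuristic for analysing revisions. Many phrases are shared between variants that use the same script, so a tie or a `None` result should be expected often.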
LanguageZh does not appear to implement guessVariant, and please do not implement it: see T191571: LanguageConverter::guessVariant should go away