
Use OpenCC for Parsoid's implementation of LangConv
Open, Needs Triage, Public

Description

Current exploration of Parsoid's implementation of LangConv uses an FST, which may not be suitable for Chinese due to its large character set (the full Unihan set runs to tens of thousands of characters). Using an FST may also make the current hard-to-maintain Chinese conversion rules even harder to maintain.

OpenCC (https://github.com/BYVoid/OpenCC) is a robust Chinese conversion tool whose author has recently restarted development, including an NPM implementation of OpenCC. It might be best to use OpenCC's conversion method instead, and have ZhReplacementMachine's convert method call OpenCC directly.
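As a rough sketch of what that delegation could look like: the constructor, bundled config names, and convertSync below are the opencc npm package's documented API, while the ZhReplacementMachine wrapper shape and the variant-to-config mapping are hypothetical.

```
import OpenCC from 'opencc';

// Map MediaWiki zh variants to OpenCC's bundled config files.
// This mapping is illustrative, not exhaustive.
const VARIANT_CONFIGS: Record<string, string> = {
  'zh-hans': 't2s.json',   // traditional -> simplified
  'zh-hant': 's2t.json',   // simplified -> traditional
  'zh-cn':   'tw2sp.json', // Taiwan trad. -> mainland simplified, with phrases
  'zh-tw':   's2twp.json', // simplified -> Taiwan trad., with phrases
  'zh-hk':   's2hk.json',  // simplified -> Hong Kong traditional
};

/** Hypothetical OpenCC-backed replacement for the FST-based machine. */
class ZhReplacementMachine {
  private converters = new Map<string, OpenCC>();

  convert(text: string, destVariant: string): string {
    const config = VARIANT_CONFIGS[destVariant];
    if (!config) {
      return text; // unknown variant: pass through unchanged
    }
    let cc = this.converters.get(config);
    if (!cc) {
      cc = new OpenCC(config); // loads the dictionaries once per config
      this.converters.set(config, cc);
    }
    return cc.convertSync(text);
  }
}

// Usage: convert simplified text to the Taiwan variant.
const machine = new ZhReplacementMachine();
console.log(machine.convert('鼠标和移动电话', 'zh-tw')); // e.g. 滑鼠和行動電話
```

Caching one converter per config keeps the dictionary-loading cost out of the per-request path.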

Overall, this is my personal opinion, and OpenCC does have its limitations. However, it may be a better and quicker solution for Chinese conversion than developing another FST, while an FST remains a better choice for transliterating other writing systems.

Event Timeline

There's also a bunch of transliteration support in libicu which I'd love to build on, instead of replacing.

One major issue w/ all of these schemes is conversion of existing content on zhwiki (and the other wikis which use transliteration). We'd need robust infrastructure to ensure that working/readable pages don't get broken by changing the transliteration scheme. That's probably a big wikilinting task.
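To give a flavor of that infrastructure, here is a minimal local sketch; convertOld and convertNew are hypothetical stand-ins for the current LanguageConverter and any candidate replacement, and no MediaWiki API is involved.

```
// Hypothetical signatures standing in for the two conversion backends.
type Converter = (wikitext: string, variant: string) => string;

interface Mismatch {
  title: string;
  variant: string;
  oldOutput: string;
  newOutput: string;
}

// Run both converters over a corpus of existing page texts and collect
// every page whose rendered variant would change under the new scheme.
function findRegressions(
  pages: Array<{ title: string; text: string }>,
  variants: string[],
  convertOld: Converter,
  convertNew: Converter
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const page of pages) {
    for (const variant of variants) {
      const oldOutput = convertOld(page.text, variant);
      const newOutput = convertNew(page.text, variant);
      if (oldOutput !== newOutput) {
        mismatches.push({ title: page.title, variant, oldOutput, newOutput });
      }
    }
  }
  return mismatches;
}
```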

The benefit of the FST approach is that we can port all of the existing transliteration tables from existing mediawiki to the new FST and get down to a very small number of differences between them. That eases migration. There are issues with building the FST due to the combination of character set size and length of match, as you've pointed out.

@cscott, I will move this to future ideas for consideration, as we will need to move away from an on-page, handwritten, rule-based system to an automatic system with a customized per-page dictionary generated from Wikidata (tools like OpenCC support custom dictionaries on top of the standard dictionary for conversion).
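As an illustration of the layering idea, here is a sketch only: the greedy pre-pass and the Wikidata-derived pageDict are hypothetical, and OpenCC's own custom-dictionary mechanism (via its config files) would be the more natural home for this.

```
import OpenCC from 'opencc';

/**
 * Layer a per-page override dictionary (e.g. generated from Wikidata
 * labels) on top of OpenCC's standard conversion. Overrides are applied
 * with greedy longest-match; everything between matches is handed to
 * OpenCC in whole runs so its own phrase dictionaries still apply.
 */
function convertWithPageDict(
  text: string,
  pageDict: Map<string, string>,
  cc: OpenCC
): string {
  const maxKeyLen = Math.max(0, ...[...pageDict.keys()].map((k) => k.length));
  let out = '';
  let pending = ''; // run of text with no page-specific override
  for (let i = 0; i < text.length; ) {
    let matchLen = 0;
    for (let len = Math.min(maxKeyLen, text.length - i); len > 0; len--) {
      if (pageDict.has(text.slice(i, i + len))) {
        matchLen = len;
        break;
      }
    }
    if (matchLen > 0) {
      if (pending) {
        out += cc.convertSync(pending); // flush the run through OpenCC
        pending = '';
      }
      out += pageDict.get(text.slice(i, i + matchLen))!;
      i += matchLen;
    } else {
      pending += text[i];
      i += 1;
    }
  }
  if (pending) out += cc.convertSync(pending);
  return out;
}

// Usage: pin a proper noun on this page while the rest converts normally.
const cc = new OpenCC('s2twp.json');
const pageDict = new Map([['中国移动', '中國移動']]);
console.log(convertWithPageDict('中国移动的移动电话业务', pageDict, cc));
```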

What worries me about the current FST implementation is that, like the old Language Converter, it does not consider the Chinese word segmentation issue. For example, the "mobile" in "mobile phone" needs to be converted to a region-specific variant, but the "mobile" in "China Mobile" does not. This is easy to miss with an FST that, as currently designed, only cares about single characters; and if we try to encode all of these compound words, we will quickly overwhelm the FST. We have manual rules precisely because the current LC cannot correctly transliterate many compound words that OpenCC handles correctly. Honestly, if we used OpenCC, we could throw away the majority, if not all, of the rules in the existing transliteration tables, meaning we would not even need to port the tables in the first place, and we would achieve the same, or even better, transliteration results.
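To make the segmentation point concrete, here is a toy comparison; the rule tables are hand-written for illustration and are not OpenCC's actual data, so only the matching strategy matters.

```
// Toy rule tables, hand-written for illustration (not OpenCC data).
const charRules: Record<string, string> = {
  '移': '移', '动': '動', '电': '電', '话': '話', '中': '中', '国': '國',
};
const phraseRules: Record<string, string> = {
  '移动电话': '行動電話', // "mobile phone": zh-TW uses a different term
  '中国移动': '中國移動', // "China Mobile": proper noun, term must not change
};

// Single-character conversion, as a character-level FST would do it.
const byChar = (s: string) =>
  [...s].map((ch) => charRules[ch] ?? ch).join('');

// Greedy longest-match: try multi-character phrases before falling
// back to single-character rules.
function byPhrase(s: string): string {
  let out = '';
  for (let i = 0; i < s.length; ) {
    const phrase = Object.keys(phraseRules).find((p) => s.startsWith(p, i));
    if (phrase) {
      out += phraseRules[phrase];
      i += phrase.length;
    } else {
      out += charRules[s[i]] ?? s[i];
      i += 1;
    }
  }
  return out;
}

console.log(byChar('移动电话'));   // 移動電話 — valid characters, wrong zh-TW term
console.log(byPhrase('移动电话')); // 行動電話 — correct regional vocabulary
console.log(byPhrase('中国移动')); // 中國移動 — proper noun left intact
```

The character-level pass produces plausible-looking traditional text that is nonetheless the wrong regional vocabulary, which is exactly the class of error the current rules exist to patch over.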

P.S. I am pessimistic about using libicu, as almost no Chinese websites use it for Chinese conversion (except for indexing), and those that do report poor performance (see https://www.dazhuanlan.com/2019/11/28/5ddf508d02021/).