
Review use of CJK vs ICU default language analyzers for "Chinese" Wikis
Open, Normal, Public

Description

In relation to T177871 (and before that T147959), we discussed reviewing the language analysis config for "other" zh-* languages. None of them were affected by the changes in fallback language usage, and it's a non-trivial question, so I'm creating a new task to be prioritized separately.

"Chinese" refers to a family of languages, not all of which are mutually intelligible. The wikis in these languages (and the writing systems they use) are listed in a table on English Wikipedia. (Note that as of the time of this writing, Classical Chinese is discussed beneath the table.)

The relevant languages don't all have codes that start with zh-, so we should look at zh-min-nan, zh-yue, and zh-classical, but also cdo, wuu, hak, and gan.

Note that several of the wikis use Latin romanization in addition to either Traditional or Simplified characters.

A quick survey of a couple dozen random articles on zh-min-nan.wikipedia.org shows that all of them are written in romanized Chinese.
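A survey like this could be made less anecdotal by counting scripts in sampled article text. The following is only a minimal sketch: the function name `script_counts` is hypothetical, it classifies characters by their Unicode names, and it assumes the article text has already been fetched as a plain string.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count alphabetic characters by rough script class (hypothetical helper).

    Classification relies on Unicode character names, so it is approximate:
    anything whose name starts with "CJK" counts as Han, anything whose name
    starts with "LATIN" counts as Latin, and everything else is "other".
    """
    counts = Counter()
    for ch in text:
        # Skip digits, punctuation, whitespace, and combining marks.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            counts["han"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample string only, not real wiki data.
    print(script_counts("Taiwan 台灣"))
```

Run over a larger random sample per wiki, a tally like this would show whether a wiki is predominantly romanized, predominantly written in Chinese characters, or mixed.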

A quick test on (Chinese) text from today's zh-yue.wikipedia.org front page shows that the CJK analyzer generates overlapping bigrams, while the ICU tokenizer generates tokens of varying lengths (1 being the most common, 14 being the longest).
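To illustrate what "overlapping bigrams" means here, the sketch below splits a run of Chinese characters into two-character tokens that each share a character with their neighbor. It is a simplified illustration, not the analyzer's actual implementation; the function names and the sample string are assumptions made for this example.

```python
from collections import Counter


def overlapping_bigrams(text):
    """Split a run of characters into overlapping two-character tokens.

    Simplified illustration of bigram tokenization: 'ABCD' -> 'AB', 'BC', 'CD'.
    A single isolated character is returned as a one-character token.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def token_length_histogram(tokens):
    """Count how many tokens there are of each length."""
    return Counter(len(token) for token in tokens)


if __name__ == "__main__":
    sample = "廣東話維基"  # illustrative string, not taken from the front page
    tokens = overlapping_bigrams(sample)
    print(tokens)
    print(token_length_histogram(tokens))
```

The same `token_length_histogram` helper could be applied to the token list produced by any other tokenizer to compare length distributions, such as the spread from 1 to 14 noted above.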

So switching to the CJK analyzer would likely be better. I'm not sure whether we should do anything for the wikis that use both Traditional and Simplified characters, though, because I'm also not sure whether the S2T conversion we use for zhwiki will do the right thing for other varieties of Chinese.

I also happened to notice that Zhuang/Vahcuengh (za) is a Tai language mostly written in Latin characters, but with a lot of Chinese characters. We should look at that, too.

Anyway, it's complicated, so it needs its own ticket!

Event Timeline

TJones created this task. Oct 10 2017, 9:20 PM
Restricted Application added subscribers: Cosine02, Aklapper. Oct 10 2017, 9:20 PM

@TJones FWIW, Cantonese is the only dialect that doesn't fall back to zh-hant or zh-hk (although there's a suggestion on twn, which I made, to make that possible).

And wouldn't it be lovely to do this not only on Wikipedias, but also on zhwiktionary, zhwikibooks, zhwikivoyage...?

TJones renamed this task from Review use of CJK vs ICU default language analyzers for "Chinese" Wikipedias to Review use of CJK vs ICU default language analyzers for "Chinese" Wikis. Oct 11 2017, 12:29 PM

And wouldn't it be lovely to do this not only on Wikipedias, but also on zhwiktionary, zhwikibooks, zhwikivoyage...?

Sorry! It's actually easier to do things for all projects in a language than just one project, so I didn't mean to imply that the other projects wouldn't also be covered. I think looking at the article called "Chinese Wikipedia" prompted me to mistype. I've edited the task title.

I'll look into the Cantonese situation. Thanks for the info.

debt triaged this task as Normal priority. Oct 12 2017, 5:08 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.
Base added a subscriber: Base. Oct 17 2017, 11:33 PM