Overrides for compounds of polyphonic characters on Chinese collations (pinyin and zhuyin)
Open, Needs Triage, Public

Description

ICU/CLDR defines only three compound overrides for pinyin, and leaves zhuyin without any:
https://github.com/unicode-org/cldr/blob/main/common/collation/zh.xml#L3882-L3885

# Compounds, add manually for now
&虫<重庆/庆 # Here 重 collates as chóng/9stk/rad166, between 虫 6stk/rad142, 崇 11stk/rad46
&弞<沈阳/阳 # Here 沈 collates as shěn/7stk/rad85, between 弞 7/stk/rad57, 审 8stk/rad40
&銺<藏文/文 # Here 藏 collates as zàng/17stk/rad140, between 銺 15stk/rad167, 臓 18stk/rad130

That is clearly not enough. Moreover, PHP's Collator implementation does not support custom rules, unlike the official ICU implementations in Java and C, so the missing compounds cannot be added as rules.

This would be a big barrier to putting these collations into use.

A workaround is to preprocess the text before sortkey generation: walk through it with an IntlBreakIterator created via IntlBreakIterator::createWordInstance( 'zh' ), and replace the polyphonic characters in predefined compounds with non-polyphonic alternatives that fall at the right place in the sorting order, e.g. 重阳 -> 虫阳, following the CLDR definition above.
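The substitution step can be sketched as follows. This is only an illustration in Python, not the proposed PHP implementation: it swaps the IntlBreakIterator word segmentation for a plain greedy longest-match scan, and the `OVERRIDES` table and the function name `preprocess_for_sortkey` are hypothetical. The table entries reuse the anchor characters from the CLDR rules quoted above.

```python
# Hypothetical override table: compound -> replacement whose first character
# is a non-polyphonic stand-in with the intended reading. The anchors
# (虫, 弞, 銺) are taken from the CLDR rules quoted in the description.
OVERRIDES = {
    "重庆": "虫庆",
    "重阳": "虫阳",
    "沈阳": "弞阳",
    "藏文": "銺文",
}

MAX_LEN = max(len(word) for word in OVERRIDES)


def preprocess_for_sortkey(text: str) -> str:
    """Replace known compounds by greedy longest match, scanning left to right."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that longer compounds win.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            replacement = OVERRIDES.get(text[i:i + length])
            if replacement is not None:
                out.append(replacement)
                i += length
                break
        else:
            # No known compound starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(preprocess_for_sortkey("重庆市"))
print(preprocess_for_sortkey("重要"))
```

A plain substring scan like this can match across word boundaries, which is presumably why the workaround relies on a word break iterator; whether ICU's segmentation isolates every compound of interest would need to be checked separately.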

Unlike custom rules, this workaround sorts the word with exactly the same weight as its alternative, so some additional sorting by strokes may be needed on the category page.
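One way to make such ties deterministic is a composite key, sketched below in Python. Everything here is an assumption for illustration: `SUBSTITUTED` is a hypothetical one-entry table, `primary_key` stands in for whatever generates the collation key of the substituted text, and the original string is used as the secondary component only because the sketch has no stroke-count data.

```python
from typing import Callable, Iterable, List

# Hypothetical substitution, as in the workaround: the polyphonic first
# character is swapped for a stand-in with the intended reading.
SUBSTITUTED = {"重阳": "虫阳"}


def sort_titles(titles: Iterable[str],
                primary_key: Callable[[str], object]) -> List[str]:
    """Sort by the key of the substituted text, then by the original text."""
    def composite(title: str):
        substituted = SUBSTITUTED.get(title, title)
        # Secondary component: the original title. A real implementation
        # might use stroke-count data here instead (not available in this sketch).
        return (primary_key(substituted), title)
    return sorted(titles, key=composite)


# With an identity primary key, both titles share the primary component
# "虫阳", so the secondary component alone decides their order.
print(sort_titles(["重阳", "虫阳"], lambda s: s))
```

Without the secondary component the two titles would compare equal and keep their input order.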

Event Timeline

I wonder if there is any ready-made ruleset that maps words starting with polyphonic characters to the correct pinyin in that context. A quick search shows that users at zh.moegirl.org.cn (already using Extension:PinyinSort) configure the collation key manually. Authors of LaTeX packages (zhmakeindex, biblatex-gb7714) suggest similar workarounds.

One candidate is the dictionaries (词库) made for pinyin IMEs. Some are available under open licenses.

That said, if infrastructure for configuring the rules were in place, a wiki community could create a ruleset instead of setting the collation key on every affected page manually (or through a template with a lookup table).