
Allow multiple collations in same site and configure zh collations
Open, MediumPublic

Assigned To
None
Authored By
tstarling
Feb 5 2013, 6:28 AM
Referenced Files
None
Tokens
"Manufacturing Defect?" token, awarded by WhitePhosphorus.
"Manufacturing Defect?" token, awarded by Liuxinyu970226.
"Manufacturing Defect?" token, awarded by Shizhao.

Description

Allow multiple collations in same site and set zh collations

  • Default 預設 默认
  • Total stroke 總筆畫數 总笔画数
  • 5-stroke 五筆 五笔
  • Radical + (strokes - radical strokes) 部首 + 剩餘筆畫數 部首
  • Hanyu Pinyin 漢語拼音 汉语拼音
  • Mandarin Phonetic Symbols 注音符號 注音符号

Review and merge chinese-collation branch into master, configure Chinese wikis to use it

Chinese collation is complex, not least because different Chinese-speaking regions have different customary collations. The Kangxi order favoured by Unicode standards is rarely used in any region, except in dictionaries.

Liangent has prepared a core branch which allows multiple category collations to coexist on a single wiki, with selectable sort order on category pages. I helped develop the architecture.
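As a rough illustration of what "multiple category collations coexisting on a single wiki" could mean, the sketch below keeps one sort key per collation for each category member and orders the listing by whichever collation is selected. The registry, the function names and the two sample collations are assumptions made for this example; they are not taken from the chinese-collation branch.

```python
from typing import Callable, Dict, List

# Hypothetical registry: collation name -> function that builds a sort key.
COLLATIONS: Dict[str, Callable[[str], str]] = {
    "identity": lambda title: title,           # raw code point order
    "uppercase": lambda title: title.upper(),  # case-insensitive order
}


def build_sortkeys(title: str) -> Dict[str, str]:
    """Compute one sort key per configured collation for a category member."""
    return {name: make_key(title) for name, make_key in COLLATIONS.items()}


def list_members(titles: List[str], collation: str) -> List[str]:
    """Order category members by the sort key of the selected collation."""
    if collation not in COLLATIONS:
        raise KeyError(f"unknown collation: {collation}")
    return sorted(titles, key=lambda t: build_sortkeys(t)[collation])


members = ["banana", "Cherry", "apple"]
by_identity = list_members(members, "identity")
by_uppercase = list_members(members, "uppercase")
```

The same three members come back in a different order depending on the selected collation, which is the behaviour the sort-order selector on category pages would expose.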

One of the collations considered essential for Chinese wikis is a Latin sort of the pinyin transliteration. We will need to upgrade to ICU 4.8 to support this collation.
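To make "a Latin sort of the pinyin transliteration" concrete, here is a minimal sketch that sorts a few single-character titles by toneless pinyin and, for contrast, by total stroke count. The two lookup tables are tiny hand-written samples for illustration only; a real deployment would rely on ICU's collation data rather than tables like these, and tone marks are ignored here.

```python
from typing import Tuple

# Toy data for five characters only (illustrative, not ICU's tables).
PINYIN = {"本": "ben", "大": "da", "人": "ren", "一": "yi", "中": "zhong"}
STROKES = {"一": 1, "人": 2, "大": 3, "中": 4, "本": 5}


def pinyin_key(text: str) -> str:
    """Transliterate each character and compare the Latin strings.

    Characters missing from the toy table fall back to themselves.
    """
    return " ".join(PINYIN.get(ch, ch) for ch in text)


def stroke_key(text: str) -> Tuple[int, ...]:
    """Compare by total stroke count per character; unknown characters sort last."""
    return tuple(STROKES.get(ch, 99) for ch in text)


titles = ["中", "一", "本", "人", "大"]
by_pinyin = sorted(titles, key=pinyin_key)
by_stroke = sorted(titles, key=stroke_key)
```

With this sample, the pinyin order is 本 大 人 一 中 while the stroke order is 一 人 大 中 本, which shows why a single site-wide collation cannot satisfy every region's custom.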

This bug tracks the tasks needed for merge of the chinese-collation branch and the deployment of a suitable multi-collation configuration on the Chinese Wikipedia.


See Also: T45799: Allow for using language-specific collations for category sorting

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:34 AM
bzimport set Reference to bz44667.

Tagging with "design" keyword, since there's a small dropdown that might use a little love.

According to Jared Zimmerman this morning, Pau did a design review of this recently and everything seemed ok (or, was going to be ok, or similar). Is that right, Liangent?

(In reply to comment #3)

According to Jared Zimmerman this morning, Pau did a design review of this recently and everything seemed ok (or, was going to be ok, or similar). Is that right, Liangent?

Right

(In reply to comment #4)

(In reply to comment #3)

According to Jared Zimmerman this morning, Pau did a design review of this recently and everything seemed ok (or, was going to be ok, or similar). Is that right, Liangent?

Right

Great, thanks for confirming, Liangent.

I'd like this bug to have the needed next steps in it; could you tell me what you think they are from your end, Liangent (and anyone else who sees this bug mail)?

Would love to know what needs to be prioritized.

We have libicu48 version 4.8.1.1-3, so presumably nothing else needs doing to that extent.

(In reply to comment #5)

Would love to know what needs to be prioritized.

The merge commit needs the various merge conflicts fixing, and also rebasing again onto core, as it's at least 10 weeks old, if not 18 weeks or so

(In reply to comment #6)

We have libicu48 version 4.8.1.1-3, so presumably nothing else needs doing to that extent.

(In reply to comment #5)

Would love to know what needs to be prioritized.

The merge commit needs the various merge conflicts fixing, and also rebasing again onto core, as it's at least 10 weeks old, if not 18 weeks or so

Alright, then I'm assuming Pau is no longer actively working on this, so assigning to Liangent but that's only because they own the merge. Would love someone else on this CC: list to take a look at that merge.

I took a quick look through the code (I was using it to make a prototype of a feature idea: http://tools.wmflabs.org/bawolff/whichisbetter ). It works well. A couple things I noticed though (Note I did not read the code in depth):

  • In Title::moveTo, the code seems to assume the cl_sortkey_prefix is the same for all collations. I do not think this is the case.
  • When running update.php, the script runs updateCollation.php before doing the schema changes from your code, instead of after. (Really I think it should run it after all extension schema changes, in case someone abuses the Collation framework to make a collation that depends on a schema change.) Arguably this issue was here before your code.
  • From a language perspective, I think using the phrase "Sorting method" instead of "collation" for the message 'category-collation' would be better and less jargony. Some of the collation names ('Identity') are a bit jargony as well, but I guess that can't really be helped. We can't exactly use the word 'alphabetical', since they're all alphabetical.
  • On category pages, <label for="mw-collation-select">Sorting method:</label> should have an id or class attribute so people could style it easily. Additionally I think it might look better with the css vertical-align: bottom.
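The first point above, about Title::moveTo, can be illustrated with a small sketch in which every category link row carries its own collation and its own sort key prefix, and a page move recomputes each row from that row's prefix instead of a shared one. The row layout, the helper names and the "prefix, newline, title" key format are assumptions for illustration and are not copied from the branch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CategoryLink:
    """Hypothetical row: one per (category, collation) pair."""
    category: str
    collation: str
    sortkey_prefix: str  # may differ from one collation to another
    sortkey: str


def make_sortkey(collation: str, prefix: str, title: str) -> str:
    # Assumption: an empty prefix sorts by title alone; otherwise the
    # key is the prefix, a newline, then the title.
    raw = f"{prefix}\n{title}" if prefix else title
    return raw.upper() if collation == "uppercase" else raw


def rekey_after_move(links: List[CategoryLink], new_title: str) -> List[CategoryLink]:
    """Rebuild every row's sort key using that row's own prefix."""
    return [
        CategoryLink(
            link.category,
            link.collation,
            link.sortkey_prefix,
            make_sortkey(link.collation, link.sortkey_prefix, new_title),
        )
        for link in links
    ]


links = [
    CategoryLink("Fruit", "identity", "", "Old"),
    CategoryLink("Fruit", "uppercase", "x", "X\nOLD"),
]
moved = rekey_after_move(links, "New")
```

Reading the prefix once and reusing it for all rows would give the second row the wrong key in this example, which is the kind of mismatch the comment warns about.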

I'd submit gerrit patches for some of these, but I'm kind of unclear how to do that/should I do that given I don't really understand how long-term feature branches in gerrit are supposed to work. Should I just submit new patches to the chinese-collation branch?

*On category pages, <label for="mw-collation-select">Sorting method:</label> should have an id or class attribute so people could style it easily.

Actually, I guess it's pretty easy to style via #mw-collation-selector label

For completeness's sake: Liangent will be meeting up with the WMF Language team at Wikimania this year to go over what needs to be done/etc for this to go out. Please feel free to continue working on this before then, but there is no set deploy target date until after Wikimania.

Pasting feedback that was given by Pau Giner on 2013-05-24 after a request from Tim Starling.

"I made a review of the UI and provided some design ideas to solve potential issues. I'm not familiar with Chinese nor Chinese collation methods, so feel free to correct me if I made any wrong assumption in my analysis:

  • The use of technical linguistic term such as "collation" although correct may be confusing to regular users. "Sorting" seems a more common term that will allow to unify sorting-related options (more on this later).
  • The control breaks the heading layout in the current position. The line of the heading appears broken. To avoid this, I would move the selector below the heading line since the action affects the elements below the header.
  • Current ordering is communicated by the list itself, so we may consider making the selector more compact (e.g., using an icon with a clarification tooltip).
  • Not sure if this was considered, but if there is a collation method that is most commonly used it should be the default. It may be also interesting to remember which is the collation method the user selected last and use it as the default for the user.

I know that the specific purpose of the extension is to support Chinese collation, but my concern is that when combining many different extensions the resulting UI gets inconsistently crowded, making it hard to access the great functionality provided by each individual extension.

To avoid this, I would propose to create a unified entry point for sorting-like functionality that can be used consistently at different parts of the UI. I made a quick mockup to illustrate the idea: http://i.imgur.com/1uZD8nF.png "

  • The control breaks the heading layout in the current position. The line of the heading appears broken. To avoid this, I would move the selector below the heading line since the action affects the elements below the header.

Just as a note, it affects the elements below the next 3 headings, not just the heading it is beside.

Not sure if this was considered, but if there is a collation method that is most commonly used it should be the default. It may be also interesting to remember which is the collation method the user selected last and use it as the default for the user.

Given that Liangent added a user preference for preferred sorting, it seems like a good idea to have a change of sorting method on the category page also update that preference. My only possible worry is that when {{DEFAULTCOLLATION:...}} is specified, the interaction between remembering the user's last choice and the collation being overridden on a per-page basis might be unclear to the user. But I think that's a minor concern.
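One way to reason about that interaction is to write the precedence down explicitly. The sketch below assumes an order of explicit selection on the page, then the per-page {{DEFAULTCOLLATION:...}} value, then the user preference, then the site default. That order is an assumption chosen for illustration; the thread does not state which order the branch implements, and the function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def resolve_collation(
    available: Sequence[str],
    site_default: str,
    user_preference: Optional[str] = None,
    page_default: Optional[str] = None,
    requested: Optional[str] = None,
) -> str:
    """Pick the collation for one category page view.

    Assumed precedence (not confirmed by the source):
    requested > page_default > user_preference > site_default.
    Candidates that are not configured on the wiki are skipped.
    """
    for candidate in (requested, page_default, user_preference, site_default):
        if candidate is not None and candidate in available:
            return candidate
    raise ValueError("no configured collation matches any candidate")


available = ["stroke", "pinyin", "zhuyin"]
plain_view = resolve_collation(available, "stroke", user_preference="pinyin")
overridden = resolve_collation(
    available, "stroke", user_preference="pinyin", page_default="zhuyin"
)
explicit = resolve_collation(
    available, "stroke", user_preference="pinyin",
    page_default="zhuyin", requested="stroke",
)
```

Under this assumed order, a user who prefers pinyin would still see a page with a per-page default sorted differently unless they pick a method explicitly, which is the potentially confusing case described above.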

250055655 wrote:

content hidden as private in Bugzilla

Reedy has just refurbished https://gerrit.wikimedia.org/r/#/c/87288/ which cherry-picks the first commit from the branch onto master, and I think he's working on the following patches.

Tim, any chance of technical/performance review from you? :)

https://gerrit.wikimedia.org/r/#/c/87288/
Tim, any chance of technical/performance review from you? :)

@tstarling: ping? ^ Or anybody else available for this task (CC'ing welcome)?

Tim is probably the only person who understands all of the collation-related code (other than Liangent, who authored the patches we're trying to get reviewed), having written most of it. Brian Wolff and I also know a fair bit (but at least my knowledge is very much limited to the areas I worked on myself to finish Tim's work on Unicode collations). Brian even left some review comments here earlier (T46667#520743).

Also, I'm afraid some work (and testing again) might be required to get the branch to merge on current master.

Krenair renamed this task from Review, merge and deploy chinese-collation branch to Review and merge chinese-collation branch into master, configure Chinese wikis to use it. Sep 1 2015, 3:20 PM
Krenair set Security to None.
Krenair moved this task from Backlog to Blocked on development on the Wikimedia-Site-requests board.
Nikerabbit added subscribers: cscott, Nikerabbit.

Edit after completely misreading the task (collation, not converter): The Language team has not worked on collations either. I know there have been some changes to collation recently which may make it harder to rebase the patch. I'm wondering whether this could be discussed at the developer summit?

@Nikerabbit Other than the Language Team, who do you propose should be the owner of such code? (Please note: many teams 'own' code they haven't written.)

@greg I would like to second what @Nikerabbit has proposed i.e. talking about code ownership of similar projects during the DevSummit. It is natural to assume that the Language-team may be the best choice to take a decision, but we are currently operating with a different set of priorities and also perhaps lack expertise to work on this immediately.

@Arrbee right right. A talk about code ownership is a fine topic for the Dev Summit, and it fits well within the "technical debt" topic area (https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/How_to_manage_our_technical_debt). Would you be willing to make a proposal that might help you with this topic, generally?

@greg I created a barebones proposal at T147171. I was originally thinking of an unconference session, but either works. It would be great if someone other than me could add more information about specifics and problem statements. Thanks.

@liangent: Hi, I'm resetting the task assignee due to inactivity. Please feel free to reclaim this task if you plan to work on this - it would be welcome! Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for more information - thanks!

I'm willing to review the collation code if/when @tstarling and/or @liangent get these patches rebased and current again.

Change 87273 had a related patch set uploaded (by Winston Sung; author: Liangent):

[mediawiki/core@master] Make zh@collation=pinyin and zh@collation=stroke collations usable

https://gerrit.wikimedia.org/r/87273

I want to mention https://github.com/nbdd0121/MW-PinyinSort, not sure which one works better, the patch here or the existing extension.

I want to mention https://github.com/nbdd0121/MW-PinyinSort, not sure which one works better, the patch here or the existing extension.

I think the current patchset won't work.

Winston_Sung renamed this task from Review and merge chinese-collation branch into master, configure Chinese wikis to use it to Allow multiple collations in same site and set zh collations (default, 5-stroke, radical + strokes, Hanyu Pinyin, Mandarin Phonetic Symbols). Feb 26 2023, 11:21 AM
Winston_Sung renamed this task from Allow multiple collations in same site and set zh collations (default, 5-stroke, radical + strokes, Hanyu Pinyin, Mandarin Phonetic Symbols) to Allow multiple collations in same site and configure zh collations.
Winston_Sung updated the task description.