Review and merge chinese-collation branch into master, configure Chinese wikis to use it
Open, Normal, Public

Description

Chinese collation is complex, not least because different Chinese-speaking regions have different customary collations. The KangXi order favoured by Unicode standards is rarely used in any region, except in dictionaries.

Liangent has prepared a core branch which allows multiple category collations to coexist on a single wiki, with selectable sort order on category pages. I helped develop the architecture.

One of the collations considered essential for Chinese wikis is a Latin sort of the pinyin transliteration. We will need to upgrade to ICU 4.8 to support this collation.

This bug tracks the tasks needed to merge the chinese-collation branch and deploy a suitable multi-collation configuration on the Chinese Wikipedia.
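
The multi-collation idea in the description can be illustrated with a minimal Python sketch. It is not the branch's code: the `PINYIN` table is a tiny hand-written stand-in for real ICU pinyin collation data, and the names `COLLATIONS`, `build_sortkeys` and `sort_category` are hypothetical. It only shows how one category member can carry a sort key per collation, with the category page choosing which key to order by.

```python
# Hypothetical stand-in for ICU pinyin data; a real wiki would use ICU itself.
PINYIN = {
    "北京": "beijing",
    "广州": "guangzhou",
    "上海": "shanghai",
    "天津": "tianjin",
}


def identity_key(title: str) -> str:
    """Sort by raw code points (what an 'identity' collation does)."""
    return title


def pinyin_key(title: str) -> str:
    """Sort by the Latin spelling of the pinyin transliteration."""
    return PINYIN.get(title, title)


# Registry of collations that coexist on the wiki (names are assumptions).
COLLATIONS = {"identity": identity_key, "pinyin": pinyin_key}


def build_sortkeys(title: str) -> dict:
    """Compute one sort key per registered collation for a category member."""
    return {name: make_key(title) for name, make_key in COLLATIONS.items()}


def sort_category(members: list, collation: str) -> list:
    """Order category members by the sort key of the selected collation."""
    rows = [(build_sortkeys(title), title) for title in members]
    return [title for keys, title in sorted(rows, key=lambda row: row[0][collation])]


members = ["上海", "北京", "天津", "广州"]
by_pinyin = sort_category(members, "pinyin")
by_identity = sort_category(members, "identity")
```

With this toy data the two collations give different orders for the same four titles, which is why a per-page selector is useful.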


See Also: T45799: Allow for using language-specific collations for category sorting

Details

Reference
bz44667

Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 1:34 AM
bzimport set Reference to bz44667.
tstarling created this task. Feb 5 2013, 6:28 AM

Tagging with "design" keyword, since there's a small dropdown that might use a little love.

greg added a comment. Jun 14 2013, 7:41 PM

According to Jared Zimmerman this morning, Pau did a design review of this recently and everything seemed ok (or, was going to be ok, or similar). Is that right, Liangent?

(In reply to comment #3)

According to Jared Zimmerman this morning, Pau did a design review of this recently and everything seemed ok (or, was going to be ok, or similar). Is that right, Liangent?

Right

greg added a comment. Jun 15 2013, 11:39 PM

(In reply to comment #4)

(In reply to comment #3)

According to Jared Zimmerman this morning, Pau did a design review of this recently and everything seemed ok (or, was going to be ok, or similar). Is that right, Liangent?

Right

Great, thanks for confirming, Liangent.

I'd like this bug to have the needed next steps in it; could you tell me what you think they are from your end, Liangent (and anyone else who sees this bug mail)?

Would love to know what needs to be prioritized.

Reedy added a comment. Jun 27 2013, 9:53 PM

We have libicu48 version 4.8.1.1-3, so presumably nothing else needs doing to that extent.

(In reply to comment #5)

Would love to know what needs to be prioritized.

The merge commit needs the various merge conflicts fixing, and also rebasing again onto core, as it's at least 10 weeks old, if not 18 weeks or so

greg added a comment. Jun 27 2013, 11:20 PM

(In reply to comment #6)

We have libicu48 version 4.8.1.1-3, so presumably nothing else needs doing to that extent.

(In reply to comment #5)

Would love to know what needs to be prioritized.

The merge commit needs the various merge conflicts fixing, and also rebasing again onto core, as it's at least 10 weeks old, if not 18 weeks or so

Alright, then I'm assuming Pau is no longer actively working on this, so assigning to Liangent but that's only because they own the merge. Would love someone else on this CC: list to take a look at that merge.

I took a quick look through the code (I was using it to make a prototype of a feature idea: http://tools.wmflabs.org/bawolff/whichisbetter ). It works well. A couple things I noticed though (Note I did not read the code in depth):

* In Title::moveTo, the code seems to assume the cl_sortkey_prefix is the same for all collations. I do not think this is the case.
* When running update.php, the script runs updateCollation.php before doing the schema changes from your code, instead of after. (Really I think it should run it after all extension schema changes, in case someone abuses the Collation framework to make a collation that depends on a schema change.) Arguably this issue was here before your code.
* From a language perspective, I think using the phrase "Sorting method" instead of "collation" for the message 'category-collation' would be better and less jargony. Some of the collation names ('Identity') are a bit jargony as well, but I guess that can't really be helped. We can't exactly use the word 'alphabetical', since they're all alphabetical.
* On category pages, <label for="mw-collation-select">Sorting method:</label> should have an id or class attribute so people could style it easily. Additionally, I think it might look better with the CSS vertical-align: bottom.
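
The ordering suggested in the update.php point above can be sketched as follows. This is a plain Python illustration, not MediaWiki's updater: `run_updates` and the step labels are hypothetical, and the only thing it demonstrates is that the collation update runs after every core and extension schema change.

```python
from typing import Callable, Iterable, List

# A step is any callable that performs its work and returns a log label.
Step = Callable[[], str]


def run_updates(
    core_schema: Iterable[Step],
    extension_schema: Iterable[Step],
    collation_update: Step,
) -> List[str]:
    """Run all schema changes first, then the collation update last."""
    log = [step() for step in core_schema]
    log += [step() for step in extension_schema]
    # Only now is it safe for a collation to rely on the new schema.
    log.append(collation_update())
    return log


# Hypothetical step labels, for illustration only.
log = run_updates(
    core_schema=[lambda: "core schema change"],
    extension_schema=[lambda: "extension schema change"],
    collation_update=lambda: "updateCollation",
)
```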

I'd submit gerrit patches for some of these, but I'm kind of unclear how to do that/should I do that given I don't really understand how long-term feature branches in gerrit are supposed to work. Should I just submit new patches to the chinese-collation branch?

*On category pages, <label for="mw-collation-select">Sorting method:</label> should have an id or class attribute so people could style it easily.

Actually, I guess it's pretty easy to style via #mw-collation-selector label

greg added a comment. Jul 11 2013, 5:19 PM

For completeness's sake: Liangent will be meeting up with the WMF Language team at Wikimania this year to go over what needs to be done/etc for this to go out. Please feel free to continue working on this before then, but there is no set deploy target date until after Wikimania.

Pasting feedback that was given by Pau Giner on 2013-05-24 after a request from Tim Starling.

"I made a review of the UI and provided some design ideas to solve potential issues. I'm not familiar with Chinese nor Chinese collation methods, so feel free to correct me if I made any wrong assumption in my analysis:

  • The use of a technical linguistic term such as "collation", although correct, may be confusing to regular users. "Sorting" seems a more common term that would allow us to unify sorting-related options (more on this later).
  • The control breaks the heading layout in the current position. The line of the heading appears broken. To avoid this, I would move the selector below the heading line since the action affects the elements below the header.
  • Current ordering is communicated by the list itself, so we may consider making the selector more compact (e.g., using an icon with a clarification tooltip).
  • Not sure if this was considered, but if there is a collation method that is most commonly used, it should be the default. It may also be interesting to remember which collation method the user selected last and use it as the default for that user.

I know that the specific purpose of the extension is to support Chinese collation, but my concern is that when combining many different extensions the resulting UI gets inconsistently crowded, making it hard to access the great functionality provided by each individual extension.

To avoid this, I would propose to create a unified entry point for sorting-like functionality that can be used consistently at different parts of the UI. I made a quick mockup to illustrate the idea: http://i.imgur.com/1uZD8nF.png "

  • The control breaks the heading layout in the current position. The line of the heading appears broken. To avoid this, I would move the selector below the heading line since the action affects the elements below the header.

Just as a note, it affects the elements below the next 3 headings, not just the heading it is beside.

Not sure if this was considered, but if there is a collation method that is most commonly used it should be the default. It may be also interesting to remember which is the collation method the user selected last and use it as the default for the user.

Given that Liangent added a user preference for preferred sorting, it seems like a good idea to have a change of sorting method on a category page also update that preference. The only possible worry I would have is that, when {{DEFAULTCOLLATION:...}} is specified, the interaction between remembering the user's last choice and the collation being overridden on a per-page basis might be unclear to the user. But I think that's a minor concern.
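
To make the interaction concrete, here is a small Python sketch of one possible precedence order. The order itself (explicit selection, then the page's {{DEFAULTCOLLATION:...}} value, then the user preference, then the site default) is an assumption for illustration, not something this thread or the branch settles, and `resolve_collation` is a hypothetical name.

```python
from typing import Optional, Set


def resolve_collation(
    available: Set[str],
    site_default: str,
    requested: Optional[str] = None,
    page_default: Optional[str] = None,
    user_pref: Optional[str] = None,
) -> str:
    """Pick the collation for a category page view.

    Assumed precedence: selector choice, per-page default,
    stored user preference, then the site-wide default.
    """
    for candidate in (requested, page_default, user_pref):
        # Ignore unset values and collations the wiki does not offer.
        if candidate is not None and candidate in available:
            return candidate
    return site_default


available = {"identity", "pinyin", "stroke"}

# The page override wins over the remembered user preference...
overridden = resolve_collation(
    available, "identity", page_default="stroke", user_pref="pinyin"
)
# ...unless the reader picks something in the selector for this view.
explicit = resolve_collation(
    available, "identity", requested="pinyin", page_default="stroke"
)
# An unknown stored preference falls back to the site default.
fallback = resolve_collation(available, "identity", user_pref="zhuyin")
```

Under this assumed order, a reader whose remembered choice is pinyin would still see a page with a per-page default sorted differently, which is the potentially confusing case mentioned above.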

What's the progress on this?

250055655 wrote:

content hidden as private in Bugzilla

Reedy has just refurbished https://gerrit.wikimedia.org/r/#/c/87288/ which cherry-picks the first commit from the branch onto master, and I think he's working on the following patches.

Tim, any chance of technical/performance review from you? :)

https://gerrit.wikimedia.org/r/#/c/87288/
Tim, any chance of technical/performance review from you? :)

@tstarling: ping? ^ Or anybody else available for this task (CC'ing welcome)?

Language-Team: Any idea who could review/+2 this?

Restricted Application added a subscriber: Matanya. Jul 23 2015, 1:59 PM

Tim is probably the only person who understands all of the collation-related code (other than Liangent, who authored the patches we're trying to get reviewed), having written most of it. Brian Wolff and I also know a fair bit (but at least my knowledge is very much limited to the areas I worked on myself to finish Tim's work on Unicode collations). Brian even left some review comments here earlier (T46667#520743).

Also, I'm afraid some work (and testing again) might be required to get the branch to merge on current master.

Krenair renamed this task from Review, merge and deploy chinese-collation branch to Review and merge chinese-collation branch into master, configure Chinese wikis to use it. Sep 1 2015, 3:20 PM
Krenair set Security to None.
Krenair moved this task from Backlog to Blocked on development on the Wikimedia-Site-requests board.
Restricted Application added a subscriber: JEumerus. Jan 28 2016, 2:26 AM
Danny_B removed a subscriber: Language-Team.
Nikerabbit removed a project: Language-Team. Edited Sep 28 2016, 2:28 PM
Nikerabbit added subscribers: cscott, Nikerabbit.

Edit after completely misreading the task (collation, not converter): Language team has not worked on collations either. I know there have been some changes to collation recently which may make it harder to rebase the patch. I'm wondering whether this could be discussed at the developer summit?

Restricted Application added a subscriber: Cosine02. Sep 28 2016, 2:28 PM
Nikerabbit updated the task description. Sep 28 2016, 2:32 PM
greg added a comment. Sep 29 2016, 4:52 PM

@Nikerabbit Other than the Language Team, who do you propose should be the owner of such code? (Please note: many teams 'own' code they haven't written.)

Arrbee added a subscriber: Arrbee. Sep 29 2016, 5:33 PM

@greg I would like to second what @Nikerabbit has proposed i.e. talking about code ownership of similar projects during the DevSummit. It is natural to assume that the Language-team may be the best choice to take a decision, but we are currently operating with a different set of priorities and also perhaps lack expertise to work on this immediately.

greg added a comment. Sep 29 2016, 7:35 PM

@Arrbee right right. A talk about code ownership is a fine topic for the Dev Summit, and it fits well within the "technical debt" topic area (https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/How_to_manage_our_technical_debt). Would you be willing to make a proposal that might help you with this topic, generally?

@greg I created a barebones proposal at T147171. I was originally thinking of an unconference session, but either works. It would be great if someone other than me could add more information about specifics and problem statements. Thanks.

Cwek added a subscriber: Cwek. Nov 18 2016, 8:57 AM
mxn added a subscriber: mxn. Nov 10 2018, 10:18 PM