
Allow for using language-specific collations for category sorting
Closed, Resolved, Public

Description

We should either bundle or generate on-demand first-letters-XX.ser files.

ICUCollation itself, as well as PHP's collation, accepts language strings like 'sv' or 'pl' just fine. However, the lack of a corresponding first-letters file causes a cryptic exception.

If the 'root' file is copied, sorting works correctly for the given language; only the headings are incorrect (they are the default ones, which do not take letters with diacritics like Ø or Ą into account).
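
The "copy the root file" workaround can be expressed as an explicit fallback. The following is a minimal sketch in Python rather than MediaWiki's PHP; the function name `first_letters_path` and the directory layout are hypothetical, and only the `first-letters-XX.ser` naming comes from this report. It is not what ICUCollation actually does.

```python
from pathlib import Path


def first_letters_path(data_dir, lang):
    """Return the per-language first-letters file, or the root file if none exists.

    Hypothetical helper: falls back to first-letters-root.ser instead of
    failing when first-letters-<lang>.ser is missing.
    """
    data_dir = Path(data_dir)
    candidate = data_dir / f"first-letters-{lang}.ser"
    if candidate.is_file():
        return candidate
    root = data_dir / "first-letters-root.ser"
    if not root.is_file():
        # Fail with a message that names the language and the directory searched.
        raise FileNotFoundError(
            f"no first-letters data for '{lang}' and no root file in {data_dir}"
        )
    return root
```

With such a fallback, sorting would still follow the requested language, but the headings would be the root ones, as described above.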

The files can probably depend on Unicode tailoring data to add the additional letters used to create subheadings in categories. http://developer.mimer.com/charts/tailorings.htm looks like a good starting point.


Version: 1.21.x
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=44667
https://bugzilla.wikimedia.org/show_bug.cgi?id=45522

Details

Reference
bz43799

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 1:31 AM
bzimport set Reference to bz43799.

I've been reading up on the collation stuff (specifically UTS #10 and UTS #35). It seems that the best course of action, instead of generating huge first-letters files for every locale (that approach would probably need about 200 such files), is to use the root first letters as a base. Then, for a specific locale, take the index exemplar characters for that locale (from CLDR). If the thing we are sorting falls between the first and last index letter, we use the index letter as the first-letter header; otherwise we use the info from first-letters-root.ser.

This would probably be best accomplished by merging the index letters with the root first letters during the sorting step in ICUCollation that happens just before things get cached.
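
The merge-then-look-up idea from this comment can be sketched roughly as follows. This is an illustration in Python, not the MediaWiki code: `sv_sort_key` is a toy stand-in for a real ICU sort key (it only knows that Swedish sorts å, ä, ö after z), the 26-letter root list is a simplification of the real root data, and all function names are hypothetical.

```python
from bisect import bisect_right

# Toy stand-in for an ICU collator: Swedish sorts å, ä, ö after z.
SV_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def sv_sort_key(text):
    """Hypothetical sort key: per-character position in the Swedish alphabet."""
    return [
        SV_ALPHABET.index(ch) if ch in SV_ALPHABET else len(SV_ALPHABET) + ord(ch)
        for ch in text.lower()
    ]


def merge_first_letters(root_letters, index_letters, sort_key):
    """Merge locale index letters into the root letters, ordered by sort key."""
    merged = {}
    for letter in list(root_letters) + list(index_letters):
        # Letters with an identical sort key are kept only once (first wins).
        merged.setdefault(tuple(sort_key(letter)), letter)
    return [merged[key] for key in sorted(merged)]


def first_letter(title, letters, sort_key):
    """Return the last heading letter that sorts at or before the title."""
    keys = [sort_key(letter) for letter in letters]
    pos = bisect_right(keys, sort_key(title))
    return letters[pos - 1] if pos else title[:1]


root = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
swedish_index = list("ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ")
merged = merge_first_letters(root, swedish_index, sv_sort_key)
print(first_letter("Östersund", merged, sv_sort_key))  # Ö
print(first_letter("Östersund", root, sv_sort_key))    # Z (root-only heading)
```

The second call shows the symptom from the description: with only the root letters, a title starting with Ö sorts in the right place but lands under the wrong heading.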

Then whitelist those languages first. Chinese doesn't work in this way AFAIK.

(In reply to comment #2)

Then whitelist those languages first. Chinese doesn't work in this way AFAIK.

Yes of course. We would definitely need testing here to see where this approach works and where it doesn't.

Removing the "Bundle or generate on-demand first-letters-XX.ser files" part from the summary, as on second thought this does not seem to be the best way to do it.

We should probably just store collation tailorings as "adjustments" to the -root file.
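
One possible shape for such "adjustments" is a small per-language table applied on top of the root letters. The sketch below is in Python and purely illustrative: the names `ROOT_LETTERS`, `TAILORING_ADDITIONS` and `letters_for` are hypothetical, the root list is a toy 26-letter stand-in, and the table is not the data from the actual patch.

```python
# Toy stand-in for the letters stored in the -root file.
ROOT_LETTERS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# Illustrative per-language additions; not the data from the actual patch.
TAILORING_ADDITIONS = {
    "sv": ["Å", "Ä", "Ö"],
    "pl": ["Ą", "Ć", "Ę", "Ł", "Ń", "Ó", "Ś", "Ź", "Ż"],
}


def letters_for(lang):
    """Return the root letters plus any additions recorded for lang (unsorted)."""
    extra = TAILORING_ADDITIONS.get(lang, [])
    return ROOT_LETTERS + [letter for letter in extra if letter not in ROOT_LETTERS]
```

The returned list is not in collation order; it would still need to be sorted with the language's collator before being used for headings. A language without an entry simply gets the root letters.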

Marking this as fixed.

While there is still a lot that could be done, this patch provides basic and pretty solid support for 67 languages using the Latin, Cyrillic and Greek alphabets.

Similar bug about Chinese collations: bug 44667.

(In reply to comment #6)

While there is still a lot that could be done, this patch provides basic and
pretty solid support for 67 languages using the Latin, Cyrillic and Greek
alphabets.

It would be very nice if you could add to [[mw:MediaWiki 1.21]] a new section explaining what concretely changes effective now, and what else needs to be done on future releases or by local installations.
The release notes and commit message don't say anything clear, and http://thread.gmane.org/gmane.science.linguistics.wikipedia.technical/67769 mentions the need for 1) additional deployments, 2) configurations, 3) changes that look like MessagesXx.php variables and 4) maintenance/updateCollation.php, all mixed up, so I'm rather confused.
This looks like a big improvement, so we need to involve many more people for the follow-ups.

250055655 wrote:

content hidden as private in Bugzilla
