
Improve sorting on Swedish Wikipedia
Open, Needs Triage, Public

Description

The UCA sorting used on Swedish Wikipedia works well for most letters. But there are some exceptions:

  • Dotless lowercase "ı" sorts as a separate (lowercase) letter, distinct from capital (dotless) "I". It would sort better as a variant of "I", as dotted uppercase "İ" and lowercase "i" already do. (In Swedish, as in English, the normal I is dotless in uppercase and dotted in lowercase.)
  • Ligatures "Œ" and "Æ" are sorted as "Ö" and "Ä". While this is sometimes used in Swedish sorting (especially for Æ in Norwegian names), big encyclopaedias like Nationalencyklopedin and Nordisk familjebok sort the ligatures as the letters they combine: "OE" and "AE".
  • "Ę" sorts as "Ä". It would be better to sort it as a variant of "E" (like "É" or "È").
  • "Ô" sorts as "Ö". It is used almost exclusively in foreign (French-language) names, so it would be better sorted as a variant of "O".
  • Some other rarely used letters with their own headers in sv:Kategori:Sidor med specialtecken som titel look like they should be sorted as variants of other letters.
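
To make the requested ordering concrete, here is a rough sketch in plain Python. It is not UCA or ICU and does not reflect how MediaWiki builds sortkeys; the names `VARIANT_OVERRIDES`, `base_letters` and `sort_key` are made up for illustration, and the sample titles are invented. It only shows the idea of treating "ı", "Œ", "Æ", "Ę" and "Ô" as variants of other letters while keeping "Å", "Ä" and "Ö" as separate letters after "Z".

```python
import unicodedata

# Swedish alphabet order: a-z followed by å, ä, ö as separate primary letters.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
PRIMARY_RANK = {ch: rank for rank, ch in enumerate(SWEDISH_ALPHABET)}

# Hypothetical overrides for characters that have no canonical decomposition.
VARIANT_OVERRIDES = {"ı": "i", "œ": "oe", "æ": "ae"}


def base_letters(ch):
    """Return the base letter(s) that a single character should sort under."""
    lower = ch.lower()
    if lower in PRIMARY_RANK:
        return lower
    if lower in VARIANT_OVERRIDES:
        return VARIANT_OVERRIDES[lower]
    # Strip combining marks so that ę -> e, ô -> o, İ -> i.
    decomposed = unicodedata.normalize("NFD", lower)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def sort_key(title):
    """Primary key from base letters; the title itself breaks ties."""
    title = unicodedata.normalize("NFC", title)
    primary = []
    for ch in title:
        for base in base_letters(ch):
            if base in PRIMARY_RANK:
                primary.append((1, PRIMARY_RANK[base]))
            else:
                # Anything outside the alphabet sorts before the letters.
                primary.append((0, ord(base)))
    return (tuple(primary), title)


# Invented sample titles, one per problem character plus some controls.
titles = ["Östersund", "Ôtre", "Oslo", "Œuvre", "Ärla", "Ęcki", "Eda", "Ånge"]
print(sorted(titles, key=sort_key))
```

With this key, "Ęcki" lands among the E titles and "Œuvre" and "Ôtre" among the O titles, while "Ånge", "Ärla" and "Östersund" stay at the end of the alphabet. A real fix would have to express the same thing as a collation tailoring rather than a hand-written key.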

Event Timeline

Lejonel raised the priority of this task from to Needs Triage.
Lejonel updated the task description. (Show Details)
Lejonel subscribed.
Krenair subscribed.

I notice Swedish Wikisource is using a slightly different collation - https://phabricator.wikimedia.org/T48058#479784 has some info about the differences. I don't know how relevant it is to your suggestions.

Aklapper renamed this task from Sorting on Swedish Wikipedia to Improve sorting on Swedish Wikipedia. Jan 31 2016, 11:23 PM
Aklapper set Security to None.

We unfortunately don't have the greatest support for customizing the collations. These changes might be hard to do.

What about working with upstream CLDR to improve the uca-sv collation?

> What about working with upstream CLDR to improve the uca-sv collation?

Well, I am sure that upstream would appreciate feedback. Keep in mind that changes upstream probably won't be adopted by Wikimedia for a very long time: changing our version of libicu requires regenerating all sortkeys, so we generally don't upgrade very often. The bug tracker for CLDR issues is: http://unicode.org/cldr/trac/newticket

Making changes to the upstream PHP/HHVM intl extension to allow tailoring of collations would probably be cool too.

> I notice Swedish Wikisource is using a slightly different collation - https://phabricator.wikimedia.org/T48058#479784 has some info about the differences. I don't know how relevant it is to your suggestions.

I just double-checked the differences on that: the only difference is the behaviour of V and W. collation=standard and collation=reformed treat all other characters the same way.