
Investigation: Figure out how to switch as many wikis as possible over to uca-xx collations
Closed, Resolved | Public | 3 Estimated Story Points

Description

In order to tackle T47443, we should see if it's possible to switch all wikis over to uca-xx collations en masse, rather than doing them one at a time.

Questions:

  • Are there collation settings available for all the wiki languages?
  • Are there any that we need to explicitly not do (like German)?
  • What are the steps involved in updating the collations for all the wikis? (does not include switching to numerical sorting yet)

Event Timeline

DannyH triaged this task as Medium priority. May 24 2016, 5:28 PM
DannyH edited projects, added Community-Tech-Sprint; removed Community-Tech.
DannyH set the point value for this task to 3.
kaldari raised the priority of this task from Medium to Needs Triage. May 24 2016, 5:28 PM
kaldari triaged this task as Medium priority.

It looks like ICU doesn't have support for all wiki languages, but there is a "uca-default" collation as well.

For a collation to work in MediaWiki, it must exist in the CLDR dataset (http://unicode.org/cldr/trac/browser/trunk/common/collation). It must also have made its way from CLDR into the ICU libraries, and from there into PHP's intl extension (which could take a few years). Luckily, there is a snapshot of the CLDR list from 2012 that we can refer to: http://stackoverflow.com/questions/9422553/list-of-available-collators-in-php. The languages on that list are almost certainly supported by PHP for localized collation.

  • Are there collation settings available for all the wiki languages?

No, only languages listed at http://stackoverflow.com/questions/9422553/list-of-available-collators-in-php are likely to be supported. However, other languages should be able to use "uca-default".

  • Are there any that we need to explicitly not do (like German)?

There's no way to know this without asking each project separately. For most languages directly supported by PHP collation, switching to uca-xx should be a clear improvement. It is not clear, however, whether switching unsupported languages to uca-default will always be an improvement. At the very least, it will let those projects use natural number sorting (T8948).

  • What are the steps involved in updating the collations for all the wikis? (does not include switching to numerical sorting yet)

If we want to switch them all, we would change wgCategoryCollation.default from "uppercase" to "uca-default" in InitializeSettings.php. Any projects that want to opt out (like German) would then need to be explicitly set to "uppercase". Languages that are directly supported by PHP's intl collation should be set to uca-xx (where xx is the language code).
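
To make the proposed layout concrete, here is a rough Python sketch of the resulting per-wiki map. It only models the logic described above and is not the actual InitializeSettings.php code. The language sets, the wiki names, and the helper names (build_collation_settings, collation_for) are hypothetical and chosen for illustration.

```python
# Illustrative sample only. The real set of supported languages would come
# from the list of collators linked earlier in this thread.
SUPPORTED_LANGUAGES = {"sv", "pl", "fr"}

# Languages whose wikis asked to keep the old behaviour (German is the
# example given in this task).
OPT_OUT_LANGUAGES = {"de"}


def build_collation_settings(wiki_languages):
    """Build a map shaped like a wgCategoryCollation entry.

    wiki_languages maps a wiki database name to its language code.
    Wikis that are neither opted out nor directly supported get no
    explicit entry and therefore inherit the "default" value.
    """
    settings = {"default": "uca-default"}
    for wiki, lang in sorted(wiki_languages.items()):
        if lang in OPT_OUT_LANGUAGES:
            settings[wiki] = "uppercase"
        elif lang in SUPPORTED_LANGUAGES:
            settings[wiki] = f"uca-{lang}"
    return settings


def collation_for(settings, wiki):
    """Look up the collation for one wiki, falling back to the default."""
    return settings.get(wiki, settings["default"])
```

With this shape, a wiki in an unsupported language needs no entry of its own: collation_for falls through to "uca-default", while an opted-out wiki resolves to "uppercase".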

The next step would be to ask on Wikitech-l about switching all wikis to uca-default by default.

I just left a note at T32996#2336528 regarding the English Wikipedia proposal specifically, though a similar question applies to other wikis.

I also created https://meta.wikimedia.org/wiki/Collation.

(The Community-Tech project was removed in T136113#2323690, so I'm not sure why it was re-added to this task.)

Community-Tech-Sprint is basically a sub-board of Community-Tech. No need to include it on both boards.

Questions:

  • Are there collation settings available for all the wiki languages?

No, collations are only available for the languages listed at P3231.

  • Are there any that we need to explicitly not do?

Yes, German wikis should be excluded until they have decided how to deal with their existing defaultsort hacks.

  • What are the steps involved in updating the collations for the wikis? (does not include switching to numerical sorting yet)

First, support for the language must be added to IcuCollation::$tailoringFirstLetters. Even if no changes are needed, an array for the language must exist there. The array for the language should include all characters that are tailored with primary weight differences (refer to files in https://ssl.icu-project.org/trac/browser/icu/trunk/source/data/coll). After that is deployed, the wiki's collation can be changed to a uca-xx collation by setting it in the $wgCategoryCollation global variable in wmf-config/InitializeSettings.php (operations/mediawiki-config repo).
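
The ordering constraint above (tailoring entry first, config change second) can be sketched as a simple precondition check. This is a Python model of the logic, not MediaWiki code. The dictionary stands in for IcuCollation::$tailoringFirstLetters, its contents are an illustrative assumption, and switch_collation is a hypothetical helper.

```python
# Stand-in for IcuCollation::$tailoringFirstLetters (illustrative contents).
# A language with no primary-weight tailorings still needs an (empty) entry.
TAILORING_FIRST_LETTERS = {
    "sv": ["Å", "Ä", "Ö"],
    "en": [],
}


def switch_collation(settings, wiki, lang, tailoring=TAILORING_FIRST_LETTERS):
    """Return a copy of settings with the wiki switched to uca-<lang>.

    Raises ValueError if the language has no tailoring entry yet, since
    that entry has to be added and deployed before the config change.
    """
    if lang not in tailoring:
        raise ValueError(f"no tailoring entry for language {lang!r}")
    updated = dict(settings)
    updated[wiki] = f"uca-{lang}"
    return updated
```

Note that the check is on the presence of the key, not on whether the list is non-empty, which mirrors the point that an array must exist even if no changes are needed.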

Community-Tech-Sprint is basically a sub-board of Community-Tech. No need to include it on both boards.

Hard to find all CT tasks then...