Make the "uca-xx" category collations the default during installation (with fallbacks)
Open, Medium, Public, 3 Estimated Story Points

Description

We should add support for language-specific UCA collations to the installer and make them the default for non-English languages. For example, if someone sets their wiki's language to French in the first step of the installer and the intl extension is installed, the installer should set the collation to "uca-fr-u-kn". If the intl extension is not installed, it should set the collation to "numeric". The installer will need to check whether the language is supported by MediaWiki's IcuCollation class (see https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation#Language-specific_collations); if it is not, it should use "uca-default-u-kn".
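The fallback chain in this description could be sketched roughly as follows. This is only an illustration in Python, not installer code: the function name `default_collation`, the `has_intl` flag, and the small `SUPPORTED_LANGUAGES` set are hypothetical stand-ins (the real list of supported languages lives in MediaWiki's IcuCollation class), and the sketch does not model any special treatment for English.

```python
# Hypothetical, abbreviated stand-in for the languages that IcuCollation
# has tailorings for; the real list is maintained in MediaWiki itself.
SUPPORTED_LANGUAGES = frozenset({"fr", "de", "pl", "sv"})


def default_collation(language_code: str, has_intl: bool) -> str:
    """Pick a default category collation using the fallbacks in the description."""
    if not has_intl:
        # Without the intl extension no UCA collation can be used.
        return "numeric"
    if language_code in SUPPORTED_LANGUAGES:
        # Language-specific UCA collation with numeric sorting ("-u-kn").
        return f"uca-{language_code}-u-kn"
    # intl is available, but there is no tailoring for this language.
    return "uca-default-u-kn"


if __name__ == "__main__":
    print(default_collation("fr", has_intl=True))
    print(default_collation("fr", has_intl=False))
    print(default_collation("xx", has_intl=True))
```

Checking for intl first keeps the "numeric" fallback independent of the language list, so a wiki without intl never ends up with a "uca-" value it cannot use.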

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:38 AM
bzimport set Reference to bz45611.
bzimport added a subscriber: Unknown Object (MLST).

CC-ing everyone who filed/commented on the bugs about deploying uca-xx collations on Wikimedia wikis. Any input, guys/gals? :)

I would go with making it selectable in the installer (maybe even making it the default). However, there are two issues I would like to see fixed before doing that:

  1. the issue with prefix collisions in the first-letters-root.ser file (there is a bug for this)
  2. I think we should probably add a cl_collation_version field to the categorylinks table. If someone upgrades PHP, the UCA version changes and everything breaks. At the very least, update.php should fix this. At the moment one needs to run updateCollation.php --force. The --force should not be necessary (plus, a null edit to a page should fix category links that are broken in such a fashion).

Note: we probably don't want this to be tied to the wiki's content language, as that may break on upgrade from older MediaWiki versions (on the other hand, update.php would fix this), and people probably don't expect that changing the language code would cause such a disruptive change.

(In reply to comment #2)

> the issue with prefix collisions in the first-letters-root.ser file (there is a bug for this)

Bug 43740.

(In reply to comment #2)

> I think we should probably add a cl_collation_version field to the categorylinks table. If someone upgrades PHP, the UCA version changes and everything breaks. At the very least, update.php should fix this. At the moment one needs to run updateCollation.php --force. The --force should not be necessary (plus, a null edit to a page should fix category links that are broken in such a fashion).

Good point. I wonder how this relates to the 'chinese-collation' branch and Liangent's support for using multiple collations at once? (bug 44667)

> Note: we probably don't want this to be tied to the wiki's content language, as that may break on upgrade from older MediaWiki versions (on the other hand, update.php would fix this), and people probably don't expect that changing the language code would cause such a disruptive change.

As above, this might actually sort of "just work" if we support using multiple collations at once. If we can't get it to, though, then yes, it can't depend on $wgLanguageCode.

Well, but which languages need a collation change? Is it needed for English or Russian? And do we have collations for all languages?

(In reply to comment #4)

> Well, but which languages need a collation change? Is it needed for English or Russian?

Sort of. While the "native" letters for both of these sort correctly by default (with the 'uppercase' collation), their accented variants are placed at the very end of category page listings, which might be undesirable.

I know that the English Wikipedia uses {{DEFAULTSORT:}} hacks to enforce behavior similar to what the UCA collations do (by sorting by the article title with all accents removed); I don't know what is done in other languages.
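A small Python sketch of the difference, for illustration only. It approximates the 'uppercase' collation as "uppercase the title, then compare code points" and approximates the accent-stripping workaround with Unicode decomposition; neither function is MediaWiki code, neither is a real UCA implementation, and the titles and function names are made up for the example.

```python
import unicodedata


def uppercase_key(title: str) -> str:
    """Approximation of the 'uppercase' collation: uppercase, then compare code points."""
    return title.upper()


def accent_stripped_key(title: str) -> str:
    """Approximation of a sort key that removes accents before sorting."""
    decomposed = unicodedata.normalize("NFD", title)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.upper()


titles = ["Zebra", "Éclair", "Apple", "Eagle"]

# Accented letters have higher code points than A-Z, so "Éclair" lands at the end.
print(sorted(titles, key=uppercase_key))
# ['Apple', 'Eagle', 'Zebra', 'Éclair']

# With accents removed from the key, "Éclair" sorts among the other E entries.
print(sorted(titles, key=accent_stripped_key))
# ['Apple', 'Eagle', 'Éclair', 'Zebra']
```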

> And do we have collations for all languages?

No (not yet :) ). 67 languages are supported now (including, I think, all major European ones; there's a list at the bottom of [[mw:Manual:$wgCategoryCollation]], and more could be added if only someone did this), and Liangent is working on collations for Chinese.

(In reply to comment #3)

> (In reply to comment #2)
>
> > I think we should probably add a cl_collation_version field to the categorylinks table. If someone upgrades PHP, the UCA version changes and everything breaks. At the very least, update.php should fix this. At the moment one needs to run updateCollation.php --force. The --force should not be necessary (plus, a null edit to a page should fix category links that are broken in such a fashion).
>
> Good point. I wonder how this relates to the 'chinese-collation' branch and Liangent's support for using multiple collations at once? (bug 44667)

So if cl_collation_version looks outdated, we update the sortkey automatically for that entry? I guess updating sortkeys partially (only for the entries we're reading) breaks category pages even more until sysadmins run updateCollation.php to update them fully.
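To make the proposal concrete, here is a rough Python sketch of the check being discussed. The `CategoryLink` class, the `stale_links` function and the version strings are invented for illustration; only the idea of a cl_collation_version field comes from this thread, and the sketch does not assume such a field exists in the schema.

```python
from dataclasses import dataclass


@dataclass
class CategoryLink:
    page: str
    sortkey: str
    collation_version: str  # models the proposed cl_collation_version field


def stale_links(links, current_version):
    """Return the links whose sortkeys were built with a different collation version."""
    return [link for link in links if link.collation_version != current_version]


links = [
    CategoryLink("Apple", "sortkey-a", "v1"),
    CategoryLink("Eagle", "sortkey-e", "v2"),
    CategoryLink("Zebra", "sortkey-z", "v1"),
]

# Rows still carrying the old version would need their sortkeys recomputed.
for link in stale_links(links, current_version="v2"):
    print(link.page)
```

Until every stale row has been recomputed, a category listing mixes sortkeys from two versions, which is the partial-update concern raised in this comment.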

(In reply to comment #5)

> (In reply to comment #4)
>
> > And do we have collations for all languages?
>
> No (not yet :) ). 67 languages are supported now (including I think all major European ones, there's a list at the bottom of [[mw:Manual:$wgCategoryCollation]]; more could be added if only someone did this), and Liangent is working on collations for Chinese.

It's almost done (except for what can't easily be done now due to an external dependency) and pending review.

kaldari renamed this task from Make the "uca-xx" category collations the default? (selectable directly in the installer?) to Make the "uca-xx" category collations the default?. Sep 20 2016, 10:33 PM
kaldari updated the task description.
kaldari set Security to None.
kaldari renamed this task from Make the "uca-xx" category collations the default? to Make the "uca-xx" category collations the default and selectable in the installer. Sep 21 2016, 1:42 AM

I've tried to clarify the scope and requirements of this task by breaking off a child task (T146225) and rewriting the description. Please let me know if anything sounds wrong or could be improved.

I can't really think of any case where using a non-UCA collation or a non-numeric collation would be desirable on any new wiki (except when the 'intl' extension is missing). We can't change the default for existing wikis (even if we made the updater run the updateCollation.php script, the wiki might have DEFAULTSORT conventions that wouldn't work well with a different collation), but I think we should promote UCA with numeric as the default.

So, I think the installer should only provide the following options, with the following descriptions:

| Value | Description | Note |
| --- | --- | --- |
| uca-xx-u-kn | Language-specific collation for {{language:xx}} | Only available if intl installed and language xx has a collation |
| uca-default-u-kn | Universal collation, best for multilingual wikis | Only available if intl installed |
| uppercase | Fallback case-insensitive collation | |

In the rare case where a different collation is needed it can still be changed in LocalSettings afterwards. The installer doesn't create any categories, so no maintenance script would be needed.

DannyH set the point value for this task to 3. Sep 22 2016, 5:38 PM
DannyH moved this task from Needs Discussion to Up Next (May 6-17) on the Community-Tech board.

Legoktm has recommended against adding an interface for setting the collation in the installer (T146225#2676705), so instead we will just concentrate on setting an optimal default during installation.

kaldari renamed this task from Make the "uca-xx" category collations the default and selectable in the installer to Make the "uca-xx" category collations the default during installation (with fallbacks). Dec 14 2016, 9:47 PM
kaldari updated the task description.

Change 327762 had a related patch set uploaded (by Niharika29):
[WIP] Make language-appropriate collations default during install

https://gerrit.wikimedia.org/r/327762