We should add support for language-specific UCA collations to the installer (and make them the default for non-English languages). For example, if someone sets their wiki's language to French in the first step of the installer and the intl extension is installed, the installer should set the collation to "uca-fr-u-kn". If the intl extension is not installed, it should set the collation to "numeric". The installer will need to check whether the language is supported by MediaWiki's IcuCollation class (see https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation#Language-specific_collations); if it is not, it should use "uca-default-u-kn".
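The selection rule described above can be sketched roughly as follows. This is only an illustration in Python, not installer code: the function name `default_collation`, its parameters, and the one-element set of supported languages are assumptions made for the example. The real check would consult IcuCollation's list of tailored languages. How English itself should be handled is not spelled out above, so the sketch does not special-case it.

```python
def default_collation(language_code, intl_available, supported_languages):
    """Pick a default category collation for a new wiki.

    language_code       -- the wiki's content language, e.g. "fr"
    intl_available      -- whether the PHP intl extension is installed
    supported_languages -- codes that have a language-specific UCA collation
                           (hypothetical input; the real list lives in IcuCollation)
    """
    if not intl_available:
        # UCA collations need intl, so fall back to the plain numeric collation.
        return "numeric"
    if language_code in supported_languages:
        # Language-specific UCA collation with numeric sorting ("-u-kn").
        return f"uca-{language_code}-u-kn"
    # intl is present but the language has no tailored collation.
    return "uca-default-u-kn"


# Illustrative placeholder set; "xx" stands for any unsupported language code.
supported = {"fr"}
print(default_collation("fr", True, supported))
print(default_collation("fr", False, supported))
print(default_collation("xx", True, supported))
```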
Description
Details
- Reference
- bz45611
Subject | Repo | Branch | Lines +/-
---|---|---|---
Make language-appropriate collations default during install | mediawiki/core | master | +22 -1
Status | Assigned | Task
---|---|---
Open | None | T32672 Use locale-specific sorting (tracking)
Open | None | T32754 Use correct sorting for in prefix searches
Open | None | T32673 Implement central locale-specific, or tailored, sorting framework (tracking)
Open | None | T47611 Make the "uca-xx" category collations the default during installation (with fallbacks)
Declined | None | T146225 Add ability to set category collation from the installer
Open | None | T146341 ICU collations should have the ICU version number stored with the name
Open | None | T158724 Increase size of categorylinks.cl_collation column
Resolved | None | T45740 IcuCollation doesn't prune first letter elements that duplicate a prefix of another first letter's sortkey
Declined | None | T45802 Do not bundle first-letters-root.ser with MediaWiki since it's only valid for one version of ICU
Resolved | matmarex | T45804 Docs about files needed by generateCollationData.php should be updated
Resolved | matmarex | T45801 ICUCollation needs to know the version of ICU library
Event Timeline
CC-ing everyone who filed/commented on the bugs about deploying uca-xx collations on Wikimedia wikis. Any input, guys/gals? :)
I would go with making it selectable in the installer (maybe even being the default). However, there are 2 issues I would like to see fixed before doing that:
- the issue with prefix collisions in the first-letters-root.ser file (there is a bug for this)
- I think we should probably add a cl_collation_version field to the categorylinks table. If someone upgrades php, the uca version changes and everything breaks. At the very least update.php should fix this. Atm one needs to run updateCollation.php --force. The force should not be necessary (plus a null edit to a page should fix category links that are broken in such a fashion)
Note: we probably don't want this to be tied to the wiki's content lang, as that may break on upgrade from older mw (otoh update.php would fix this), and people probably don't expect that changing the lang code would cause such a disruptive change.
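To make the cl_collation_version idea concrete, here is a rough Python sketch of the staleness check it would allow. Everything in it is hypothetical: the `CategoryLink` record, the `collation_version` field (standing in for the proposed column), and the version strings are placeholders, not existing MediaWiki schema or API.

```python
from dataclasses import dataclass


@dataclass
class CategoryLink:
    sortkey: str
    collation: str          # collation name stored with the row
    collation_version: str  # hypothetical cl_collation_version value


def needs_sortkey_update(link, current_collation, current_version):
    """Return True if the stored sortkey was built with a different
    collation name or a different version of the collation data."""
    return (
        link.collation != current_collation
        or link.collation_version != current_version
    )


# Placeholder version strings "A" and "B" stand for two library versions.
old_row = CategoryLink(sortkey="...", collation="uca-fr", collation_version="A")
print(needs_sortkey_update(old_row, "uca-fr", "A"))  # same name and version
print(needs_sortkey_update(old_row, "uca-fr", "B"))  # library was upgraded
```

With a check like this, update.php or updateCollation.php could find stale rows without needing --force, since a version change would be visible even when the collation name stays the same.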
(In reply to comment #2)
> the issue with prefix collisions in the first-letters-root.ser file (there is a bug for this)
Bug 43740.
(In reply to comment #2)
> I think we should probably add a cl_collation_version field to categorylinks table. If someone upgrades php the uca version changes and everything breaks. At the very least update.php should fix this. Atm one needs to run updateCollation.php --force. The force should not be necessary (plus a null edit to a page should fix category links that are broken in such a fashion)
Good point. I wonder how this relates to the 'chinese-collation' branch and Liangent's support for using multiple collations at once? (bug 44667)
> Note: we probably don't want this to be tied to the wikis content lang as that may break on upgrade from older mw (otoh update.php would fix this) and people probably don't expect that changing the lang code would cause such a disruptive change.
As above, this might actually sort of "just work" if we support using multiple collations at once. If we can't get it to, though, then yes, it can't depend on $wgLanguageCode.
Well, but which languages need to change collation? Is it needed for English or Russian? And do we have collations for all languages?
(In reply to comment #4)
> Well, but which languages need to change collation? Is it needed for English or Russian?
Sort of. While the "native" letters for both of these sort correctly by default (with the 'uppercase' collation), their accented variants are placed at the very end of category page listings, which might be undesirable.
I know that the English Wikipedia uses {{DEFAULTSORT: hacks to enforce behavior similar to what the UCA collations do (by sorting by the article title with all accents removed); I don't know what is done in other languages.
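A small Python sketch of the effect described here. It only approximates the 'uppercase' collation by sorting on the upper-cased title in code point order, and it imitates the DEFAULTSORT workaround by stripping accents before sorting. The helper name `strip_accents` and the sample titles are made up for illustration. A real UCA collation does much more than strip accents.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "\u00c9clair" is "Éclair" written with a precomposed É.
titles = ["Zebra", "\u00c9clair", "Apple", "Eclipse"]

# Approximation of the 'uppercase' collation: compare upper-cased code points.
by_uppercase = sorted(titles, key=str.upper)

# Approximation of sorting by the title with all accents removed.
by_stripped = sorted(titles, key=lambda t: strip_accents(t).upper())

print(by_uppercase)  # ['Apple', 'Eclipse', 'Zebra', 'Éclair']
print(by_stripped)   # ['Apple', 'Éclair', 'Eclipse', 'Zebra']
```

In the first ordering the accented title lands after "Zebra" because É has a higher code point than any unaccented Latin capital; in the second it sorts among the other E titles.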
> And do we have collations for all languages?
No (not yet :) ). 67 languages are supported now (including I think all major European ones, there's a list at the bottom of [[mw:Manual:$wgCategoryCollation]]; more could be added if only someone did this), and Liangent is working on collations for Chinese.
(In reply to comment #3)
> (In reply to comment #2)
> > I think we should probably add a cl_collation_version field to categorylinks table. If someone upgrades php the uca version changes and everything breaks. At the very least update.php should fix this. Atm one needs to run updateCollation.php --force. The force should not be necessary (plus a null edit to a page should fix category links that are broken in such a fashion)
> Good point. I wonder how this relates to the 'chinese-collation' branch and Liangent's support for using multiple collations at once? (bug 44667)
So if cl_collation_version looks outdated, we update the sortkey automatically for that entry? I suspect that updating sortkeys partially (only for the entries being read) would break category pages even more until sysadmins run updateCollation.php to update them all.
(In reply to comment #5)
(In reply to comment #4)
> > And do we have collations for all languages?
> No (not yet :) ). 67 languages are supported now (including I think all major European ones, there's a list at the bottom of [[mw:Manual:$wgCategoryCollation]]; more could be added if only someone did this), and Liangent is working on collations for Chinese.
It's (almost, except for what can't be done easily now due to external dependency) done and pending review.
I've tried to clarify the scope and requirements of this task by breaking off a child task (T146225) and rewriting the description. Please let me know if anything sounds wrong or could be improved.
I can't really think of any case where using a non-UCA collation or a non-numeric collation would be desirable on any new wiki (except when the 'intl' extension is missing). We can't change the default for existing wikis (even if we made the updater run the updateCollation.php script, the wiki might have DEFAULTSORT conventions that wouldn't work well with a different collation), but I think we should promote UCA with numeric as the default.
So, I think the installer should only provide the following options, with the following descriptions:
Value | Description | Note |
---|---|---|
uca-xx-u-kn | Language-specific collation for {{language:xx}} | Only available if intl installed and language xx has a collation |
uca-default-u-kn | Universal collation, best for multilingual wikis | Only available if intl installed |
uppercase | Fallback case-insensitive collation | |
In the rare case where a different collation is needed, it can still be changed in LocalSettings.php afterwards. The installer doesn't create any categories, so no maintenance script would be needed.
Legoktm has recommended against adding an interface for setting the collation in the installer (T146225#2676705), so instead we will just concentrate on setting an optimal default during installation.
Change 327762 had a related patch set uploaded (by Niharika29):
[WIP] Make language-appropriate collations default during install