Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki
Open, Medium, Public
Assigned To

None

Authored By

	• brooke
	Sep 19 2011, 3:54 PM

Description

According to Roan in T32287: Implement uca-fa collation comment 9, actually enabling the uca-default collation stuff that was "fixed" for T2164: Support collation by a certain locale (sorting order of characters) is waiting on an Ubuntu upgrade on the apache cluster (T31915: Upgrade the WMF-cluster >= Ubuntu 10.04?).

There are a few bugs which it looks like should be resolved (for Categories at least) by enabling this -- eg T32287: Implement uca-fa collation (Farsi sorting problems); others require further work (T31788: Swedish-language wikis should use Swedish-locale sorting (ie. ÅÄÖ should sort correctly) needs a Swedish-specific collation setting).

Details

Reference: bz30996

Related Objects

Status	Subtype	Assigned	Task
Resolved		hashar	T119138 [keyresult] Migrate majority of CI jobs to Nodepool (part 2)
Resolved		hashar	T119139 [keyresult] Migrate php (Zend and HHVM) CI jobs to Nodepool
Resolved		Joe	T125821 Provide a HHVM package for jessie-wikimedia matching version of trusty-wikimedia
Resolved		kaldari	T128483 Fix category headers for pages that begin with numbers
Resolved		kaldari	T8948 Natural number sorting in category listings
Resolved		Legoktm	T75901 Drop PHP 5.3 support
Declined		• demon	T91590 [Spike] Try out hack (<?hh) for mediawiki-config
Resolved		Joe	T104147 can we get rid of rsvg security patch?
Resolved		Reedy	T94149 Get rid of Zend 5.5 tests for wmf branches
Resolved		None	T86081 Complete the use of HHVM over Zend PHP on the Wikimedia cluster
Resolved		None	T88088 Incorrect sorting in categories on Russian-language projects
Resolved		Joe	T129411 Run `php maintenance/updateCollation.php --force` on all Russian-language projects using uca-ru collation
Resolved		None	T131748 Refresh the appservers puppet code/configs
Resolved		Joe	T131749 Make all role::mediawiki::* classes compatible with debian jessie
Resolved		None	T136281 Broken sorting and multi-page categories for Cyrillic wikis
Open		None	T32672 Use locale-specific sorting (tracking)
Open		None	T32754 Use correct sorting for in prefix searches
Open		None	T32673 Implement central locale-specific, or tailored, sorting framework (tracking)
Open		None	T32996 Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki
Open		None	T30397 Allow collation to be specified per category
Open	Feature	None	T37378 Support multiple collations at the same time
Open		None	T46667 Allow multiple collations in same site and configure zh collations
Open	Feature	None	T223750 Include pinyin for zhwiki damaging model
Open	Feature	None	T170049 Chinese site category pages should be able to sort pages by Hanyu Pinyin
Resolved		tstarling	T37632 Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Portuguese Wikipedia
Resolved		matmarex	T48081 Set $wgCategoryCollation to 'uca-default' on Polish Wiktionary and rebuild category sort keys
Resolved		None	T50097 Request Sorting Thai Wikipedia (and sister projects) with UCA
Resolved		None	T45185 Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Portuguese Wikibooks
Open	Feature	None	T47443 Deploy language-specific "uca-xx" collations on Wikimedia wikis
Resolved		None	T44413 Set $wgCategoryCollation to 'uca-pl' on Polish Wikipedia and rebuild category sort keys
Resolved		None	T48058 sorting order in categories for sv.wikisource
Resolved	Feature	Ebrahim	T48235 Kurdish Wikipedia: Alphabetical order in the categories (collation)
Resolved		matmarex	T48330 Set $wgCategoryCollation to 'uca-fi' on Finnish wikis and rebuild category sort keys
Resolved		None	T54015 Change ckb wiki to use uca-ckb sort order
Resolved		matmarex	T47444 Set $wgCategoryCollation to 'uca-uk' on Ukrainian Wikipedia and rebuild category sort keys
Resolved		matmarex	T47446 Set $wgCategoryCollation to 'uca-sv' on Swedish Wikipedia and rebuild category sort keys
Resolved		None	T52311 Set collation on fa wikis to uca-fa
Resolved		None	T32287 Implement uca-fa collation
Resolved		Reedy	T47525 beta: set $wgCategoryCollation for languages
Resolved		matmarex	T47596 Set $wgCategoryCollation to 'uca-hu' on Hungarian Wikipedia and rebuild category sort keys
Resolved		matmarex	T47776 Set $wgCategoryCollation to 'uca-uk' on Ukrainian wikis (other than Wikipedia) and rebuild category sort keys
Resolved		matmarex	T43040 Proper collation support in categories for Ukrainian wikis
Resolved		None	T56168 Category collation for Estonian projects
Resolved		matmarex	T47911 Set $wgCategoryCollation to 'uca-pt' on Portuguese Wikipedia and Wikibooks and rebuild category sort keys
Resolved		matmarex	T47968 Set $wgCategoryCollation to 'uca-pl' on Polish Wikivoyage and rebuild category sort keys
Resolved		matmarex	T44412 Implement locale-specific sorting for Polish language
Resolved		None	T47970 Make updateCollation.php process categorylinks on a category-by-category basis
Resolved		matmarex	T47979 Set $wgCategoryCollation to 'uca-vi' on all Vietnamese wikis and rebuild category sort keys
Resolved		matmarex	T48004 Set $wgCategoryCollation to 'uca-be' on be.wiki and be.wikisource and rebuild category sort keys
Resolved		matmarex	T48005 Set $wgCategoryCollation to 'uca-be-tarask' on be-x-old.wiki and rebuild category sort keys
Resolved		tstarling	T48036 Upgrade to ICU 4.8 for WMF
Resolved		None	T31915 Upgrade the WMF-cluster >= Ubuntu 10.04
Resolved		tomasz	T56680 Set $wgCategoryCollation to 'uca-fr' on the French Wikipedia and rebuild category sort keys
Resolved		Reedy	T66885 Set $wgCategoryCollation to 'uca-cs' on Czech Wikipedia and rebuild category sort keys
Resolved		None	T61800 Welsh Alphabet on Categories
Resolved		Reedy	T67003 Set $wgCategoryCollation to 'uca-lv' on the Latvian Wikipedia and rebuild category sort keys
Resolved		Reedy	T58859 Set $wgCategoryCollation to 'uca-is' on Icelandic Wikipedia and rebuild category sort keys
Resolved		Reedy	T69287 $wgCategoryCollation for sh.wikipedia.org
Resolved		Reedy	T74513 Set wgCategoryCollation to uca-fr on frwikibooks
Resolved		Reedy	T68165 $wgCategoryCollation modification for the French Wikinews
Resolved		Reedy	T71782 enable CategoryCollation on frwikiversity
Resolved		kaldari	T75453 Tamil sort order
Resolved		Glaisher	T86821 Set $wgCategoryCollation to 'uca-pl' on Polish Wikisource and rebuild category sort keys
Resolved		tomasz	T90689 Set $wgCategoryCollation to 'uca-hsb' on Upper Sorbian Wikipedia (hsb.wp) and rebuild category sort keys
Resolved		Joe	T86096 Switch HAT appservers to trusty's ICU (or newer)
Resolved		Joe	T78765 Convert jobrunners to HHVM
Resolved		Joe	T84842 Convert eqiad imagescalers to HHVM, Trusty
Resolved		tstarling	T91468 HHVM with FastCGI does not support streaming output
Resolved		Joe	T93194 Create an HHVM 3.6.0 package, adding Tim's streaming patch
Declined		None	T128806 Switch German Wikipedia to uca-de category collation
Resolved		kaldari	T58041 updateCollation.php script prohibitively slow for very large wikis
Resolved		jcrespo	T130692 Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases
Resolved		Volans	T128353 Switchover to new s3 master
Resolved	PRODUCTION ERROR	aaron	T126436 Spikes of mediawiki in read only for job runners after altering the s2 slaves topology
Invalid		None	T126632 Scap should restart job runners to pick up new config
Declined		None	T165202 Prepare a bot to simplify sortkeys at de.wiki
Resolved		kaldari	T136113 Investigation: Figure out how to switch as many wikis as possible over to uca-xx collations
Resolved		Niharika	T136150 Switch English Wikipedia to uca-default collation
Resolved		Johan	T144081 Notify English Wikipedia of switch to uca-default collation
Duplicate		None	T144580 updateCollation.php on terbium still run code from 1.28.0-wmf.16 against enwiki ( LoadBalancer::reallyOpenConnection: 402+ connections made (master=db1057) LoadBalancer.php line 850 )
Open		None	T144634 Investigation: New sort order: Hyphenated words should be sorted lower than the prefix
Resolved		Dereckson	T136647 Set UCA-IT as it.wiki's collation
Resolved		• DannyH	T144840 Investigation: Numerical sorting for more wikis
Resolved		kaldari	T146675 Convert more wikis to numerical sorting
Resolved		kaldari	T148873 Bugs with numerical sorting on Bengali
Resolved		kaldari	T148885 Add support for Bengali to IcuCollation class
Resolved		Johan	T148488 Figure out if no.wikipedia.org wants UCA
Resolved		Niharika	T148682 Convert wikis to numerical sorting batch #3
Resolved		Niharika	T148749 Set $wgCategoryCollation to 'uca-hr' on Croatian Wikipedia and rebuild category sort keys
Resolved		Amire80	T162823 Changing the alphabetical sorting (collation) @ ba.wikipedia.org
Resolved		Strainu	T168711 Changing the alphabetical sorting (collation) @ ro.wikipedia.org
Stalled		None	T176434 set $wgCategoryCollation to uca-th on th lang projects
Resolved		jhsoby-WMNO	T182431 Switch category collation for sewiki to uca-se-u-kn
Resolved		None	T189295 ICU 57 migration for wikis using non-default collation
Resolved		Quiddity	T189486 Announcing ICU 57 transition to the community
Resolved		Ladsgroup	T190965 Remove uca-fa from beta cluster
Resolved		matmarex	T183430 Changing the alphabetical sorting (collation) @ ab.wikipedia.org (Abkhaz Wikipedia)
Resolved		MarcoAurelio	T183802 Set uca-es-u-kn as category collation for es.wikipedia
Resolved		jhsoby-WMNO	T181503 Add proper category collation for the Northern Sami Wikipedia

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 21 2014, 11:51 PM

• bzimport added projects: Wikimedia-Site-requests, Tracking-Neverending.

• bzimport set Reference to bz30996.

• bzimport added a subscriber: Unknown Object (MLST).

• brooke created this task. Sep 19 2011, 3:54 PM

Closing LATER until apaches are all upgraded

Relevant dependencies as RT tickets:

http://rt.wikimedia.org/Ticket/Display.html?id=22 full update to Lucid (bug 29915)

http://rt.wikimedia.org/Ticket/Display.html?id=652 install icu & php5-intl (depends on the above)

py wrote:

rt 22 and 652 are done. this can probably be closed.

(In reply to comment #3)

> rt 22 and 652 are done. this can probably be closed.

Well this still needs someone to make the changes to MediaWiki's config file and run the maintenance script.

The first letter identification code (maintenance/language/generateCollationData.php) won't work for all languages, so some wikis will have their category pages broken terribly by this change. Also, the default collation tables sort a lot of languages incorrectly, and the amount of breakage that causes will depend on the language in question. So I recommend doing this change on a language-by-language basis, after checking each language for correct collation and first-letter behaviour on a test wiki.

Also, it would be nice to know in advance what percentage of sort keys will be larger than the 230 bytes allowed by the database field, and if that percentage is significant, whether there are categories on the target wikis where the order will be changed by truncation after 230 bytes.
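The truncation concern above can be sketched roughly as follows. This is only an illustration: `make_sort_key` is a hypothetical stand-in (upper-cased UTF-8), not the UCA/ICU sort key generator, and the sample titles are invented. Only the 230-byte limit comes from the comment.

```python
# Sketch: how truncating sort keys to a fixed byte budget can create ties.
# make_sort_key is a hypothetical stand-in, NOT the real UCA/ICU algorithm.

MAX_SORTKEY_BYTES = 230  # field limit mentioned in the comment above


def make_sort_key(title: str) -> bytes:
    """Hypothetical sort key: the upper-cased title encoded as UTF-8."""
    return title.upper().encode("utf-8")


def truncated_key(title: str, limit: int = MAX_SORTKEY_BYTES) -> bytes:
    """Sort key as it would be stored in a column capped at `limit` bytes."""
    return make_sort_key(title)[:limit]


def oversize_fraction(titles, limit: int = MAX_SORTKEY_BYTES) -> float:
    """Share of titles whose full sort key does not fit in `limit` bytes."""
    if not titles:
        return 0.0
    too_long = sum(1 for t in titles if len(make_sort_key(t)) > limit)
    return too_long / len(titles)


# Invented sample: two long titles that differ only after byte 230.
prefix = "x" * 230
titles = [prefix + "b", prefix + "a", "short title"]

full_order = sorted(titles, key=make_sort_key)
stored_order = sorted(titles, key=truncated_key)

print(oversize_fraction(titles))
print(full_order == stored_order)
```

With the truncated keys the two long titles compare equal, so their relative order falls back to whatever order the rows happen to arrive in. Running `oversize_fraction` over a real title dump would give the percentage asked about, under the assumption that the stand-in key is replaced by the actual one.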

Any progress on this?

On Portuguese Wikipedia we still need to use
{{DEFAULTSORT: Page Name without accents }}
on any article whose title has an accent if we want it to be sorted appropriately in the categories. E.g.:
https://pt.wikipedia.org/w/index.php?title=%C3%81gua_Boa&oldid=28441112&action=edit
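A rough sketch of why the DEFAULTSORT workaround is needed, assuming the current ordering compares upper-cased code points. The accent-folding key below is only an approximation of what an accent-aware collation does, and the titles other than "Água Boa" are made up for illustration.

```python
import unicodedata


def uppercase_key(title: str) -> str:
    """Approximates an 'uppercase'-style collation: upper-cased code points."""
    return title.upper()


def accent_folded_key(title: str) -> str:
    """Rough stand-in for accent-insensitive ordering; not the real UCA."""
    decomposed = unicodedata.normalize("NFD", title.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


titles = ["Zebra", "Água Boa", "Agulha", "Brasil"]

by_uppercase = sorted(titles, key=uppercase_key)
by_folded = sorted(titles, key=accent_folded_key)

print(by_uppercase)  # "Água Boa" lands after "Zebra"
print(by_folded)     # "Água Boa" lands among the other A titles
```

Writing `{{DEFAULTSORT:Agua Boa}}` by hand has the same effect as the folded key, which is why a collation that handles accents would make the manual step unnecessary.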

Maybe adding a note to [[mw:Roadmap]] would be appropriate?

Some related info:

I created some collations for Chinese, which are expected to be used on zhwiki. This code requires ICU 4.8+ to run. The current php5-intl in WMF's APT repo uses libicu42, and existing wikis with uca-default (ptwiki) have sort keys generated with libicu42. Once libicu is updated, all existing uca-default sort keys will need to be rebuilt.

Btw, Meta and especially Commons may be good next targets for deploying uca-default. Both are multilingual, so using the root collation seems ideal.

'wgCategoryCollation' => array(
    'default' => 'uppercase',
    'ptwiki' => 'uca-default', # bug 35632
    'iswiktionary' => 'identity', # bug 30722
),
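For readers unfamiliar with this kind of per-wiki setting, here is a simplified sketch of how such a table is read: a wiki-specific entry wins, otherwise the 'default' entry applies. The function name `collation_for` is hypothetical, and the real configuration loader may apply more rules than this.

```python
# Simplified sketch of per-wiki setting resolution (assumption: only an
# exact database-name match or the 'default' entry is considered).
CATEGORY_COLLATION = {
    "default": "uppercase",
    "ptwiki": "uca-default",      # bug 35632
    "iswiktionary": "identity",   # bug 30722
}


def collation_for(dbname: str, settings: dict = CATEGORY_COLLATION) -> str:
    """Return the wiki-specific value if present, else the 'default' entry."""
    return settings.get(dbname, settings["default"])


for wiki in ("ptwiki", "iswiktionary", "enwiki"):
    print(wiki, collation_for(wiki))
```

Under this reading, every wiki not listed explicitly, such as enwiki, still gets 'uppercase'.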

I'm presuming this is fixed now...

(In reply to comment #9)

> 'wgCategoryCollation' => array(
>     'default' => 'uppercase',
>     'ptwiki' => 'uca-default', # bug 35632
>     'iswiktionary' => 'identity', # bug 30722
> ),
>
> I'm presuming this is fixed now...

Umm only for ptwiki.

Just to clarify this bug: we probably should *not* do this for all wikis. As Tim said above, more MediaWiki code is needed to make it work properly.

However, this can (and should, IMO) be done on all English, Portuguese, and multilingual (Meta and Commons) wikis.

I guess, a rough list for this would be:

reedy@fenari:/home/wikipedia/common$ grep enw all.dblist
arbcom_enwiki
enwiki
enwikibooks
enwikinews
enwikiquote
enwikisource
enwikiversity
enwikivoyage
enwiktionary
tenwiki
wg_enwiki
reedy@fenari:/home/wikipedia/common$ grep ptw all.dblist
ptwiki
ptwikibooks
ptwikinews
ptwikiquote
ptwikisource
ptwikiversity
ptwikivoyage
ptwiktionary

+brwikimedia

reedy@fenari:/home/wikipedia/common$ cat special.dblist
advisorywiki
arbcom_dewiki
arbcom_enwiki
arbcom_fiwiki
arbcom_nlwiki
auditcomwiki
boardgovcomwiki
boardwiki
chairwiki
chapcomwiki
checkuserwiki
collabwiki
commonswiki
donatewiki
execwiki
fdcwiki
foundationwiki
grantswiki
incubatorwiki
internalwiki
mediawikiwiki
metawiki
movementroleswiki
nostalgiawiki
officewiki
otrs_wikiwiki
outreachwiki
qualitywiki
searchcomwiki
sourceswiki
spcomwiki
specieswiki
stewardwiki
strategywiki
tenwiki
test2wiki
testwiki
usabilitywiki
wg_enwiki
wikimania2005wiki
wikimania2006wiki
wikimania2007wiki
wikimania2008wiki
wikimania2009wiki
wikimania2010wiki
wikimania2011wiki
wikimania2012wiki
wikimania2013wiki
wikimaniateamwiki
wikidatawiki

Do the rest of the is projects want to become identity too?

reedy@fenari:/home/wikipedia/common$ grep isw all.dblist
iswiki
iswikibooks
iswikiquote
iswikisource
iswiktionary

(In reply to comment #13)

> Do the rest of the is projects want to become identity too?
>
> reedy@fenari:/home/wikipedia/common$ grep isw all.dblist
> iswiki
> iswikibooks
> iswikiquote
> iswikisource
> iswiktionary

I would imagine so. The language is case sensitive from what I understand. I guess we should ask.

Realistically, it doesn't matter that much for a wiki like wikimania2006, since nobody is using it. Although it certainly wouldn't hurt anything.

For larger wikis (where it would take more than a couple of hours to run the script) we would probably want to talk to the local community, as pages in categories will be out of order while the script is running. It's too bad the script doesn't go in order of cl_to instead of cl_from, as that would minimize disruption somewhat.
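The cl_from versus cl_to point can be illustrated with a toy simulation. This is only a sketch: the rows, the batch size, and the function name are invented, and it says nothing about how the maintenance script is actually written. It counts how many categories contain a mix of old and new sort keys at the worst moment of the run.

```python
from collections import defaultdict


def max_mixed_categories(rows, order_by, batch_size=2):
    """Process (cl_from, cl_to) rows in batches and return the largest number
    of categories that were only partly updated after any batch."""
    index = 0 if order_by == "cl_from" else 1
    ordered = sorted(rows, key=lambda row: row[index])

    total = defaultdict(int)
    for _, category in rows:
        total[category] += 1

    done = defaultdict(int)
    worst = 0
    for start in range(0, len(ordered), batch_size):
        for _, category in ordered[start:start + batch_size]:
            done[category] += 1
        # A category is "mixed" when some but not all of its rows are updated.
        mixed = sum(1 for c in total if 0 < done[c] < total[c])
        worst = max(worst, mixed)
    return worst


# Hypothetical data: pages 1..4, each in categories A, B and C.
rows = [(page, cat) for page in range(1, 5) for cat in ("A", "B", "C")]

print(max_mixed_categories(rows, "cl_from"))
print(max_mixed_categories(rows, "cl_to"))
```

In this toy data, going page by page leaves every category half-converted for most of the run, while going category by category keeps at most one category in a mixed state at a time.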

Adjusting the summary: "Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Wikimedia wikis deployment" -> "Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki".

Per bug 45443, we don't really want uca-default anywhere anymore (apart from multi-language projects like Commons or Meta), but language-specific collations.

Krenair moved this task from Backlog to Tracking on the Wikimedia-Site-requests board. Apr 10 2015, 11:29 AM

Restricted Application added subscribers: Josve05a, Matanya, Aklapper. Oct 18 2015, 1:58 PM

Quiddity mentioned this in T120854: Investigation: Numerical sorting in categories. Jan 22 2016, 8:17 PM

Ricordisamoa subscribed. Jan 22 2016, 8:48 PM

Restricted Application added a subscriber: JEumerus. Jan 22 2016, 8:48 PM

Meno25 unsubscribed. Feb 8 2016, 7:45 PM

• MZMcBride subscribed. May 28 2016, 1:32 PM

In T32996#350751, @matmarex wrote:

Adjusting the summary: "Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Wikimedia wikis deployment" -> "Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki".

Per bug 45443, we don't really want uca-default anywhere anymore (apart from multi-language projects like Commons or Meta), but language-specific collations.

At https://en.wikipedia.org/wiki/Wikipedia_talk:Categorization#OK_to_switch_English_Wikipedia.27s_category_collation_to_uca-default.3F @kaldari has proposed using "uca-default"; are you saying we should be using "uca-en" or something?

• MZMcBride mentioned this in T136113: Investigation: Figure out how to switch as many wikis as possible over to uca-xx collations. May 28 2016, 1:38 PM

• MZMcBride mentioned this in T136150: Switch English Wikipedia to uca-default collation. May 28 2016, 1:42 PM

uca-en and uca-default are the same.

I think MediaWiki has some code where it tries to force you to use uca-default over uca-en.

Hmm, "uca-en" might be a bit neater, but it is indeed probably the exact same thing as "uca-default" (I haven't tried to check this).

Danny_B subscribed. May 28 2016, 2:33 PM

Danny_B updated the task description. May 28 2016, 2:37 PM

Danny_B removed a subscriber: • wikibugs-l-list.

...we don't really want uca-default anywhere anymore (apart from multi-language projects like Commons or Meta)

@matmarex: Why is that? It seems that most languages do not have language-specific uca-collations yet. Wouldn't it be better to switch them to uca-default rather than uppercase collation?

Maybe? Probably not? You can't tell without researching each language a little bit (or asking a native speaker). At least for languages using the Latin or Cyrillic scripts with additional letters with diacritics it is not always a good idea, since letters with diacritics might need to be ordered differently than the basic versions.

For example in Polish, ordering "L" and "Ł" as if they were the same letter is just as wrong as ordering "Ł" at the end of the alphabet, and in my opinion more confusing (as people are already familiar with the usual broken ordering). [All Polish-language wikis already have the correct uca-pl ordering deployed, this is just an example.]
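The three orderings contrasted in the Polish example can be sketched like this. It is a toy illustration only: the keys below are hand-written stand-ins rather than the actual 'uppercase', 'uca-default' or 'uca-pl' implementations, `merged_key` simply encodes the "L and Ł treated as the same letter" behaviour described in the comment, and the sample words are invented.

```python
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
POLISH_RANK = {letter: pos for pos, letter in enumerate(POLISH_ALPHABET)}


def codepoint_key(word: str) -> str:
    """'Uppercase'-style ordering: compare upper-cased code points."""
    return word.upper()


def merged_key(word: str) -> str:
    """Treats Ł as if it were L (the behaviour described as wrong above)."""
    return word.upper().replace("Ł", "L")


def polish_key(word: str):
    """Rank each letter by its position in the Polish alphabet (sketch only)."""
    return [POLISH_RANK.get(ch, len(POLISH_RANK)) for ch in word.lower()]


words = ["Łaba", "Lublin", "Zakopane", "Malbork"]

print(sorted(words, key=codepoint_key))  # Ł after Z
print(sorted(words, key=merged_key))     # Ł mixed in with L
print(sorted(words, key=polish_key))     # Ł between L and M
```

Only the last ordering matches the alphabet a Polish reader expects; the first two are wrong in different ways, which is the point being made.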

(To provide another entertaining example with no diacritics involved: "CH" in the Czech alphabet is sorted between "H" and "I".)
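A minimal sketch of what treating "CH" as one letter means for a sort key. Assumptions: diacritics are left out entirely, the letter list is a simplified Czech order, and the sample words are chosen only to show the effect; this is not how the UCA tailoring is implemented.

```python
# Simplified Czech letter order without diacritics; "ch" is one letter
# placed between "h" and "i".
CZECH_LETTERS = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
                 "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w",
                 "x", "y", "z"]
CZECH_RANK = {letter: pos for pos, letter in enumerate(CZECH_LETTERS)}


def czech_key(word: str):
    """Build a key that consumes the digraph "ch" as a single letter."""
    lowered = word.lower()
    key = []
    i = 0
    while i < len(lowered):
        if lowered.startswith("ch", i):
            key.append(CZECH_RANK["ch"])
            i += 2
        else:
            # Unknown characters sort after all known letters.
            key.append(CZECH_RANK.get(lowered[i], len(CZECH_RANK)))
            i += 1
    return key


words = ["chata", "cihla", "hrad", "idea", "duben"]

print(sorted(words))                 # letter-by-letter: "chata" first
print(sorted(words, key=czech_key))  # digraph-aware: "chata" after "hrad"
```

A per-character comparison cannot express this, because the position of "c" depends on whether an "h" follows it.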

What's actually "entertaining" on that?

w:Ch (digraph) states that it is treated as a letter of its own but is no longer commonly used for collation purposes.

Well, perhaps the proper adjective would be "interesting" or "important to bear in mind" then...
I wouldn't dare say that the German alphabet is "entertaining" because it has ß, or that any other alphabet is, for whatever reason. Alphabets are long-standing parts of national cultures, and there are reasons why they have developed into the forms they have today. (Cf. the Czech alphabet having both "ú" and "ů" for IPA [u:], and it has its reasons.)
Please weigh your words in such cases next time, thank you.

Indeed, random other alphabets also have entertaining aspects. (I consider human languages highly entertaining in general). My point was merely that it's not only about the sort order of single letters, but also digraphs making things more complex.

@matmarex: I'm sure there are lots of cases where uca-default isn't an accurate collation for the language, but are there any cases where it's actually worse than uppercase collation? At least with uca-default you can have numeric sorting (T8948), which is a highly requested feature from the community. Regarding Cyrillic, it looks like a lot of the Cyrillic languages have already switched over: Belarusian, Serbian, Russian, Ukrainian. Bulgarian isn't switched over, but there is a uca-bg collation available.

I honestly don't know. I'm not a linguist. My personal opinion about Polish is that 'uca-default' is worse than the simple 'uppercase' collation (and objectively, it is definitely very wrong, but which kind of wrongness one prefers might differ). I would be wary of just switching everyone to uca-default, since all the accents/diacritics it ignores are sometimes part of cultural identity and this could rub people the wrong way (I have no specific examples at the moment, sorry).

If it is just about numeric sorting, that would be easy enough to implement on top of 'uppercase'. I've been thinking about this on-and-off for weeks already, so right now I could write a proof-of-concept in a couple hours ;) But I agree that it would be much better to couple this with switching to an appropriate UCA collation.
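A proof-of-concept along those lines might look like the following. This is a sketch under assumptions: the 'uppercase' behaviour is approximated by upper-casing the title, the function name is hypothetical, and the sample titles are invented. It is not the implementation that was eventually deployed.

```python
import re

_DIGIT_RUNS = re.compile(r"(\d+)")


def numeric_uppercase_key(title: str):
    """Upper-case the title, then compare digit runs by numeric value.

    re.split with a capturing group alternates text and digit runs, so
    even positions are always strings and odd positions always integers,
    which keeps element-wise comparison between keys type-safe.
    """
    parts = _DIGIT_RUNS.split(title.upper())
    return [int(part) if index % 2 else part
            for index, part in enumerate(parts)]


titles = ["Apollo 10", "Apollo 9", "Apollo 11", "apollo 2"]

print(sorted(titles, key=str.upper))              # 10 and 11 sort before 2
print(sorted(titles, key=numeric_uppercase_key))  # 2, 9, 10, 11
```

One design consequence of this sketch is that numbers with leading zeros, such as "007" and "7", produce equal keys, so a real implementation would need a tie-breaker.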

Danny_B moved this task from Tag to Should be Goal instead on the Tracking-Neverending board. Jul 10 2016, 10:48 PM

Quiddity removed a parent task: T4007: [DO NOT USE] Tracking bug [superseded by #Tracking]. Jul 14 2016, 3:20 AM

• Phabricator_maintenance added a project: Goal. Aug 13 2016, 8:42 PM

• Phabricator_maintenance renamed this task from Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki (tracking) to Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki. Aug 13 2016, 10:11 PM

• Phabricator_maintenance removed a project: Tracking-Neverending.

Liuxinyu970226 mentioned this in T32673: Implement central locale-specific, or tailored, sorting framework (tracking). Mar 20 2017, 11:06 AM

Amire80 subscribed. Sep 24 2017, 11:05 AM

jhsoby-WMNO added a subtask: T181503: Add proper category collation for the Northern Sami Wikipedia. Dec 15 2017, 2:09 AM

He7d3r mentioned this in T37632: Set $wgCategoryCollation to 'uca-default' and rebuild category sort keys on Portuguese Wikipedia. Dec 17 2017, 5:14 PM

Ebrahim mentioned this in T362494: Enable numerical category sorting on Commons. Jun 6 2024, 7:18 PM