Implement locale-specific sorting for Polish language
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	matmarex
	Nov 24 2012, 3:24 PM

Description

PolishCollation class

Polish Wikipedia needs correct Polish language sorting.

The IcuCollation class (that can be enabled using $wgCategoryCollation = 'uca-default') doesn't cut it – regular ASCII letters and their variants with diacritics are treated as the same ones in sorting, and are displayed under the ASCII heading together, which is incorrect in Polish.

The full Polish alphabet is: AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ, in this order, that is ABCDEFGHIJKLMNOPQRSTUVWXYZ + ĄĆĘŁŃÓŚŹŻ - QVX.

IcuCollation doesn't seem to accept a 'pl' or 'pl_PL' argument, so I have implemented a small class deriving from IcuCollation (attached), but since it looks like I'm the first one to try something like this, somebody with more MW experience will have to figure out where to stick it in the code – right in includes/Collation.php or maybe in an extension? Also, I have only tested it with Latin-based alphabets – it *should* work just right with other ones, but you guys might want to test this.

Additionally, for a reason that is to me inexplicable, IcuCollation, as well as my derivation, insists on sorting articles starting with "A" under "⅍". I have no idea why this happens or how to change it properly.

Here's a testwiki where I implemented it, and two accordingly sorted categories:

http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Test
- This one simply contains some bot-generated articles and some weird tests. It's a good basic testcase, showing interactions between ASCII characters, Polish diacritics and other non-ASCII chars (I used some German umlauted letters).

http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Polscy_aktorzy_filmowi
- This is simply one full category imported from pl.wikipedia, containing Polish movie actors. On http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Polscy_aktorzy_filmowi&pageuntil=Machnicki%2C+Ireneusz%0AIreneusz+Machnicki one can see the behavior of "L" and "Ł".

As mentioned before, the collations class is attached; to make it work, I simply patched includes/Collation.php, adding case 'polish': return new PolishCollation( 'root' ); to Collation::factory, setting $wgLanguageCode = "pl"; and $wgCategoryCollation = 'polish';.

I'd greatly appreciate feedback, or maybe a simpler, less hacky solution :)

(See also: bug 29788, the same thing for Swedish)

Version: 1.21.x
Severity: normal

Attached:

PolishCollation.php1 KBDownload

Details

Reference: bz42412

Related Objects
Search...

Status	Subtype	Assigned	Task
Open		None	T32996 Change $wgCategoryCollation values to appropriate one for each Wikimedia wiki
Open	Feature	None	T47443 Deploy language-specific "uca-xx" collations on Wikimedia wikis
Resolved		None	T44413 Set $wgCategoryCollation to 'uca-pl' on Polish Wikipedia and rebuild category sort keys
Resolved		matmarex	T47968 Set $wgCategoryCollation to 'uca-pl' on Polish Wikivoyage and rebuild category sort keys
Open		None	T32672 Use locale-specific sorting (tracking)
Open		None	T32754 Use correct sorting for in prefix searches
Open		None	T32673 Implement central locale-specific, or tailored, sorting framework (tracking)
Resolved		matmarex	T44412 Implement locale-specific sorting for Polish language

Event Timeline

• bzimport raised the priority of this task from to Medium.Nov 22 2014, 12:55 AM

• bzimport added projects: MediaWiki-Categories, I18n.

• bzimport set Reference to bz42412.

matmarex created this task.Nov 24 2012, 3:24 PM

Hi. Thanks for the patch. [I don't have time to look at it right now, but hopefully me or someone else will shortly].

Additionally, for a reason that is to me inexplicable, IcuCollation, as well as
my derivation, insists on sorting articles starting with "A" under "⅍". I have
no idea why this happens or how to change it properly.

This is a guess, but: For the IcuCollation, there is a first-letter.ser (named something like that) file which determines how things are sorted under first letter. This file needs to be kept consist with the collation. Presumably you're collation is using that file, when a new one needs to be generated for it (There's scripts to generate the file in the maintinance sub directory. Off the top of my head I'm unsure how tied they are to ICUCollation or if they would work directly with yours).

I just noticed that my patch has a pretty stupid bug – it only handles first letters of words for sorting, instead of every letter. Some unicode-aware preg_replace would fix this.

Additionally, for a reason that is to me inexplicable, IcuCollation, as well as
my derivation, insists on sorting articles starting with "A" under "⅍". I have
no idea why this happens or how to change it properly.

btw, for anyone else following along - that issue is split to bug 43740

Actual patch:

I don't think this is the way that extending uca-default was envisioned of working. I believe the intention was more to use intl's built in support for different language collations [via new IcuCollation( 'pl' ) ] and add a first-letters-pl.ser file. This would probably integrate the sorting of polish letters with other types of letters more seamlessly. However, how to generate a first-letters-pl.ser file is a bit of an open question at the moment, and probably requires a much expanded maintenance/languages/generateCollationData.php file. [Although I have a vague idea how to make one that would imperfectly, but probably acceptably by hand]

With the actual patch, the getFirstLetter takes a binary string. There's no guarantee that the icu collation won't use a binary code that is also a code point for one of the "polish" letters (in practise that's probably rare though). Additionally since its a binary string, its not guaranteed to be valid UTF-8, and I'm not sure how mb_substr would handle invalid utf-8.

(In reply to comment #3)

I believe the intention was more to use intl's built in support for
different language collations [via new IcuCollation( 'pl' ) ] and add a
first-letters-pl.ser file.

As noted in the zeroth comment, IcuCollation doesn't seem to accept a 'pl' or 'pl_PL' argument for me. I tried that first, of course.

Does either of these work for you? Maybe I just need more PHP libs...

(In reply to comment #4)

As noted in the zeroth comment, IcuCollation doesn't seem to accept a 'pl' or
'pl_PL' argument for me. I tried that first, of course.

Per out IRC discussion, it turns out it actually does, but it also needs the .ser file with a 'pl' suffix. This makes the attachment essentially useless.

I just realized I said some incorrect things in comment 3.

(In reply to comment #3)

With the actual patch, the getFirstLetter takes a binary string. There's no
guarantee that the icu collation won't use a binary code that is also a code
point for one of the "polish" letters (in practise that's probably rare
though). Additionally since its a binary string, its not guaranteed to be
valid
UTF-8, and I'm not sure how mb_substr would handle invalid utf-8.

This is all incorrect. getFirstLetter takes the human readable sortkey.

This makes the attachment essentially useless.

Not entirely useless, after all the patch works now, where we don't have .ser files for pl yet. But Ideally we'd be doing this with the .ser file.

So the patch here is crap, but I submitted I838484b9 to fix this properly.

Merged.

Implement locale-specific sorting for Polish languageClosed, ResolvedPublicActions