PolishCollation class
Polish Wikipedia needs correct Polish language sorting.
The IcuCollation class (which can be enabled using $wgCategoryCollation = 'uca-default') doesn't cut it: regular ASCII letters and their variants with diacritics are treated as equivalent when sorting and are displayed together under the ASCII heading, which is incorrect in Polish.
The full Polish alphabet is: AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ, in this order. That is, the 26 basic Latin letters (ABCDEFGHIJKLMNOPQRSTUVWXYZ) plus ĄĆĘŁŃÓŚŹŻ, minus Q, V and X.
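To make the expected ordering concrete, here is a minimal standalone sketch in Python. It is not the attached class and not MediaWiki code; the names POLISH_ALPHABET, polish_sort_key and first_letter_heading are hypothetical, and the handling of characters outside the Polish alphabet (sorted after all Polish letters, by code point) is a simplifying assumption.

```python
# Polish alphabet in the order given in this report (no Q, V, X).
POLISH_ALPHABET = "AĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻ"

# Rank of each letter, so that e.g. "Ć" sorts right after "C" and before "D".
RANK = {letter: index for index, letter in enumerate(POLISH_ALPHABET)}


def polish_sort_key(title):
    """Build a sort key that follows the Polish alphabet, ignoring case.

    Assumption for this sketch: any character outside the Polish alphabet
    sorts after all Polish letters, ordered by code point.
    """
    key = []
    for ch in title.upper():
        if ch in RANK:
            key.append((0, RANK[ch]))
        else:
            key.append((1, ord(ch)))
    return key


def first_letter_heading(title):
    """Heading a title is listed under: its own first letter, diacritics kept."""
    return title[:1].upper()


titles = ["Łódź", "Lublin", "Ćmielów", "Cieszyn", "Żywiec", "Zakopane", "Luzino"]
ordered = sorted(titles, key=polish_sort_key)
headings = [first_letter_heading(t) for t in ordered]

print(ordered)
# ['Cieszyn', 'Ćmielów', 'Lublin', 'Luzino', 'Łódź', 'Zakopane', 'Żywiec']
print(headings)
# ['C', 'Ć', 'L', 'L', 'Ł', 'Z', 'Ż']
```

The point of the sketch is that "Ć", "Ł" and "Ż" are separate letters with their own position and their own heading, instead of being folded into "C", "L" and "Z".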
IcuCollation doesn't seem to accept a 'pl' or 'pl_PL' argument, so I have implemented a small class deriving from IcuCollation (attached). Since it looks like I'm the first one to try something like this, somebody with more MW experience will have to figure out where to put it in the code: right in includes/Collation.php, or maybe in an extension? Also, I have only tested it with Latin-based alphabets. It *should* work just as well with other ones, but you guys might want to test this.
Additionally, for a reason I can't explain, IcuCollation, as well as my class derived from it, insists on sorting articles starting with "A" under "⅍". I have no idea why this happens or how to fix it properly.
Here's a testwiki where I implemented it, and two accordingly sorted categories:
- http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Test
  - This one simply contains some bot-generated articles and some weird tests. It's a good basic testcase, showing interactions between ASCII characters, Polish diacritics and other non-ASCII chars (I used some German umlauted letters).
- http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Polscy_aktorzy_filmowi
  - This is simply one full category imported from pl.wikipedia, containing Polish movie actors. On http://users.v-lo.krakow.pl/~matmarex/testwiki/index.php?title=Kategoria:Polscy_aktorzy_filmowi&pageuntil=Machnicki%2C+Ireneusz%0AIreneusz+Machnicki one can see the behavior of "L" and "Ł".
As mentioned before, the collation class is attached. To make it work, I patched includes/Collation.php, adding case 'polish': return new PolishCollation( 'root' ); to Collation::factory, and then set $wgLanguageCode = "pl"; and $wgCategoryCollation = 'polish';.
I'd greatly appreciate feedback, or maybe a simpler, less hacky solution :)
(See also: bug 29788, the same thing for Swedish)
Version: 1.21.x
Severity: normal
Attached: