Set $wgCategoryCollation to 'uca-hu' on Hungarian Wikipedia and rebuild category sort keys
Closed, Resolved, Public

Description

Per bug 45443, category collation should be localized on hu.wikipedia.


Version: unspecified
Severity: enhancement

Details

Reference
bz45596

Event Timeline

bzimport raised the priority of this task to Normal. Nov 22 2014, 1:37 AM
bzimport set Reference to bz45596.
Tgr created this task. Mar 1 2013, 8:20 AM
Tgr added a comment. Mar 1 2013, 8:24 AM

I reviewed Collation::$tailoringFirstLetters['hu']; it contains exactly the non-ASCII letters which can be used for first letter grouping in Hungarian.

Additionally, for all long-short vowel pairs (a – á, e – é, i – í, o – ó, ö – ő, u – ú, ü – ű) the long vowel should be treated as if it were the short one (e.g. the word "álom" should be listed under A, or the word "űr" under Ü). Since the Hungarian collation treats these pairs as equivalent, I suppose that is done automatically?
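To make the grouping rule concrete, here is a minimal Python sketch of how a first-letter heading could be chosen for Hungarian, with long vowels folded onto their short counterparts. It is only an illustration under stated assumptions: MediaWiki delegates the real work to ICU, and the names `HU_FIRST_LETTERS`, `LONG_TO_SHORT` and `first_letter_group` are hypothetical, not part of any real API.

```python
# Illustrative sketch only; not how MediaWiki or ICU implement this.

# Extra first letters for Hungarian, as listed in $tailoringFirstLetters['hu'].
HU_FIRST_LETTERS = ["CS", "DZ", "DZS", "GY", "LY", "NY", "Ö", "SZ", "TY", "Ü", "ZS"]

# Long vowels are grouped under their short counterparts.
LONG_TO_SHORT = str.maketrans("ÁÉÍÓŐÚŰ", "AEIOÖUÜ")


def first_letter_group(title: str) -> str:
    """Return the heading a title would be listed under (hypothetical helper)."""
    key = title.upper().translate(LONG_TO_SHORT)
    # Try the longest candidates first so that "DZS" wins over "DZ".
    for letter in sorted(HU_FIRST_LETTERS, key=len, reverse=True):
        if key.startswith(letter):
            return letter
    # Anything else falls back to its (folded) first character.
    return key[:1]
```

With this sketch, "álom" lands under A and "űr" under Ü, while a naive prefix match also puts "Tycho" under TY, which is the foreign-word problem raised later in the thread.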

Tgr added a comment. Mar 1 2013, 8:27 AM

For reference, Collation::$tailoringFirstLetters['hu'] contains this list:

"CS", "DZ", "DZS", "GY", "LY", "NY", "Ö", "SZ", "TY", "Ü", "ZS"

I set up a testwiki in Hungarian with uca-hu collation enabled for you: http://users.v-lo.krakow.pl/~matmarex/testwiki-hu

Feel free to link it on-wiki and use it however you want, just be aware that it won't stay up forever after this bug is closed :)

And I filled a test category with some letters and symbols: http://users.v-lo.krakow.pl/~matmarex/testwiki-hu/index.php?title=Kateg%C3%B3ria:Test

It seems to work correctly at first glance, but I don't speak Hungarian :)

Tgr added a comment. Mar 2 2013, 2:41 PM

Seems good to me, but I'll ask more knowledgeable people as well. Will there be a way to override the default placements? Foreign words should not be categorized under a digraph even if they are written the same way (e.g. the Tycho crater should be under T, not TY); there is of course no way to automate that, but something like {{#DEFAULTSORT:Tycho|T}} would be nice.

(In reply to comment #5)

> Seems good to me, but I'll ask more knowledgeable people as well. Will there
> be a way to override the default placements? Foreign words should not be
> categorized under a digraph even if they are written the same way (e.g. the
> Tycho crater should be under T, not TY); there is of course no way to
> automate that, but something like {{#DEFAULTSORT:Tycho|T}} would be nice.

Good point, I didn't think of that. I tested and this seems possible by using a [[zero-width non-joiner]]: you could place a {{DEFAULTSORT:T‌ycho}} (with an invisible zero-width non-joiner, U+200C, between the "T" and the "y") on the page with that name to force it to behave correctly. This forces the "t" and "y" to be considered separately, and the non-joiner itself has no effect during sorting. (See that test category again.)
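The effect of the non-joiner can be shown with a toy model in Python. This is a sketch of the behavior reported above, not of ICU's actual contraction handling; `DIGRAPHS`, `heading` and `comparison_text` are hypothetical names introduced for the illustration.

```python
# Toy model of the reported behavior; ICU's real algorithm is more involved.

ZWNJ = "\u200c"  # zero-width non-joiner

# Hungarian multi-letter groups, longest first so "DZS" is tried before "DZ".
DIGRAPHS = ("DZS", "CS", "DZ", "GY", "LY", "NY", "SZ", "TY", "ZS")


def heading(sort_key: str) -> str:
    """Pick the first-letter heading for a sort key (hypothetical helper)."""
    upper = sort_key.upper()
    for digraph in DIGRAPHS:
        if upper.startswith(digraph):
            return digraph
    return upper[:1]


def comparison_text(sort_key: str) -> str:
    """Text used for ordering: the non-joiner itself carries no weight."""
    return sort_key.replace(ZWNJ, "")


plain = "Tycho"
forced = "T" + ZWNJ + "ycho"  # what {{DEFAULTSORT:...}} would contain
```

Here `heading(plain)` yields the TY group, whereas `heading(forced)` yields T because the non-joiner breaks up the digraph; `comparison_text` returns the same string for both, so the relative order among other T entries is unchanged.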

With Scribunto/Lua now being deployed, this could easily be made into a template, looking and behaving somewhat like [[Template:lowercase title]], so that editors wouldn't have to worry about the strange syntax.


Also, please hold a community discussion/vote on the Hungarian Wikipedia about this change, even if it's just a formality. I am not a WMF employee, but their policy is clear: a configuration change (especially one this disruptive) can only be made if there's obvious consensus. You can link the Hungarian testwiki I created there.

There's no hurry, especially since this change can only be made after MW 1.21wmf11 is deployed on March 13.

Here's a very similar vote/discussion I created on pl.wikipedia, regarding the same change, but for Polish: short explanation, voting and comments with yes/no icons.

https://pl.wikipedia.org/wiki/Wikipedia:PR#Zmiana_konfiguracji_.E2.80.93_w.C5.82.C4.85czenie_poprawnego_sortowania_artyku.C5.82.C3.B3w_na_stronach_kategorii

Samat added a comment. Mar 2 2013, 4:54 PM

Thank you for your effort! It will be a long-awaited (~9 years) bug fix on the Hungarian Wikipedia.

Tgr added a comment. Mar 3 2013, 5:20 PM

Thanks! &zwnj; looks like a good solution. Would it be possible to make the digraphs title case (that is, "Cs" instead of "CS")?

Sorry for late reply.

(In reply to comment #8)

> Would it be possible to make the digraphs title case (that is, "Cs"
> instead of "CS")?

Should be pretty easy to do. If that's how it's supposed to be done everywhere, I think we could titlecase the digraphs in IcuCollation::$tailoringFirstLetters['hu'] and it should "just work". I can do it if it's the proper solution.
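As a rough illustration of what titlecasing the list amounts to, here is a short Python sketch. The actual fix would edit the PHP array directly; `UPPERCASE_LETTERS` and `titlecase_letters` are hypothetical names used only here.

```python
# Sketch of the transformation only; the real list lives in MediaWiki's PHP code.

UPPERCASE_LETTERS = ["CS", "DZ", "DZS", "GY", "LY", "NY", "Ö", "SZ", "TY", "Ü", "ZS"]


def titlecase_letters(letters):
    """Uppercase the first character of each heading and lowercase the rest."""
    return [letter.capitalize() for letter in letters]


TITLECASE_LETTERS = titlecase_letters(UPPERCASE_LETTERS)
```

Single-character headings such as "Ö" and "Ü" are left unchanged, while "DZS" becomes "Dzs".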

And if that's only how hu.wiki wants this to look (and uppercased digraphs are correct in general), you could use a little CSS to uppercase the first letter and lowercase the rest:

#mw-pages h3 { text-transform: lowercase; }
#mw-pages h3::first-letter { text-transform: uppercase; }

Tgr added a comment. Mar 9 2013, 4:30 PM

Yes, as far as I am aware, it should always be done that way in Hungarian installations.

I submitted Ie0ca297a to fix this (and deployed it on my testwiki).

Can you hold a quick vote (at the village pump, probably; see comment 6) to confirm that the hu.wiki community really does want this change? Just for the paper trail :)

Tgr added a comment. Mar 9 2013, 6:28 PM

I will start the on-wiki discussion shortly. A few more questions that came up:

  • will it be harder to change the rules on the fly, if they turn out to be imperfect? I understand changing the collation is difficult because one has to reindex the whole table, but I suppose changing the first letters would be simpler.
  • by the way, should we also check the collation itself? I have mostly collected input on the first letter grouping until now.
  • will it be possible to create custom groups? (e.g. someone suggested using a "Numbers" group; having separate groups for all digits looks a bit silly)
  • what is the logic for non-Hungarian characters? Accented Latin characters seem to be ordered as if the accents were stripped, which is good, but it would be nice to see the rules spelled out somewhere.

(In reply to comment #12)

> will it be harder to change the rules on the fly, if they turn out to
> be imperfect? I understand changing the collation is difficult because
> one has to reindex the whole table, but I suppose changing the first
> letters would be simpler.

Real changes to the collation will require running the update script again,
which might take a couple of hours for hu.wiki (according to Reedy's
testing, it took about 20 hours for the 3.2 million pages on pl.wikipedia).
Category sorting might be slightly borked during this time, and all category
pages will have to be purged afterwards (action=purge or just wait till the
caches expire).

Changing the first letters later won't break the collation, since it's
entirely handled by an external library (ICU); it'll require a purge to
appear on-wiki, though.

> by the way, should we also check the collation itself? I have mostly
> collected input on the first letter grouping until now.

Please do, but I'm pretty much certain it's correct; it's handled by the ICU
library, which is a battle-tested and mature piece of software.

> will it be possible to create custom groups? (e.g. someone suggested
> using a "Numbers" group; having separate groups for all digits looks a
> bit silly)

This isn't supported right now, but at first glance it looks possible; it
would likely depend on whether creating the group would require a different
sorting order. However, IMO this particular change should be done for all
projects at once, if desired, and should wait for natural number sorting to
be implemented first (bug 6948) and for multiple collation support (bug 44667;
the chinese-collation branch includes this).

> what is the logic for non-Hungarian characters? Accented Latin
> characters seem to be ordered as if the accents were stripped, which is
> good, but it would be nice to see the rules spelled out somewhere.

Yes, that's exactly what happens, and similarly for accented variants of
letters in other alphabets; I thought I mentioned that somewhere, apologies.
The default sorting rules are the ones the [[Unicode Collation Algorithm]]
uses; they are appropriately tailored for each language-specific collation.
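For readers who want to see what "ordered as if the accents were stripped" means in practice, here is a rough Python approximation. It is not the Unicode Collation Algorithm and ignores language tailoring (in the Hungarian tailoring, for instance, Ö and Ü remain separate letters); `strip_accents` and `approx_primary_key` are hypothetical helper names.

```python
# Rough approximation of accent-insensitive ordering; not a UCA implementation.
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def approx_primary_key(text: str) -> str:
    """Compare titles by base letters only, ignoring accents and case."""
    return strip_accents(text).casefold()


titles = ["Zola", "Čapek", "Dvořák"]
# Sorting by raw code points would place "Čapek" after "Zola";
# sorting by the approximate key treats "Č" like "C".
ordered = sorted(titles, key=approx_primary_key)
```

The real collation additionally uses accents and case as tie-breakers at lower levels, which this sketch leaves out.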

The default "first-letters" list includes the full basic Latin, Greek and
Cyrillic alphabets and I think all printable ASCII characters, as well as a
lot of letters from other alphabets and a whole lot of Unicode symbols. It
is generated by MediaWiki based on the data about which letters have
primary-level weight in UCA, but I'm not sure what the exact behavior is;
you can see the generation script at
/maintenance/language/generateCollationData.php in the mediawiki/core
repository, and the pregenerated list at /serialized/first-letters-root.ser.
I doubt that's relevant, though. :)

The upgrade to ICU 4.8 should be done before any more wikis start using uca-* collations.

The upgrade is done now. Submitted config change proposal as I0cfa3859.