
Category collation for Estonian projects
Closed, Resolved, Public

Description


Version: unspecified
Severity: enhancement
URL: http://unicode.org/cldr/trac/ticket/6701

Details

Reference
bz54168

Event Timeline

bzimport raised the priority of this task from to Normal.Nov 22 2014, 1:50 AM
bzimport set Reference to bz54168.
Pikne created this task.Sep 16 2013, 3:34 PM
Pikne added a comment.Sep 16 2013, 4:00 PM

Pages in etwiki, etwikisource, etwikiquote and etwikibooks categories should be ordered as the letters are ordered in the Estonian alphabet. (Not sure about etwiktionary, where page names are in all languages; this probably needs further discussion.)

I assume this is done by setting $wgCategoryCollation to uca-et. *But* I don't know what exactly is behind this setting in the current version, or where I can check it. Some UCA-related web pages suggest that UCA for Estonian sorts words beginning with the letter 'W' under 'V'. This is wrong. If this is the case for our uca-et setting too, then it should be changed so that 'W' is sorted as a separate letter after 'V' and before 'Õ'.

This is the chart for uca-et: http://collation-charts.org/icu442/icu442-et.html (things on the same line are roughly considered the same letter. This chart is for a different version of UCA than we use. I'll check later, when I have my laptop, whether things are different for later versions)

Assuming that this chart is still right for later versions of UCA, v and w are considered to have a secondary difference. This basically means they are considered the same unless there is a tie, and if there is a tie, v comes first.

To be clear, it should be sorted as "V: Vatikan, Volga | W: Wales, Windsor" and not "V: Wales, Vatikan, Windsor, Volga". E.g. see the place name section of the dictionary of standard Estonian: http://www.eki.ee/dict/qs/kohanimed.html
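
To illustrate what a secondary difference does to the order, here is a minimal Python sketch. It is a toy two-level sort key, not the real ICU/CLDR data: the `ALPHABET` string, `RANK` table and `sort_key` function are all hypothetical names made up for this example, and they only mimic the behaviour the chart describes.

```python
# Toy two-level collation (an assumption for illustration, NOT real ICU data).
# "w" is folded into "v" at the primary level and only differs at the
# secondary level, as the uca-et chart suggests.
ALPHABET = "abcdefghijklmnopqrsšzžtuvõäöüxy"  # Estonian order, "w" left out
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def sort_key(word):
    """Return (primary, secondary) weights for a word."""
    lowered = word.lower()
    # Primary level: treat "w" exactly like "v".
    primary = [RANK["v" if ch == "w" else ch] for ch in lowered]
    # Secondary level: only consulted on a primary tie; "v" wins over "w".
    secondary = [1 if ch == "w" else 0 for ch in lowered]
    return (primary, secondary)


names = ["Volga", "Wales", "Vatikan", "Windsor"]
current_order = sorted(names, key=sort_key)
print(current_order)

# A true tie at the primary level: the secondary level puts "v" first.
print(sorted(["West", "Vest"], key=sort_key))
```

With these toy weights the W names are interleaved with the V names, which is the unwanted order described above; the secondary level only matters for pairs such as Vest/West that are otherwise identical.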

If v and w are treated as in the chart referenced above, then I assume you can modify this here as the chart was modified for Finnish in bug 46330?

(In reply to comment #2)

This is the chart for uca-et:
http://collation-charts.org/icu442/icu442-et.html
(things on the same line are roughly considered the same letter. This chart is
for a different version of UCA than we use. I'll check later, when I have my
laptop, whether things are different for later versions)

It appears this chart is still accurate (For reference for myself, since I can never find it, most recent version of icu library rules is at https://ssl.icu-project.org/repos/icu/icu/trunk/source/data/coll/et.txt )


If v and w are treated as in the chart referenced above, then I assume you can
modify this here as the chart was modified for Finnish in bug 46330?

Finnish had an issue with the section headings. The chart itself wasn't modified.

We don't have the ability to do custom charts at the moment (The functionality is supported in ICU library, but PHP's intl library doesn't expose it to us).

We could maybe do something hacky like replace "W" with U+1D21 ('LATIN LETTER SMALL CAPITAL W' - which does not get sorted like "V" in uca-et collation) just for the sorting.
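
A rough Python sketch of that idea, under stated assumptions: `toy_sort_key` stands in for the real uca-et collator (which is not reproduced here), and placing U+1D21 in its own slot between 'v' and 'õ' is an assumption made only so the example runs. The names `mangle`, `toy_sort_key` and `workaround_sort_key` are hypothetical and are not taken from any actual patch.

```python
SMALL_CAPITAL_W = "\u1d21"  # LATIN LETTER SMALL CAPITAL W

# Toy primary ordering (an assumption, not ICU data): "w" is folded into "v",
# while U+1D21 gets its own position between "v" and "õ".
ALPHABET = "abcdefghijklmnopqrsšzžtuv" + SMALL_CAPITAL_W + "õäöüxy"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def toy_sort_key(text):
    """Stand-in for the collator: sorts "w" as if it were "v"."""
    return [RANK["v" if ch == "w" else ch] for ch in text.lower()]


def mangle(text):
    """Replace both cases of W with U+1D21, for sorting purposes only."""
    return text.replace("w", SMALL_CAPITAL_W).replace("W", SMALL_CAPITAL_W)


def workaround_sort_key(text):
    """Compute the sort key on the mangled text instead of the original."""
    return toy_sort_key(mangle(text))


names = ["Volga", "Wales", "Vatikan", "Windsor"]
fixed_order = sorted(names, key=workaround_sort_key)
print(fixed_order)
```

The replacement is applied only when computing the sort key, so page titles themselves stay unchanged. One side effect of this kind of mangling is that the case of the original letter is lost in the key.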

(In reply to Bawolff (Brian Wolff) from comment #4)

We could maybe do something hacky like replace "W" with U+1D21 ('LATIN
LETTER SMALL CAPITAL W' - which does not get sorted like "V" in uca-et
collation) just for the sorting.

I tried this and it seems to work reliably. I think we could do it and drop the workaround when upstream fixes their data.

Change 147980 had a related patch set uploaded by Bartosz Dziewoński:
Collation: Workaround for incorrect collation of Estonian

https://gerrit.wikimedia.org/r/147980

Created attachment 15986
Sorting order using the patch from comment 6


I see there is an upstream report for this:
http://unicode.org/cldr/trac/ticket/6701

It was opened on the same day as this bug, with a very similar description, I assume by the same person.

Change 147980 merged by jenkins-bot:
Collation: Workaround for incorrect collation of Estonian

https://gerrit.wikimedia.org/r/147980

Hooray :D

The next step is to hold a quick discussion/vote on each wiki that would want this enabled, just to make sure nothing happens behind someone's back. Pikne, can you do that?

I set up a little testing wiki on Labs: http://estonia.wmflabs.org/ (or rather had one set up for me by Yuvi :) ). Please verify that this indeed works correctly. Feel free to create categories and pages and link to them in the on-wiki discussions. (The wiki will probably disappear when it is no longer needed.)

I already created two categories:

Pikne added a comment.Jul 22 2014, 3:08 PM

(In reply to Bartosz Dziewoński from comment #10)

The next step is to hold a quick discussion/vote on each wiki that would
want this enabled, just to make sure nothing happen behind someone's back.

I asked about this, and specifically about the v and w difference, on Estonian Wikipedia around the time I opened this bug: [[et:Vikipeedia:Üldine arutelu/Arhiiv 27#Tähestikuline järjestus kategoorias]]. There are no objections. As for Wikisource, Wikibooks and Wikiquote, the few people active there are also active on Wikipedia (and Wikipedia is where I would look for these few contributors), so I would say that we more or less have their consent as well. As for Wiktionary, I have now asked them whether they would perhaps want uca-default instead, or whether it's worthwhile to change anything there now. I think we can consider it a separate bug if there is a change on Estonian Wiktionary.

I set up a little testing wiki on Labs.

Test categories look fine.

(In reply to Tim Starling from comment #8)

It was opened on the same day as this bug, with a very similar description,
I assume by the same person.

Yes, I opened it in the hope that this brings us nearer to a solution.

Pikne added a comment.Aug 8 2014, 9:20 AM

(In reply to comment #11)

I think we can consider it a separate bug if there will be a change
on Estonian Wiktionary.

Then again, by now it seems that uca-et is good enough for Wiktionary as well: [[:et:wikt:Vikisõnastik:Üldine arutelu#Järjestus kategooriates]].

I think we can move on with setting uca-et for all Estonian projects and recomputing the sort keys.

Change 154213 had a related patch set uploaded by Bartosz Dziewoński:
Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis

https://gerrit.wikimedia.org/r/154213

I have uploaded the configuration patch, but collation config changes seem to have been on hold for a while now and I don't know why. No idea how long this is going to take :/

Change 154213 merged by jenkins-bot:
Set $wgCategoryCollation to 'xx-uca-et' on all Estonian-language wikis

https://gerrit.wikimedia.org/r/154213

Reedy added a comment.Sep 17 2014, 1:04 AM

This is done now... Any further improvements needed, or can we close the bug?

Great. Thanks for doing the hacky part and the rest.

Though, as for the upstream part of this bug, today is about the day when CLDR v26 is expected to be released, and the w and v difference should be fixed there too :)