
When using UCA collations, Persian digits (۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹) sorted under Western Arabic digits' (0 1 2 3 4 5 6 7 8 9) headings
Closed, Resolved · Public

Description

Quoting bug 55565 comment #11:
https://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%D8%B5%D9%88%D8%B1%D8%AA_%D9%81%D9%84%DA%A9%DB%8C_%D8%A8%D8%B1%D9%87
It seems that every digit at the start of a page title is being converted to
Arabic digits. We shouldn't see '1' '2' '3' (Arabic digits); we should see '۱' '۲'
'۳' (Persian digits) instead. This is reproducible on all categories, and also on
ckbwiki
https://ckb.wikipedia.org/w/index.php?title=%D9%BE%DB%86%D9%84:%DA%95%DB%86%DA%98%DB%95%DA%A9%D8%A7%D9%86%DB%8C_%D8%B3%D8%A7%DA%B5&action=edit&redlink=1
which uses Arabic-Indic digits.


This might also affect other numeral systems; I didn't test.


Version: 1.22.0
Severity: normal

Details

Reference
bz55630

Event Timeline

bzimport raised the priority of this task to Normal. (Nov 22 2014, 2:28 AM)
bzimport set Reference to bz55630.
matmarex created this task. (Oct 11 2013, 4:47 PM)

If they have the same primary weight, we could just remove the Latin digits from the first letters and add the Farsi ones on a per-language basis.

Yeah, that'll probably work, but I'm wondering why it started happening now, after a supposedly minor package upgrade.

allkeys.txt entries for '1' and '۱':

0031 ; [.159A.0020.0002.0031] # DIGIT ONE
06F1 ; [.159A.0020.0002.06F1][.0000.0166.0002.06F1] # EXTENDED ARABIC-INDIC DIGIT ONE

Same primary weight.
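To make the comparison concrete, here is a minimal Python sketch that extracts the primary weights from the two allkeys.txt lines quoted above. The regular expression assumes the four-field element layout shown in those lines, and the helper name `primary_weights` is made up for illustration; this is not how ICU or MediaWiki read the file.

```python
import re

# The two allkeys.txt lines quoted above, copied verbatim.
ENTRIES = """\
0031 ; [.159A.0020.0002.0031] # DIGIT ONE
06F1 ; [.159A.0020.0002.06F1][.0000.0166.0002.06F1] # EXTENDED ARABIC-INDIC DIGIT ONE
"""

# One collation element: [.PRIMARY.SECONDARY.TERTIARY.FOURTH]
# (a leading '*' instead of '.' marks a variable element in this file format).
ELEMENT = re.compile(
    r"\[[.*]([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)

def primary_weights(line):
    """Return (code point, non-zero primary weights) for one allkeys.txt entry."""
    body = line.split("#", 1)[0]              # drop the trailing comment
    code_point, elements = body.split(";", 1)
    primaries = [int(m.group(1), 16) for m in ELEMENT.finditer(elements)]
    # A zero primary weight means the element is ignorable at the primary level.
    return code_point.strip(), [p for p in primaries if p != 0]

weights = dict(primary_weights(l) for l in ENTRIES.splitlines() if l.strip())
for code_point, primaries in weights.items():
    print(code_point, [f"{p:04X}" for p in primaries])
```

Both entries yield the single primary weight 159A; the second element of 06F1 only contributes at the secondary level, which is why the two digits land under the same first-letter heading.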

Trying to list each digit for each language IMO makes little sense (grepping the allkeys.txt file for "DIGIT ONE" yields 60 results).

I think we could use Language#formatNum() for each digit instead and replace Latin ones with localized ones in IcuCollation#getFirstLetterData (per Brian's suggestion), after applying $tailoringFirstLetters.
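The idea can be sketched in Python as follows. This is only an illustration of the replacement step, not the MediaWiki code: `localise_first_letters` is a hypothetical helper, and the Persian digit string stands in for whatever Language#formatNum() would produce for the collation's language.

```python
LATIN_DIGITS = "0123456789"
# Persian digits U+06F0..U+06F9, built from code points to avoid typing mistakes.
PERSIAN_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))

def localise_first_letters(first_letters, localised_digits):
    """Replace Latin digit headings with localised ones, keeping order.

    first_letters: list of heading strings (assumed already tailored).
    localised_digits: ten-character string, index i is the glyph for digit i.
    """
    table = str.maketrans(LATIN_DIGITS, localised_digits)
    result = []
    for letter in first_letters:
        mapped = letter.translate(table)
        # Skip duplicates in case the localised digit was already in the list.
        if mapped not in result:
            result.append(mapped)
    return result

headings = localise_first_letters(["0", "1", "2", "A", "B"], PERSIAN_DIGITS)
print(headings)
```

Because the Latin and localised digits share a primary weight, swapping the heading text does not change which pages sort under which heading; only the label changes.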

Be aware that we use different Unicode code points for digits on ckb.wiki:
٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
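For reference, a short Python snippet using the standard `unicodedata` module shows how the two digit sets differ: the Persian digits used on fa.wikipedia are U+06F0 to U+06F9, while the Arabic-Indic digits above are U+0660 to U+0669. The variable names are only for illustration.

```python
import unicodedata

# Persian (Extended Arabic-Indic) digits, as used on fa.wikipedia.
PERSIAN = [chr(0x06F0 + i) for i in range(10)]
# Arabic-Indic digits, as used on ckb.wikipedia.
ARABIC_INDIC = [chr(0x0660 + i) for i in range(10)]

for p, a in zip(PERSIAN, ARABIC_INDIC):
    # Both characters carry the same numeric value but are distinct code points.
    print(
        f"U+{ord(p):04X} {unicodedata.name(p)} | "
        f"U+{ord(a):04X} {unicodedata.name(a)} | "
        f"value {unicodedata.digit(p)}"
    )
```

Several of these glyphs look alike in many fonts, so comparing code points is the reliable way to tell which set a page is using.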

Change 89488 had a related patch set uploaded by Bartosz Dziewoński:
IcuCollation: Sort digits under localised digits' headings

https://gerrit.wikimedia.org/r/89488

Bah, and of course ckb.wp has to use the 'uca-fa' collation, because otherwise it would be too easy to fix.

My patch above doesn't handle this case, because I don't see how we could do it without creating a faux collation for ckb and if()-ing it (which would be ugly), or using wiki language instead of collation language (which would be unexpected). Input welcome.

I guess we could apply the digit transformation on rendering a numeric section header in the category page, instead of in the collation. Not sure if that's really a good idea though.

There could be a "ckb" collation (i.e. not uca-ckb), with class name CollationCkb, which is a subclass of IcuCollation. You could have IcuCollation::getDigitTransformTable(), which is overridden by the subclass. CollationCkb::__construct() would call parent::__construct('fa').

Doing it that way means that when ICU adds support for ckb, migration from ckb to uca-ckb can be done without breaking the wiki.
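A rough Python analogy of this design is sketched below. The class and method names mirror the ones proposed above, but the bodies, the `DIGIT_TABLES` data, and the `heading_for` helper are assumptions made for illustration; the real classes are PHP and do considerably more.

```python
LATIN = "0123456789"

# Hypothetical per-language digit tables (Latin digit -> localised digit).
DIGIT_TABLES = {
    "fa": {d: chr(0x06F0 + i) for i, d in enumerate(LATIN)},   # Persian digits
    "ckb": {d: chr(0x0660 + i) for i, d in enumerate(LATIN)},  # Arabic-Indic digits
}

class IcuCollation:
    def __init__(self, locale):
        # Locale handed to ICU for the actual sort keys.
        self.locale = locale

    def get_digit_transform_table(self):
        # Default: digits of the collation's own locale.
        return DIGIT_TABLES.get(self.locale, {})

    def heading_for(self, first_letter):
        # Illustrative helper: map a Latin digit heading to its localised form.
        return self.get_digit_transform_table().get(first_letter, first_letter)

class CollationCkb(IcuCollation):
    def __init__(self):
        # Sort with the 'fa' rules, since ICU has no ckb tailoring here...
        super().__init__("fa")

    def get_digit_transform_table(self):
        # ...but label digit headings with ckb's own digits.
        return DIGIT_TABLES["ckb"]

print(IcuCollation("fa").heading_for("1"), CollationCkb().heading_for("1"))
```

Keeping the override in a subclass means the sorting locale and the heading digits can diverge for ckb without adding a special case to the base class.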

Or if the problem is likely to be repeated with other languages, there could be some regex-based alias feature in Collation::factory(), e.g. "alias-ckb/fa", where the collation name would specify both the ICU locale and the MW locale.
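As a sketch of what such an alias scheme might look like, the Python snippet below parses names of the form "alias-ckb/fa". The reading that the first part is the MW locale and the second the ICU locale is an assumption drawn from the example name, and `parse_collation_name` is a hypothetical helper, not part of Collation::factory().

```python
import re

# Assumed shape: "alias-<MW locale>/<ICU locale>", e.g. "alias-ckb/fa".
ALIAS = re.compile(r"^alias-(?P<mw>[a-z][a-z-]*)/(?P<icu>[a-z][a-z-]*)$")
# Plain UCA collation names such as "uca-fa" use one locale for both roles.
UCA = re.compile(r"^uca-(?P<locale>[a-z][a-z-]*)$")

def parse_collation_name(name):
    """Return (icu_locale, mw_locale), or None if the name matches neither form."""
    match = ALIAS.match(name)
    if match:
        return match.group("icu"), match.group("mw")
    match = UCA.match(name)
    if match:
        return match.group("locale"), match.group("locale")
    return None

print(parse_collation_name("alias-ckb/fa"))
print(parse_collation_name("uca-fa"))
```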

Change 95867 had a related patch set uploaded by Bartosz Dziewoński:
IcuCollation: Add CollationCkb subclass for Sorani Kurdish

https://gerrit.wikimedia.org/r/95867

(In reply to comment #9)

> There could be a "ckb" collation (i.e. not uca-ckb), class name CollationCkb
> which is a subclass of IcuCollation. You could have
> IcuCollation::getDigitTransformTable() which is overridden by the subclass.
> CollationCkb::__construct() would call parent::__construct('fa').

I implemented this in the patch above (which depends on the previous patch, https://gerrit.wikimedia.org/r/89488).

> Or if the problem is likely to be repeated with other languages, there could
> be some regex-based alias feature in Collation::factory(), e.g. "alias-ckb/fa",
> where the collation name would specify both the ICU locale and the MW locale.

I did not implement this; hopefully it will never be needed, because it sounds bad. :) But if we ever need it, it won't be hard to migrate.

We're still working on it :) Both of my patches are waiting to be re-reviewed.

Change 89488 merged by jenkins-bot:
IcuCollation: Sort digits under localised digits' headings

https://gerrit.wikimedia.org/r/89488

Change 95867 merged by jenkins-bot:
IcuCollation: Add CollationCkb subclass for Sorani Kurdish

https://gerrit.wikimedia.org/r/95867

Change 101005 had a related patch set uploaded by Bartosz Dziewoński:
(bug 55630) $wgCategoryCollation = 'xx-uca-ckb' for ckbwiki

https://gerrit.wikimedia.org/r/101005

Status update: Tim merged the two patches. Thanks!

  • This means that category headings on fa.wikipedia and other wikis using languages with localised digits will start behaving correctly as soon as the patches are deployed, which will happen on 19 December (according to [[mw:MediaWiki_1.23/Roadmap]]).
  • ckb.wikipedia is troublesome because it's currently using a collation meant for 'fa'; my configuration patch above fixes that as well. (If it were not deployed, 'fa' digits would be used instead of 'ckb' digits.)

I'll leave this open for a while longer until everything is sorted out.

Change 101005 merged by jenkins-bot:
(bug 55630) $wgCategoryCollation = 'xx-uca-ckb' for ckbwiki

https://gerrit.wikimedia.org/r/101005

Looking at links from comment 0, everything seems to be in order now. Thanks for the help and reports, everyone!

Calak added a comment. (Dec 19 2013, 8:55 PM)

Thank you very much Bartosz Dziewoński.