
Implement uca-fa collation
Closed, ResolvedPublic

Description

Hi,
The sorting of characters on fa wiki and other projects is not correct; the order must be:
آ-ا-ب-پ-ت-ث-ج-چ-ح-خ-د-ذ-ر-ز-ژ-س-ش-ص-ض-ط-ظ-ع-غ-ف-ق-ک-گ-ل-م-ن-ه-و-ی
Would you please correct it?


Version: unspecified
Severity: enhancement

Details

Reference
bz30287

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 11:57 PM
bzimport set Reference to bz30287.
bzimport added a subscriber: Unknown Object (MLST).
Yamaha5 created this task. Aug 9 2011, 12:35 PM

Do these sorting problems appear on category pages or somewhere else?

(In reply to comment #1)

Do these sorting problems appear on category pages or somewhere else?

Yes, in:
1- Special: all of the special page reports
2- pagegenerator.py for bots
3- categories
4- all pages that have Wikimedia lists

re comment 6:

http://ehsanakhgari.org/article/php/persian-sorting-mysql

Sorting things on the PHP side, as that article suggests, is also probably not going to happen.


It looks like the letters that appear in fa but not in ar have code points a bit higher than the letters that are in ar (thus binary sorting by code point gives bad results).
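To illustrate the code point problem described above, here is a minimal Python sketch (not MediaWiki code). It compares plain code point sorting with a sort key built from the alphabet order requested in this task. `PERSIAN_ORDER`, `RANK` and `persian_key` are hypothetical names introduced only for this example, and the sketch handles only the 33 base letters, with no diacritics or variant forms.

```python
# Alphabet order as requested in this task, written as escapes to avoid
# any ambiguity from right-to-left rendering.
PERSIAN_ORDER = (
    "\u0622\u0627\u0628\u067E\u062A\u062B\u062C\u0686\u062D\u062E"
    "\u062F\u0630\u0631\u0632\u0698\u0633\u0634\u0635\u0636\u0637"
    "\u0638\u0639\u063A\u0641\u0642\u06A9\u06AF\u0644\u0645\u0646"
    "\u0647\u0648\u06CC"
)
RANK = {ch: i for i, ch in enumerate(PERSIAN_ORDER)}


def persian_key(text):
    """Sort key: position in PERSIAN_ORDER for each character.

    Characters outside the alphabet sort after it, by code point.
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text]


letters = ["\u062A", "\u067E", "\u0628"]  # teh, peh, beh

# Binary / code point order puts peh (U+067E) after teh (U+062A).
by_code_point = sorted(letters)

# Alphabet-aware order puts peh between beh and teh.
by_alphabet = sorted(letters, key=persian_key)
```

A real deployment would rely on a UCA implementation rather than a hand-written table like this one; the sketch only shows why sorting by code point misplaces the Persian-specific letters.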

In my testing, using the uca-default collation instead of the standard uppercase collation should fix this. (I only tested that پ (U+67E) and ت (U+62A) are sorted correctly relative to each other, but since it fixes those two, I'm assuming the others are fixed too. If not we could probably write a custom collation fairly easily).

So basically, we need to enable uca-default on Wikimedia to fix this, or at least to fix it for categories.

Special page reports probably won't be fixed anytime soon unfortunately (Is there another bug for that?). pagegenerator.py probably is just using special:allpages, which probably won't be fixed in near future, but possibly the pywikipedia folks could sort the list on the client side.

When will the uca-default collation be used instead of the standard uppercase collation?

(In reply to comment #8)

When will the uca-default collation be used instead of the standard uppercase
collation?

All of our Apaches have to be upgraded to a newer version of Ubuntu first, so UCA is available. The operations team is still working on that.

Huji added a comment. Sep 18 2011, 5:40 PM

What this bug requests is implementing collations for Persian Wikipedia; therefore it is a duplicate of bug 164.

*** This bug has been marked as a duplicate of bug 164 ***

Weren't we trying to split bug 164 into multiple tracking bugs? At any rate, more work needs to be done for this, so I think it makes sense to keep this open as a dependency of bug 30673.

Re-opening on that basis.

The collation table for Persian (a.k.a. "Farsi") is given in the MimerSQL developer documentation at
http://developer.mimer.com/charts/persian.htm
which defines it with this rule:

CREATE COLLATION persian FROM eor USING
'[Arabic]'
'&#064E#<<#0650#<<#064F#<<#064B#<<#064D#<<#064C#'
'&#0621#<#0622#'
'&#0627#<<#0671#<#0621#<<#0623#<<#0672#<<#0625#'
' <<#0673#<<#0624#<<#06CC0654#<<<#06490654#<<<#0626#'
'&#06A9#<<#06AA#<<#06AB#<<#0643#<<#06AC#<<#06AD#<<#06AE#'
'&#06CF#<#0647#<<#06D5#<<#06C1#<<#0629#<<#06C3#<<#06C0#<<#06BE#'
'&#06CC#<<#0649#<<#06D2#<<#064A#<<#06D0#<<#06D1#<<#06CD#<<#06CE#'

where "eor" is the base collation used for the standard "European Ordering Rules" (defined as both an ISO standard and a CEN standard), on which most other collation orders are based, with very small tailorings. It has a few other settings that require specific adjustments, indicated by the "[Arabic]" tailoring attribute, which has the effect of reordering all Arabic blocks before all letters of other scripts (but still after the ignorables, whitespace, variables, common length marks, common currency symbols, and common digits). The rule above adds specific reordering of a few other letters (look at the collation chart).
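In the rule above, `<` marks a primary difference and `<<` a secondary one. As a rough illustration of what that distinction means for sorting, here is a minimal Python sketch of a two-level sort key. The weight tables are hypothetical values chosen for this example only (they are not taken from EOR, DUCET, or the Mimer chart), and the sketch covers just three letters: keheh (U+06A9), Arabic kaf (U+0643, which the rule places at a secondary difference from keheh), and gaf (U+06AF).

```python
# Hypothetical weights, for illustration only.
PRIMARY = {"\u06A9": 1, "\u0643": 1, "\u06AF": 2}    # keheh and kaf share a primary weight
SECONDARY = {"\u06A9": 0, "\u0643": 1, "\u06AF": 0}  # kaf differs from keheh only here


def two_level_key(text):
    """Compare all primary weights first, then all secondary weights."""
    primary = tuple(PRIMARY[ch] for ch in text)
    secondary = tuple(SECONDARY[ch] for ch in text)
    return (primary, secondary)


words = [
    "\u0643\u06A9",  # kaf + keheh
    "\u06A9\u06AF",  # keheh + gaf
    "\u06A9\u06A9",  # keheh + keheh
]
ordered = sorted(words, key=two_level_key)
```

With these weights, kaf + keheh sorts right after keheh + keheh (the two tie at the primary level and the secondary level breaks the tie), and both come before keheh + gaf, because a primary difference anywhere in the string outranks every secondary difference.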

Yes, this is different from the standard collation for the Arabic language, which is a bit simpler (and only adjusts secondary differences):

CREATE COLLATION arabic FROM eor USING
'[Arabic]'
'&#0627#<<#0622#<<#0627#<<#0621#<<#0623#<<#0625#<<#0624#<<#0626#'
'&#064A#<<#0649#'

and it is also different from the Urdu collation which is a bit more complex:

CREATE COLLATION urdu FROM eor USING
'[Arabic]'
'&#064B#<<#0652#<<#064E#<<#0650#<<#064F#<<#0670#<<#0656#<<#0657#'
' <<#064B#<<#064D#<<#064C#<<#0654#<<#0651#<<#0658#<<#0653#'
'&#0627#<<#0623#<#0622#'
'&#0648#<<#0624#'
'&#06CF#<#06C1#<<#0647#<#06BE#<#06C3#<<#0629#<#0621#'
'&#06CC#<<#0649#<<#064A#<<#0626#'
'&#0628#<#062806BE#' '&#067E#<#067E06BE#'
'&#062A#<#062A06BE#' '&#0679#<#067906BE#'
'&#062C#<#062C06BE#' '&#0686#<#068606BE#'
'&#062F#<#062F06BE#' '&#0688#<#068806BE#'
'&#0631#<#063106BE#' '&#0691#<#069106BE#'
'&#06A9#<#06A906BE#' '&#06AF#<#06AF06BE#'
'&#0644#<#064406BE#' '&#0645#<#064506BE#'
'&#0646#<#064606BE#' '&#06BA#<#06BA06BE#'
'&#0648#<#064806BE#' '&#06CC#<#06CC06BE#';

MimerSQL has defined these rules using EOR as the base collation; the CLDR project was initially based on the DUCET collation, but is now using a different base collation (a modified DUCET), which is closer to the standard EOR (but still different).

Note that MimerSQL, just like MySQL, the default Java runtime library, and the .NET CLR library, still does not support the newer syntax for contextual rules and for reordering script blocks, which for now is supported only by the most recent version of ICU; it also lacks support for the newer attributes.

The DUCET will soon be changed to become closer to the CLDR version made for ICU, but the modified DUCET in the CLDR also does not use any contextual rules (for compatibility with lots of other implementations of the UCA). For this reason, some scripts will still not sort as expected using only the CLDR rules, without using the extended syntax (for example, with the Devanagari script, see the final vowelless consonant clusters at the end of syllables).

This is even more critical for Lao, which requires a very complex syllabification that cannot be represented by a collation table, but only as a specific [Lao] attribute triggering its specific syllabification by code and sometimes dictionary lookups; the case also occurs with the collations for the Thai and Khmer languages, but in a less critical way.

So don't assume that any single DUCET (or modified DUCET from CLDR, or even the EOR collation table) will make things correct for all languages. We still need tailorings on top of any base collation, for almost all languages in all scripts!

Huji added a comment. Sep 18 2011, 10:03 PM

Having the [Arabic] block listed before the Persian-specific letters is the reason letters will not be sorted correctly according to the Persian alphabet.

For Latin-script languages, this has been solved by introducing more than one collation in the latin1 family (i.e. latin1_german_ci, latin1_swedish_ci, ...). Using utf8_general_ci is also not an option: it works for Arabic, but not for Persian (for the above-mentioned reason). The Persian community is also underrepresented in many online collaborations, so I think it is very unlikely that MySQL or other responsible authorities will introduce a new collation (something along the lines of utf8_persian_ci) just for that purpose.

In light of the above explanation, what is a pragmatic solution to this problem?

Huji added a comment. Sep 18 2011, 10:06 PM

Of note: http://bugs.mysql.com/bug.php?id=29977

More than four years old.

Unless something has changed, we're not planning to use MySQL's collation support, so this is irrelevant.

I've never said or suggested that! This is perfectly relevant for the implementation of tailorings. This is also relevant because there's documentation available for the collation needed for Persian, as well as because it is not the same as standard Arabic, or Urdu, as demonstrated...

Also, I did not use "MySQL" as the base documentation, but "MimerSQL", which does not have the bugs you have cited for MySQL (and the archived mails in those bugs are mostly about the primary level: what I cited was about the secondary level as well, forgotten in the discussions you cite, dating from 2007). MimerSQL apparently does not have these bugs, and that's why I cited it as a reference, but this does not mean that we need to use it for our code.

(In reply to comment #16)

I've never said or suggested that! This is perfectly relevant for the
implementation of tailorings. This is also relevant because there's
documentation available for the collation needed for Persian, as well as
because it is not the same as standard Arabic, or Urdu, as demonstrated...

I was referring more to Huji's comment about (what I took to be) MySQL collations. To be honest, at the time I made that comment, I had only briefly skimmed what you (Philippe) wrote.

However, with that said: since we plan to use PHP's intl extension, which is just a wrapper around the ICU library, and which from my understanding already implements Persian tailorings (and from limited testing certainly seems to), a discussion about how to implement Persian tailorings isn't that relevant either.

All the hard stuff about this bug is essentially done (mostly by other libraries). Basically, what's left is some loose ends related to being able to select which locale to use.

It's still interesting to know which version of ICU (and of its implemented CLDR data) is used in PHP's "intl" extension, or how it plans to support the expected change which will very likely occur soon.

(It is already being discussed on the internal Unicode mailing list, aka "unicore" for Unicode members, and on the CLDR mailing list, with ICU authors leading this CLDR discussion, but from which the authors of PHP "intl" seem to be absent, following only what is found in the CLDR releases! It is also being discussed in the associated ISO working group maintaining the international collation standard, referenced by both the Unicode UCA technical standard and by the CLDR project in the LDML specifications and in the design of tailoring rules.)

More changes will appear soon in the next Unicode and CLDR versions (notably, the DUCET will be significantly modified in the UTS to become closer to what is used in CLDR, and there should be changes to natively support the EOR collation supported by ISO and CEN standards).

(In reply to comment #19)

It's still interesting to know which version of ICU (and of its implemented
CLDR data) is used in PHP's "intl" extension...

I believe that depends on what version of ICU was available when intl was compiled. On my system it's using 4.4.2 (according to phpinfo()), which I believe corresponds to CLDR 1.8. I imagine other people would have it compiled with a different ICU version.

So this does not match the current 2.0.1 (2011-07-18) update of CLDR, and also not the major 2.0 release (2011-05-25).

Version 1.8 is dated 2010-03-17, and still does not match Unicode 6, the current version of LDML, the modified version of the DUCET for the CLDR "root" locale, newer contextual collation tailoring rules, and the newer reordering of full scripts for specific languages that can be written in multiple scripts (e.g. Serbian, Japanese, Chinese and several of its dialects, many South or Central Asian languages). Version 1.8 also still does not work correctly for Khmer and Lao scripts, and even includes issues with Hangul (Korean).

Version tracking of PHP's "intl" extension and ICU is then needed (in addition to the PHP version, if it creates a dependency). You must be more specific than just speaking about "intl" being used in MediaWiki. By contrast, ICU remains in stricter sync with versions of Unicode UTS#10 (UCA), LDML, and CLDR data.

It's also important to track which part of the CLDR has been integrated when compiling ICU for the PHP "intl" extension, and which specific tailoring data have been built into that ICU module (or as external datafiles).

And as far as I know, ICU still does not natively implement the EOR collation (as defined equivalently in ISO and CEN standards); it also has some experimental code for future proposed or pending updates to these collation standards (including a refined, contextual, definition of static "collation levels", in order to later deprecate some of the too many existing "attributes" which often lack a stricter formal definition for interoperability).

So this does not match the current 2.0.1 (2011-07-18) update of CLDR, and also
not the major 2.0 release (2011-05-25).

Well I installed via apt-get, which I'm sure is a little dated. If you installed via some other means, it'd probably be more up to date.

Version 1.8 is dated 2010-03-17, and still does not match Unicode 6, the
current version of LDML, the modified version of the DUCET for the CLDR "root"
locale, newer contextual collation tailoring rules, and the newer reordering of

[..]

Version tracking of PHP's "intl" extension and ICU is then needed (in addition
to the PHP version, if it creates a dependency). You must be more specific than
just speaking about "intl" being used in MediaWiki. By contrast, ICU remains
in stricter sync with versions of Unicode UTS#10 (UCA), LDML, and CLDR data.
It's also important to track which part of the CLDR has been integrated when
compiling ICU for the PHP "intl" extension, and which specific tailoring data
have been built into that ICU module (or as external datafiles).
And as far as I know, ICU still does not natively implement the EOR collation
(as defined equivalently in ISO and CEN standards); it also has some
experimental code for future proposed or pending updates to these collation
standards (including a refined, contextual, definition of static "collation
levels", in order to later deprecate some of the too many existing "attributes"
which often lack a stricter formal definition for interoperability).

Why? How does this affect us (beyond the obvious: people using older versions get crappier collation support)?

(In reply to comment #22)

Why? How does this affect us (beyond the obvious: people using older versions
get crappier collation support)?

Look at the many changes documented on the CLDR site: each version has a log listing these changes in the bug tracker, as well as a summary report.

Yes: since CLDR 1.8 (based on the Unicode 5.0 subset of the UCS, plus only 4 additional characters that were standardized soon after in a minor update of Unicode 5, and in sync with ISO 14651:2007), there have been significant changes that affect Persian sorting (as well as Urdu) for cases specific to languages other than Arabic that are written with the Arabic script, as well as changes to the Bidi algorithm (before the Bidi classes were frozen).

Given that the major release 6.0 of Unicode has been out for months now (as well as the 2011 release of ISO 10646, now in its second generation), that the Unicode DUCET was released at the same time, and that the CLDR project also integrated it before proposing a new extension format for easier and stable tailorings, the ISO 14651 standard should be updated soon (there are still a few discussions about a few cases, notably for Lao, Hindi, and variable elements).

Then look at the ICU version log which also has its own buglist and tracker.

(In reply to comment #9)

(In reply to comment #8)

When will the uca-default collation be used instead of the standard uppercase
collation?

All of our Apaches have to be upgraded to a newer version of Ubuntu first, so
UCA is available. The operations team is still working on that.

Is the uca-default collation installed?

Is the uca-default collation installed?

Yes. It is currently enabled at pt.wikipedia.org

However, it may sort some characters incorrectly until we support the tailored collations. Specifically, everything that's coloured blue on http://collation-charts.org/icu442/icu442-fa.html will probably sort incorrectly.

The bug is with these letters (listed with their UTF-8 bytes in hex):
گ DAAF
ک DAA9
ژ DA98
پ D9BE
ی DB8C
چ DA86
Now they are shown at the end. These characters are not used in the Arabic language, and because of that they are sorted incorrectly, for example in
http://fa.wikipedia.org/wiki/%D8%B1%D8%AF%D9%87:%DA%A9%D8%B4%D9%88%D8%B1%D9%87%D8%A7%DB%8C_%D8%A2%D8%B3%DB%8C%D8%A7%DB%8C%DB%8C

They should be sorted as below, with آ first:

آ-ا-ب-پ-ت-ث-ج-چ-ح-خ-د-ذ-ر-ز-ژ-س-ش-ص-ض-ط-ظ-ع-غ-ف-ق-ک-گ-ل-م-ن-ه-و-ی

Also, the blue rectangles are Urdu, not Farsi, except ه.
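The hex values listed in the comment above match the UTF-8 encodings of the six letters, which can be checked with a short Python sketch. The dictionary names and the romanized letter names used as keys are labels chosen for this example only.

```python
# The six Persian-specific letters named in the comment above.
letters = {
    "gaf": "\u06AF",
    "keheh": "\u06A9",
    "jeh": "\u0698",
    "peh": "\u067E",
    "farsi yeh": "\u06CC",
    "tcheh": "\u0686",
}

# UTF-8 bytes of each letter, as upper-case hex.
utf8_hex = {name: ch.encode("utf-8").hex().upper() for name, ch in letters.items()}

# Every one of these code points lies above U+064A (Arabic yeh), so a plain
# code point sort places them after the basic Arabic alphabet.
all_above_arabic_yeh = all(ord(ch) > 0x064A for ch in letters.values())

for name, ch in letters.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_hex[name]}")
```

This agrees with the observation that the letters end up at the end of the list under binary sorting.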

(This doesn't really depend on bug 30996, removing the dependency. If uca-default is suitable, it should block it. If it's not, support should be implemented based on I838484b9 and this should be marked as blocking bug 45443.)

This should be do-able now.

Reedy added a comment. May 16 2013, 8:14 PM

(In reply to comment #29)

This should be do-able now.

Do-able in what sense? Code-able or deploy-able?

Oh, I thought it was deployable, but it looks like there's still some code to do. (There is definitely support in the ICU library. It's not in the array in Collation.php.)

Related URL: https://gerrit.wikimedia.org/r/64251 (Gerrit Change I3c30824f7d133cf615ec7c2c39d31f27c39f89fe)

Change 64251 merged by jenkins-bot:
Add fa to collation list.

https://gerrit.wikimedia.org/r/64251

Marking this as fixed, as the ability to do this has been implemented in MediaWiki proper.

Bug 50311 is now about deploying this in 'fa' wikis.

(In reply to comment #33)

Change 64251 merged by jenkins-bot:
Add fa to collation list.
https://gerrit.wikimedia.org/r/64251
Based on http://collation-charts.org/icu442/icu442-fa.html
Should be verified by a native speaker.

As a native speaker I confirm http://collation-charts.org/icu442/icu442-fa.html

Huji added a comment. Jun 28 2013, 11:19 PM

Second that.