Recently categories at fa.wikipedia sort wrongly the pages which are started with ا and category show them at ء group
for example at here the page بو اوتاس (the second page at the category) should be under ا not ء.
In my opinion some changes since last month cause this bug we don't have this bug before last month and it may be because of uca-fa collation or ICU
uca-fa's story started from T32287
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Resolved | kaldari | T139110 uca-fa collation shows pages starting with ا incorrectly under ء | |||
Resolved | MoritzMuehlenhoff | T143931 Update ICU version to 55.1 |
Event Timeline
Change 300001 had a related patch set uploaded (by Ladsgroup):
Do not collate "ا" in Persian language for category sort
Change 300001 merged by jenkins-bot:
Do not collate "ا" and "و" in Persian language for category sort
That patch doesn't seem to resolve the issue. Ladsgroup and I had a pretty long chat on #mediawiki on IRC trying to find out why and we didn't really get far.
We recently upgraded WMF's version of ICU from 4.8 to 52 (a.k.a 5.2) per T86096. My suspicion was that the three problematic characters (ء ا و) are now ordered with only secondary or tertiary difference, rather than primary as before, and they're rejected by the "Primary collision" case in IcuCollation::fetchFirstLetterData(). But they don't seem to be defined that way in http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/fa.txt, so I really don't know what's happening.
Category for testing: http://fa.wikipedia.beta.wmflabs.org/wiki/رده:صفحههای_ابهامزدایی (view sortkeys with the API: ApiSandbox link).
@Bawolff : Hey, do you know why it doesn't work? http://fa.wikipedia.beta.wmflabs.org/wiki/%D8%B1%D8%AF%D9%87:%D8%B5%D9%81%D8%AD%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%D8%A7%D8%A8%D9%87%D8%A7%D9%85%E2%80%8C%D8%B2%D8%AF%D8%A7%DB%8C%DB%8C
Pages started with "ا" or "و" must not be in section (header?) called "ء". They should have their own section. We purged the cache in "fawiki:first-letters:fa:fa:52.1:2" in IcuCollation, still not working.
> echo bin2hex( Collation::singleton()->getPrimarySortkey( "\xD8\xA1" ) ); 2627 > echo bin2hex( Collation::singleton()->getPrimarySortkey( "\xD8\xA7" ) ); 2627 > echo Collation::singleton()->getICUVersion() 52.1 >
So yeah, they appearently aren't primary different in uca-fa, but do seem to be on https://ssl.icu-project.org/icu-bin/collation.html so maybe this changed in the unicode allkeys file between unicode versions
or actually, they have different primary weights in arabic and english collation, so its farsi tailoring related
I wonder if its related to http://unicode.org/cldr/trac/ticket/6738 (Just looking at recent changes to the collation data)
Thanks for spending time on this. This is a huge issue in Persian Wikis (it's like putting articles starting with "E" and "U" under "Ê" in English Wikipedia). I wish I could be more helpful on finding out what is wrong.
@Bawolff Can we just special-case these three characters in IcuCollation::fetchFirstLetterData(), so that they are kept even with primary collisions?
If we did that, the headers would bounce back and forth, since they would be intermixed in the actual sorting output
Just to be clear, Does updating to ICU 55.1 would solve the issue? If it does, Can we just update the ICU version?
If not. I start making the xx-uca-fa.
It probably would, but it probably can't be done easily / on short notice. But maybe it can. You could file a task in SRE to ask folks to look into the possibility.
Change 327507 had a related patch set uploaded (by Ladsgroup):
Make fawiki in beta use xx-uca-fa
Mentioned in SAL (#wikimedia-releng) [2016-12-15T14:48:22Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)
Nevermind, that was only on Beta Cluster. CommTech can do the actual deployment, unless @Ladsgroup wants to handle that as well.
I can't confirm that it's fixed in beta an example. If we can be sure it's fixed, I'll do the prod part too.
@matmarex: One of users has suggested we remove the tailorings and leave it to the ICU. What do you think?
I do not see how that would help. (Although I also do not see why the current approach did not actually work…) Unless you mean actually switching from uca-fa to uca-default or something like that?
I spent some time today setting up a Ubuntu 14 virtual machine with ICU 52 to test this. It doesn't work because the "Remove duplicate prefixes" code in IcuCollation is still removing the letters, because the primary sortkey for ء is still a prefix of the tertiary sortkeys for ا and و.
Change 329493 had a related patch set uploaded (by Bartosz Dziewoński):
CollationFa: Third time's the charm
Mentioned in SAL (#wikimedia-releng) [2017-01-07T10:17:26Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)
Okay, I cleaned the cache as much as possible. Pages with one letter get into their right place but pages with more than one letter stay under incorrect one: here's the example. I even made a new page to be sure it's not because of caching.
Change 354598 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Hack around icu breakage for fa sorting
@Bawolff is getting it fixed in the hackathon (Thanks!) and I'd say let's merge it because that ICU is not being used in production and only beta cluster and we can simply test it by merging it.
Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting
Change 357951 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Change Persian Wikis from uca-fa to xx-uca-fa
Change 357951 merged by jenkins-bot:
[operations/mediawiki-config@master] Change Persian Wikis from uca-fa to xx-uca-fa
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:21:16Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:22:42Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikisource --previous-collation=uca-fa
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:23:14Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawiktionary --previous-collation=uca-fa
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:24:39Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikibooks --previous-collation=uca-fa
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:24:49Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikinews --previous-collation=uca-fa
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:25:30Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikiquote --previous-collation=uca-fa
Most wikis have been updated. fawiki is still ongoing though since it is quite large.
Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:38:07Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Change Persian Wikis from uca-fa to xx-uca-fa - T139110 (duration: 00m 41s)
fawikisourcehas some left over:
[fawikisource]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation; +--------------+----------+ | cl_collation | count(*) | +--------------+----------+ | uca-fa | 12 | | xx-uca-fa | 53010 | +--------------+----------+ 2 rows in set (0.03 sec)
We would probably want to switch fawikivoyage which currently uses uppercase has a collation.
Is that from tool labs or actual production? In the past replication issues have lead to tool labs reporting there are still rows when there actually isnt. In any case its fine to run the maintenance script multiple times (as long as you dont use --force) if some rows are left.
The update script has completed on fawiki (was: mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa ).
@Bawolff the query comes the production database.
I guess I can close this fawiki task and the bawiki one that have a similar issue (T162823#3341107 ). Then eventually we can fill a new task regarding categorylinks being out of sync?
Change 433725 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/core@master] Remove everything related to CollationFa
Change 433725 merged by jenkins-bot:
[mediawiki/core@master] Remove everything related to CollationFa