uca-fa collation shows pages starting with ا incorrectly under ء
Closed, ResolvedPublic

Description

Recently categories at fa.wikipedia sort wrongly the pages which are started with ا and category show them at ء group
for example at here the page بو اوتاس (the second page at the category) should be under ا not ء.
In my opinion some changes since last month cause this bug we don't have this bug before last month and it may be because of uca-fa collation or ICU
uca-fa's story started from T32287

Yamaha5 created this task.Jun 30 2016, 8:36 PM
Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptJun 30 2016, 8:36 PM
Yamaha5 updated the task description. (Show Details)Jun 30 2016, 8:40 PM
Aklapper renamed this task from uca-fa collation has bug to uca-fa collation shows pages starting with ا incorrectly under ء.Jul 3 2016, 9:44 PM

Change 300001 had a related patch set uploaded (by Ladsgroup):
Do not collate "ا" in Persian language for category sort

https://gerrit.wikimedia.org/r/300001

The characters have this issue: "ا" and "و"

Change 300001 merged by jenkins-bot:
Do not collate "ا" and "و" in Persian language for category sort

https://gerrit.wikimedia.org/r/300001

That patch doesn't seem to resolve the issue. Ladsgroup and I had a pretty long chat on #mediawiki on IRC trying to find out why and we didn't really get far.

We recently upgraded WMF's version of ICU from 4.8 to 52 (a.k.a 5.2) per T86096. My suspicion was that the three problematic characters (ء ا و) are now ordered with only secondary or tertiary difference, rather than primary as before, and they're rejected by the "Primary collision" case in IcuCollation::fetchFirstLetterData(). But they don't seem to be defined that way in http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/fa.txt, so I really don't know what's happening.

Category for testing: http://fa.wikipedia.beta.wmflabs.org/wiki/رده:صفحه‌های_ابهام‌زدایی (view sortkeys with the API: ApiSandbox link).

Ladsgroup added a comment.EditedJul 20 2016, 9:10 PM

@Bawolff : Hey, do you know why it doesn't work? http://fa.wikipedia.beta.wmflabs.org/wiki/%D8%B1%D8%AF%D9%87:%D8%B5%D9%81%D8%AD%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%D8%A7%D8%A8%D9%87%D8%A7%D9%85%E2%80%8C%D8%B2%D8%AF%D8%A7%DB%8C%DB%8C

Pages started with "ا" or "و" must not be in section (header?) called "ء". They should have their own section. We purged the cache in "fawiki:first-letters:fa:fa:52.1:2" in IcuCollation, still not working.

> echo bin2hex( Collation::singleton()->getPrimarySortkey( "\xD8\xA1" ) );
2627
> echo bin2hex( Collation::singleton()->getPrimarySortkey( "\xD8\xA7" ) );
2627
> echo Collation::singleton()->getICUVersion()
52.1
>

So yeah, they appearently aren't primary different in uca-fa, but do seem to be on https://ssl.icu-project.org/icu-bin/collation.html so maybe this changed in the unicode allkeys file between unicode versions

or actually, they have different primary weights in arabic and english collation, so its farsi tailoring related

I wonder if its related to http://unicode.org/cldr/trac/ticket/6738 (Just looking at recent changes to the collation data)

So honestly, I'm really not sure.

Don't suppose we can upgrade libicu :P

Ladsgroup added a comment.EditedJul 22 2016, 10:30 PM

Thanks for spending time on this. This is a huge issue in Persian Wikis (it's like putting articles starting with "E" and "U" under "Ê" in English Wikipedia). I wish I could be more helpful on finding out what is wrong.

I don't get the issue locally with ICU 55.1 (on Ubuntu 15.10):

@Bawolff Can we just special-case these three characters in IcuCollation::fetchFirstLetterData(), so that they are kept even with primary collisions?

@Bawolff Can we just special-case these three characters in IcuCollation::fetchFirstLetterData(), so that they are kept even with primary collisions?

If we did that, the headers would bounce back and forth, since they would be intermixed in the actual sorting output

I guess worst case, we could do something similar to xx-uca-et

Just to be clear, Does updating to ICU 55.1 would solve the issue? If it does, Can we just update the ICU version?
If not. I start making the xx-uca-fa.

It probably would, but it probably can't be done easily / on short notice. But maybe it can. You could file a task in Operations to ask folks to look into the possibility.

Change 302096 had a related patch set uploaded (by Ladsgroup):
[WIP] Add CollationFa

https://gerrit.wikimedia.org/r/302096

@Bawolff and @matmarex: I made this patch ready. Can you take a look at it? Thanks.

Change 302096 merged by jenkins-bot:
Add CollationFa

https://gerrit.wikimedia.org/r/302096

@kaldari Will you take CommTech be taking care of deploying the
collation change?

Change 327507 had a related patch set uploaded (by Ladsgroup):
Make fawiki in beta use xx-uca-fa

https://gerrit.wikimedia.org/r/327507

Change 327507 merged by jenkins-bot:
Make fawiki in beta use xx-uca-fa

https://gerrit.wikimedia.org/r/327507

Mentioned in SAL (#wikimedia-releng) [2016-12-15T14:48:22Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)

kaldari closed this task as Resolved.Dec 15 2016, 10:18 PM
kaldari claimed this task.

@matmarex: Looks like @Ladsgroup beat me to it :)

kaldari reopened this task as Open.Dec 15 2016, 10:19 PM

Nevermind, that was only on Beta Cluster. CommTech can do the actual deployment, unless @Ladsgroup wants to handle that as well.

I can't confirm that it's fixed in beta an example. If we can be sure it's fixed, I'll do the prod part too.

@matmarex: One of users has suggested we remove the tailorings and leave it to the ICU. What do you think?

I do not see how that would help. (Although I also do not see why the current approach did not actually work…) Unless you mean actually switching from uca-fa to uca-default or something like that?

I spent some time today setting up a Ubuntu 14 virtual machine with ICU 52 to test this. It doesn't work because the "Remove duplicate prefixes" code in IcuCollation is still removing the letters, because the primary sortkey for ء is still a prefix of the tertiary sortkeys for ا and و.

Change 329493 had a related patch set uploaded (by Bartosz Dziewoński):
CollationFa: Third time's the charm

https://gerrit.wikimedia.org/r/329493

Huji added a subscriber: Huji.Dec 28 2016, 5:30 PM

Change 329493 merged by jenkins-bot:
CollationFa: Third time's the charm

https://gerrit.wikimedia.org/r/329493

Mentioned in SAL (#wikimedia-releng) [2017-01-07T10:17:26Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)

Okay, I cleaned the cache as much as possible. Pages with one letter get into their right place but pages with more than one letter stay under incorrect one: here's the example. I even made a new page to be sure it's not because of caching.

Change 354598 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

@Bawolff is getting it fixed in the hackathon (Thanks!) and I'd say let's merge it because that ICU is not being used in production and only beta cluster and we can simply test it by merging it.

Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

I suppose we need to run the updateCollation.php script on beta now to verify?

Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

@Bawolff If at all convenient, please do add tests for this.

I suppose we need to run the updateCollation.php script on beta now to verify?

With --force

Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

@Bawolff If at all convenient, please do add tests for this.

https://gerrit.wikimedia.org/r/356067

Change 357951 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Change Persian Wikis from uca-fa to xx-uca-fa

https://gerrit.wikimedia.org/r/357951

Change 357951 merged by jenkins-bot:
[operations/mediawiki-config@master] Change Persian Wikis from uca-fa to xx-uca-fa

https://gerrit.wikimedia.org/r/357951

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:21:16Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:22:42Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikisource --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:23:14Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawiktionary --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:24:39Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikibooks --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:24:49Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikinews --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:25:30Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikiquote --previous-collation=uca-fa

hashar added a subscriber: hashar.Jun 12 2017, 1:28 PM

Most wikis have been updated. fawiki is still ongoing though since it is quite large.

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:38:07Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Change Persian Wikis from uca-fa to xx-uca-fa - T139110 (duration: 00m 41s)

fawikisourcehas some left over:

[fawikisource]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation;
+--------------+----------+
| cl_collation | count(*) |
+--------------+----------+
| uca-fa       |       12 |
| xx-uca-fa    |    53010 |
+--------------+----------+
2 rows in set (0.03 sec)

We would probably want to switch fawikivoyage which currently uses uppercase has a collation.

fawikisourcehas some left over:

[fawikisource]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation;
+--------------+----------+
| cl_collation | count(*) |
+--------------+----------+
| uca-fa       |       12 |
| xx-uca-fa    |    53010 |
+--------------+----------+
2 rows in set (0.03 sec)

We would probably want to switch fawikivoyage which currently uses uppercase has a collation.

Is that from tool labs or actual production? In the past replication issues have lead to tool labs reporting there are still rows when there actually isnt. In any case its fine to run the maintenance script multiple times (as long as you dont use --force) if some rows are left.

The update script has completed on fawiki (was: mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa ).

@Bawolff the query comes the production database.

I guess I can close this fawiki task and the bawiki one that have a similar issue (T162823#3341107 ). Then eventually we can fill a new task regarding categorylinks being out of sync?

matmarex closed this task as Resolved.Jun 12 2017, 8:23 PM