Page MenuHomePhabricator

uca-fa collation shows pages starting with ا incorrectly under ء
Closed, ResolvedPublic

Description

Recently categories at fa.wikipedia sort wrongly the pages which are started with ا and category show them at ء group
for example at here the page بو اوتاس (the second page at the category) should be under ا not ء.
In my opinion some changes since last month cause this bug we don't have this bug before last month and it may be because of uca-fa collation or ICU
uca-fa's story started from T32287

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Aklapper renamed this task from uca-fa collation has bug to uca-fa collation shows pages starting with ا incorrectly under ء.Jul 3 2016, 9:44 PM

Change 300001 had a related patch set uploaded (by Ladsgroup):
Do not collate "ا" in Persian language for category sort

https://gerrit.wikimedia.org/r/300001

The characters have this issue: "ا" and "و"

Change 300001 merged by jenkins-bot:
Do not collate "ا" and "و" in Persian language for category sort

https://gerrit.wikimedia.org/r/300001

That patch doesn't seem to resolve the issue. Ladsgroup and I had a pretty long chat on #mediawiki on IRC trying to find out why and we didn't really get far.

We recently upgraded WMF's version of ICU from 4.8 to 52 (a.k.a 5.2) per T86096. My suspicion was that the three problematic characters (ء ا و) are now ordered with only secondary or tertiary difference, rather than primary as before, and they're rejected by the "Primary collision" case in IcuCollation::fetchFirstLetterData(). But they don't seem to be defined that way in http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/fa.txt, so I really don't know what's happening.

Category for testing: http://fa.wikipedia.beta.wmflabs.org/wiki/رده:صفحه‌های_ابهام‌زدایی (view sortkeys with the API: ApiSandbox link).

@Bawolff : Hey, do you know why it doesn't work? http://fa.wikipedia.beta.wmflabs.org/wiki/%D8%B1%D8%AF%D9%87:%D8%B5%D9%81%D8%AD%D9%87%E2%80%8C%D9%87%D8%A7%DB%8C_%D8%A7%D8%A8%D9%87%D8%A7%D9%85%E2%80%8C%D8%B2%D8%AF%D8%A7%DB%8C%DB%8C

Pages started with "ا" or "و" must not be in section (header?) called "ء". They should have their own section. We purged the cache in "fawiki:first-letters:fa:fa:52.1:2" in IcuCollation, still not working.

> echo bin2hex( Collation::singleton()->getPrimarySortkey( "\xD8\xA1" ) );
2627
> echo bin2hex( Collation::singleton()->getPrimarySortkey( "\xD8\xA7" ) );
2627
> echo Collation::singleton()->getICUVersion()
52.1
>

So yeah, they appearently aren't primary different in uca-fa, but do seem to be on https://ssl.icu-project.org/icu-bin/collation.html so maybe this changed in the unicode allkeys file between unicode versions

or actually, they have different primary weights in arabic and english collation, so its farsi tailoring related

I wonder if its related to http://unicode.org/cldr/trac/ticket/6738 (Just looking at recent changes to the collation data)

So honestly, I'm really not sure.

Don't suppose we can upgrade libicu :P

Thanks for spending time on this. This is a huge issue in Persian Wikis (it's like putting articles starting with "E" and "U" under "Ê" in English Wikipedia). I wish I could be more helpful on finding out what is wrong.

I don't get the issue locally with ICU 55.1 (on Ubuntu 15.10):

pasted_file (946×1 px, 68 KB)

@Bawolff Can we just special-case these three characters in IcuCollation::fetchFirstLetterData(), so that they are kept even with primary collisions?

@Bawolff Can we just special-case these three characters in IcuCollation::fetchFirstLetterData(), so that they are kept even with primary collisions?

If we did that, the headers would bounce back and forth, since they would be intermixed in the actual sorting output

I guess worst case, we could do something similar to xx-uca-et

Just to be clear, Does updating to ICU 55.1 would solve the issue? If it does, Can we just update the ICU version?
If not. I start making the xx-uca-fa.

It probably would, but it probably can't be done easily / on short notice. But maybe it can. You could file a task in SRE to ask folks to look into the possibility.

Change 302096 had a related patch set uploaded (by Ladsgroup):
[WIP] Add CollationFa

https://gerrit.wikimedia.org/r/302096

@Bawolff and @matmarex: I made this patch ready. Can you take a look at it? Thanks.

@kaldari Will you take CommTech be taking care of deploying the
collation change?

Change 327507 had a related patch set uploaded (by Ladsgroup):
Make fawiki in beta use xx-uca-fa

https://gerrit.wikimedia.org/r/327507

Change 327507 merged by jenkins-bot:
Make fawiki in beta use xx-uca-fa

https://gerrit.wikimedia.org/r/327507

Mentioned in SAL (#wikimedia-releng) [2016-12-15T14:48:22Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)

kaldari claimed this task.

@matmarex: Looks like @Ladsgroup beat me to it :)

Nevermind, that was only on Beta Cluster. CommTech can do the actual deployment, unless @Ladsgroup wants to handle that as well.

I can't confirm that it's fixed in beta an example. If we can be sure it's fixed, I'll do the prod part too.

@matmarex: One of users has suggested we remove the tailorings and leave it to the ICU. What do you think?

I do not see how that would help. (Although I also do not see why the current approach did not actually work…) Unless you mean actually switching from uca-fa to uca-default or something like that?

I spent some time today setting up a Ubuntu 14 virtual machine with ICU 52 to test this. It doesn't work because the "Remove duplicate prefixes" code in IcuCollation is still removing the letters, because the primary sortkey for ء is still a prefix of the tertiary sortkeys for ا and و.

Change 329493 had a related patch set uploaded (by Bartosz Dziewoński):
CollationFa: Third time's the charm

https://gerrit.wikimedia.org/r/329493

Change 329493 merged by jenkins-bot:
CollationFa: Third time's the charm

https://gerrit.wikimedia.org/r/329493

Mentioned in SAL (#wikimedia-releng) [2017-01-07T10:17:26Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript updateCollation.php --wiki=fawiki (T139110)

Okay, I cleaned the cache as much as possible. Pages with one letter get into their right place but pages with more than one letter stay under incorrect one: here's the example. I even made a new page to be sure it's not because of caching.

Change 354598 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

@Bawolff is getting it fixed in the hackathon (Thanks!) and I'd say let's merge it because that ICU is not being used in production and only beta cluster and we can simply test it by merging it.

Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

I suppose we need to run the updateCollation.php script on beta now to verify?

Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

@Bawolff If at all convenient, please do add tests for this.

I suppose we need to run the updateCollation.php script on beta now to verify?

With --force

Change 354598 merged by jenkins-bot:
[mediawiki/core@master] Hack around icu breakage for fa sorting

https://gerrit.wikimedia.org/r/354598

@Bawolff If at all convenient, please do add tests for this.

https://gerrit.wikimedia.org/r/356067

Change 357951 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/mediawiki-config@master] Change Persian Wikis from uca-fa to xx-uca-fa

https://gerrit.wikimedia.org/r/357951

Change 357951 merged by jenkins-bot:
[operations/mediawiki-config@master] Change Persian Wikis from uca-fa to xx-uca-fa

https://gerrit.wikimedia.org/r/357951

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:21:16Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:22:42Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikisource --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:23:14Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawiktionary --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:24:39Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikibooks --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:24:49Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikinews --previous-collation=uca-fa

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:25:30Z] <hashar> terbium: for T139110 mwscript updateCollation.php --wiki=fawikiquote --previous-collation=uca-fa

Most wikis have been updated. fawiki is still ongoing though since it is quite large.

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:38:07Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Change Persian Wikis from uca-fa to xx-uca-fa - T139110 (duration: 00m 41s)

fawikisourcehas some left over:

[fawikisource]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation;
+--------------+----------+
| cl_collation | count(*) |
+--------------+----------+
| uca-fa       |       12 |
| xx-uca-fa    |    53010 |
+--------------+----------+
2 rows in set (0.03 sec)

We would probably want to switch fawikivoyage which currently uses uppercase has a collation.

fawikisourcehas some left over:

[fawikisource]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation;
+--------------+----------+
| cl_collation | count(*) |
+--------------+----------+
| uca-fa       |       12 |
| xx-uca-fa    |    53010 |
+--------------+----------+
2 rows in set (0.03 sec)

We would probably want to switch fawikivoyage which currently uses uppercase has a collation.

Is that from tool labs or actual production? In the past replication issues have lead to tool labs reporting there are still rows when there actually isnt. In any case its fine to run the maintenance script multiple times (as long as you dont use --force) if some rows are left.

The update script has completed on fawiki (was: mwscript updateCollation.php --wiki=fawiki --previous-collation=uca-fa ).

@Bawolff the query comes the production database.

I guess I can close this fawiki task and the bawiki one that have a similar issue (T162823#3341107 ). Then eventually we can fill a new task regarding categorylinks being out of sync?

Change 433725 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/core@master] Remove everything related to CollationFa

https://gerrit.wikimedia.org/r/433725

Change 433725 merged by jenkins-bot:
[mediawiki/core@master] Remove everything related to CollationFa

https://gerrit.wikimedia.org/r/433725