Add proper category collation for the Northern Sami Wikipedia
Closed, Resolved · Public

Description

The category collation (sorting order) for the Northern Sami Wikipedia is incorrect, as best exemplified with the category for letters of the Sami alphabet (compare to the alphabetical order). According to https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation support for Northern Sami exists, but is not deployed.

I have asked for community input here, and will upload a patch for this shortly, so that when a week has passed it can be added (unless, of course, there are objections from the community, but I can't imagine why there would be).

Event Timeline

Restricted Application added subscribers: jhsoby, Aklapper. · Nov 28 2017, 1:44 PM

Change 393762 had a related patch set uploaded (by Jon Harald Søby; owner: Jon Harald Søby):
[operations/mediawiki-config@master] Add category collation for sewiki

https://gerrit.wikimedia.org/r/393762

jhsoby-WMNO updated the task description. · Nov 28 2017, 2:07 PM
jhsoby-WMNO moved this task from Incoming to In progress on the WMNO-Sami board.

It has now been one week without any objections on-wiki, so I'm adding this to today's SWAT.

Change 393762 merged by jenkins-bot:
[operations/mediawiki-config@master] Add category collation for sewiki

https://gerrit.wikimedia.org/r/393762

zfilipin@terbium:~$ mwscript updateCollation.php --wiki=sewiki --previous-collation=uppercase
Fixing collation for 26031 rows.
Selecting next 100 rows... processing...100 done.
Selecting next 100 rows... processing...200 done.
Selecting next 100 rows... processing...300 done.
Selecting next 100 rows... processing...400 done.
Selecting next 100 rows... processing...500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...600 done.
Selecting next 100 rows... processing...700 done.
Selecting next 100 rows... processing...800 done.
Selecting next 100 rows... processing...900 done.
Selecting next 100 rows... processing...1000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...1100 done.
Selecting next 100 rows... processing...1200 done.
Selecting next 100 rows... processing...1300 done.
Selecting next 100 rows... processing...1400 done.
Selecting next 100 rows... processing...1500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...1600 done.
Selecting next 100 rows... processing...1700 done.
Selecting next 100 rows... processing...1800 done.
Selecting next 100 rows... processing...1900 done.
Selecting next 100 rows... processing...2000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...2100 done.
Selecting next 100 rows... processing...2200 done.
Selecting next 100 rows... processing...2300 done.
Selecting next 100 rows... processing...2400 done.
Selecting next 100 rows... processing...2500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...2600 done.
Selecting next 100 rows... processing...2700 done.
Selecting next 100 rows... processing...2800 done.
Selecting next 100 rows... processing...2900 done.
Selecting next 100 rows... processing...3000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...3100 done.
Selecting next 100 rows... processing...3200 done.
Selecting next 100 rows... processing...3300 done.
Selecting next 100 rows... processing...3400 done.
Selecting next 100 rows... processing...3500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...3600 done.
Selecting next 100 rows... processing...3700 done.
Selecting next 100 rows... processing...3800 done.
Selecting next 100 rows... processing...3900 done.
Selecting next 100 rows... processing...4000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...4100 done.
Selecting next 100 rows... processing...4200 done.
Selecting next 100 rows... processing...4300 done.
Selecting next 100 rows... processing...4400 done.
Selecting next 100 rows... processing...4500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...4600 done.
Selecting next 100 rows... processing...4700 done.
Selecting next 100 rows... processing...4800 done.
Selecting next 100 rows... processing...4900 done.
Selecting next 100 rows... processing...5000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...5100 done.
Selecting next 100 rows... processing...5200 done.
Selecting next 100 rows... processing...5300 done.
Selecting next 100 rows... processing...5400 done.
Selecting next 100 rows... processing...5500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...5600 done.
Selecting next 100 rows... processing...5700 done.
Selecting next 100 rows... processing...5800 done.
Selecting next 100 rows... processing...5900 done.
Selecting next 100 rows... processing...6000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...6100 done.
Selecting next 100 rows... processing...6200 done.
Selecting next 100 rows... processing...6300 done.
Selecting next 100 rows... processing...6400 done.
Selecting next 100 rows... processing...6500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...6600 done.
Selecting next 100 rows... processing...6700 done.
Selecting next 100 rows... processing...6800 done.
Selecting next 100 rows... processing...6900 done.
Selecting next 100 rows... processing...7000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...7100 done.
Selecting next 100 rows... processing...7200 done.
Selecting next 100 rows... processing...7300 done.
Selecting next 100 rows... processing...7400 done.
Selecting next 100 rows... processing...7500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...7600 done.
Selecting next 100 rows... processing...7700 done.
Selecting next 100 rows... processing...7800 done.
Selecting next 100 rows... processing...7900 done.
Selecting next 100 rows... processing...8000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...8100 done.
Selecting next 100 rows... processing...8200 done.
Selecting next 100 rows... processing...8300 done.
Selecting next 100 rows... processing...8400 done.
Selecting next 100 rows... processing...8500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...8600 done.
Selecting next 100 rows... processing...8700 done.
Selecting next 100 rows... processing...8800 done.
Selecting next 100 rows... processing...8900 done.
Selecting next 100 rows... processing...9000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...9100 done.
Selecting next 100 rows... processing...9200 done.
Selecting next 100 rows... processing...9300 done.
Selecting next 100 rows... processing...9400 done.
Selecting next 100 rows... processing...9500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...9600 done.
Selecting next 100 rows... processing...9700 done.
Selecting next 100 rows... processing...9800 done.
Selecting next 100 rows... processing...9900 done.
Selecting next 100 rows... processing...10000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...10100 done.
Selecting next 100 rows... processing...10200 done.
Selecting next 100 rows... processing...10300 done.
Selecting next 100 rows... processing...10400 done.
Selecting next 100 rows... processing...10500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...10600 done.
Selecting next 100 rows... processing...10700 done.
Selecting next 100 rows... processing...10800 done.
Selecting next 100 rows... processing...10900 done.
Selecting next 100 rows... processing...11000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...11100 done.
Selecting next 100 rows... processing...11200 done.
Selecting next 100 rows... processing...11300 done.
Selecting next 100 rows... processing...11400 done.
Selecting next 100 rows... processing...11500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...11600 done.
Selecting next 100 rows... processing...11700 done.
Selecting next 100 rows... processing...11800 done.
Selecting next 100 rows... processing...11900 done.
Selecting next 100 rows... processing...12000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...12100 done.
Selecting next 100 rows... processing...12200 done.
Selecting next 100 rows... processing...12300 done.
Selecting next 100 rows... processing...12400 done.
Selecting next 100 rows... processing...12500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...12600 done.
Selecting next 100 rows... processing...12700 done.
Selecting next 100 rows... processing...12800 done.
Selecting next 100 rows... processing...12900 done.
Selecting next 100 rows... processing...13000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...13100 done.
Selecting next 100 rows... processing...13200 done.
Selecting next 100 rows... processing...13300 done.
Selecting next 100 rows... processing...13400 done.
Selecting next 100 rows... processing...13500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...13600 done.
Selecting next 100 rows... processing...13700 done.
Selecting next 100 rows... processing...13800 done.
Selecting next 100 rows... processing...13900 done.
Selecting next 100 rows... processing...14000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...14100 done.
Selecting next 100 rows... processing...14200 done.
Selecting next 100 rows... processing...14300 done.
Selecting next 100 rows... processing...14400 done.
Selecting next 100 rows... processing...14500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...14600 done.
Selecting next 100 rows... processing...14700 done.
Selecting next 100 rows... processing...14800 done.
Selecting next 100 rows... processing...14900 done.
Selecting next 100 rows... processing...15000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...15100 done.
Selecting next 100 rows... processing...15200 done.
Selecting next 100 rows... processing...15300 done.
Selecting next 100 rows... processing...15400 done.
Selecting next 100 rows... processing...15500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...15600 done.
Selecting next 100 rows... processing...15700 done.
Selecting next 100 rows... processing...15800 done.
Selecting next 100 rows... processing...15900 done.
Selecting next 100 rows... processing...16000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...16100 done.
Selecting next 100 rows... processing...16200 done.
Selecting next 100 rows... processing...16300 done.
Selecting next 100 rows... processing...16400 done.
Selecting next 100 rows... processing...16500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...16600 done.
Selecting next 100 rows... processing...16700 done.
Selecting next 100 rows... processing...16800 done.
Selecting next 100 rows... processing...16900 done.
Selecting next 100 rows... processing...17000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...17100 done.
Selecting next 100 rows... processing...17200 done.
Selecting next 100 rows... processing...17300 done.
Selecting next 100 rows... processing...17400 done.
Selecting next 100 rows... processing...17500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...17600 done.
Selecting next 100 rows... processing...17700 done.
Selecting next 100 rows... processing...17800 done.
Selecting next 100 rows... processing...17900 done.
Selecting next 100 rows... processing...18000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...18100 done.
Selecting next 100 rows... processing...18200 done.
Selecting next 100 rows... processing...18300 done.
Selecting next 100 rows... processing...18400 done.
Selecting next 100 rows... processing...18500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...18600 done.
Selecting next 100 rows... processing...18700 done.
Selecting next 100 rows... processing...18800 done.
Selecting next 100 rows... processing...18900 done.
Selecting next 100 rows... processing...19000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...19100 done.
Selecting next 100 rows... processing...19200 done.
Selecting next 100 rows... processing...19300 done.
Selecting next 100 rows... processing...19400 done.
Selecting next 100 rows... processing...19500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...19600 done.
Selecting next 100 rows... processing...19700 done.
Selecting next 100 rows... processing...19800 done.
Selecting next 100 rows... processing...19900 done.
Selecting next 100 rows... processing...20000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...20100 done.
Selecting next 100 rows... processing...20200 done.
Selecting next 100 rows... processing...20300 done.
Selecting next 100 rows... processing...20400 done.
Selecting next 100 rows... processing...20500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...20600 done.
Selecting next 100 rows... processing...20700 done.
Selecting next 100 rows... processing...20800 done.
Selecting next 100 rows... processing...20900 done.
Selecting next 100 rows... processing...21000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...21100 done.
Selecting next 100 rows... processing...21200 done.
Selecting next 100 rows... processing...21300 done.
Selecting next 100 rows... processing...21400 done.
Selecting next 100 rows... processing...21500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...21600 done.
Selecting next 100 rows... processing...21700 done.
Selecting next 100 rows... processing...21800 done.
Selecting next 100 rows... processing...21900 done.
Selecting next 100 rows... processing...22000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...22100 done.
Selecting next 100 rows... processing...22200 done.
Selecting next 100 rows... processing...22300 done.
Selecting next 100 rows... processing...22400 done.
Selecting next 100 rows... processing...22500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...22600 done.
Selecting next 100 rows... processing...22700 done.
Selecting next 100 rows... processing...22800 done.
Selecting next 100 rows... processing...22900 done.
Selecting next 100 rows... processing...23000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...23100 done.
Selecting next 100 rows... processing...23200 done.
Selecting next 100 rows... processing...23300 done.
Selecting next 100 rows... processing...23400 done.
Selecting next 100 rows... processing...23500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...23600 done.
Selecting next 100 rows... processing...23700 done.
Selecting next 100 rows... processing...23800 done.
Selecting next 100 rows... processing...23900 done.
Selecting next 100 rows... processing...24000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...24100 done.
Selecting next 100 rows... processing...24200 done.
Selecting next 100 rows... processing...24300 done.
Selecting next 100 rows... processing...24400 done.
Selecting next 100 rows... processing...24500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...24600 done.
Selecting next 100 rows... processing...24700 done.
Selecting next 100 rows... processing...24800 done.
Selecting next 100 rows... processing...24900 done.
Selecting next 100 rows... processing...25000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...25100 done.
Selecting next 100 rows... processing...25200 done.
Selecting next 100 rows... processing...25300 done.
Selecting next 100 rows... processing...25400 done.
Selecting next 100 rows... processing...25500 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...25600 done.
Selecting next 100 rows... processing...25700 done.
Selecting next 100 rows... processing...25800 done.
Selecting next 100 rows... processing...25900 done.
Selecting next 100 rows... processing...26000 done.
Waiting for replica DBs ... done
Selecting next 100 rows... processing...26031 done.
26031 rows processed
zfilipin@terbium:~$ mwscript updateCollation.php --wiki=sewiki --previous-collation=uca-se-u-kn
Collations up-to-date.

Mentioned in SAL (#wikimedia-operations) [2017-12-05T14:40:14Z] <zfilipin@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:395540|Revert "Add category collation for sewiki" (T181503)]] (duration: 00m 44s)

jhsoby-WMNO added a subscriber: kaldari. · Edited · Dec 5 2017, 3:07 PM

@kaldari, IIRC you've been working on collation before, so I'm pinging you. If you're not the right person to ask, I'd be grateful for any pointer as to who is.

The patch that set the collation to "uca-se-u-kn" was merged and the script was run (mwscript updateCollation.php --wiki=sewiki --previous-collation=uppercase), but the result was not as expected. I used this category for testing, since it has all native Northern Sami letters. The expected result was that the headings would remain, but would be reordered as follows:

  • Á after A; Č after C; Đ after D; Ŋ after N; Š after S; Ŧ after T; and Ž after Z

Instead what we got was:

  • Á sorted under the heading A; Č sorted under the heading C; Š sorted under the heading S; Ž sorted under the heading Z
  • Đ, Ŋ and Ŧ remained in the current positions

Do you have any idea what is wrong? The correct headers are set in includes/IcuCollation.php (lines 203–206). Manual:$wgCategoryCollation#Language-specific collations says that se is supported. The only thing I think might be wrong is what you, @kaldari, described in this diff and in P4286 – that list does not include se, in which case I think "uca-se-u-kn" would fall back to "uca-default-u-kn", and treat Á, Č, Š and Ž as accented variants of their base letters, and Đ, Ŋ and Ŧ as separate letters. Do you think I'm onto something here?
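For anyone following along, the split between the two groups of letters lines up with Unicode canonical decomposition: Á, Č, Š and Ž decompose into a base letter plus a combining mark, while Đ, Ŋ and Ŧ have no decomposition. The Python sketch below only illustrates that; it is not how ICU computes collation keys, and the helper name `root_like_heading` is made up for this example.

```python
import unicodedata


def root_like_heading(letter):
    """Approximate the heading a letter lands under without a language tailoring.

    Assumption: a letter that canonically decomposes into base + combining
    mark is grouped with its base letter; anything else keeps its own heading.
    """
    # NFD splits e.g. "Á" into "A" followed by U+0301 (combining acute accent)
    decomposed = unicodedata.normalize("NFD", letter)
    # Drop the combining marks and keep only the base character(s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base[:1].upper()


for letter in "ÁČĐŊŠŦŽ":
    print(letter, "->", root_like_heading(letter))
```

In this sketch Á, Č, Š and Ž map to A, C, S and Z, while Đ, Ŋ and Ŧ map to themselves, which matches what we saw in the test category.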

@jhsoby-WMNO: That's correct. Even though MediaWiki now supports collation for Northern Sami, the Wikimedia production servers don't. The steps for that to happen are:

  • Get it added into ICU library (done)
  • Wait for a new version of the PHP intl extension that has the new ICU code in it
  • Upgrade PHP on the Wikimedia production servers

Unfortunately, the last two steps may take several years.

In the meantime, I would recommend switching to uca-default-u-kn. Right now it is still defaulting to uppercase.

Thanks, @kaldari!

Is there no other way of solving this temporarily, like how xx-uca-et is set up especially for etwiki? It seems the PHP intl extension hasn't been updated for more than four years (many releases from 2011 till 2013, then nothing), so it could be a very long wait. Looking into the source code for PECL-intl on GitHub, I see that @Smalyshev is one of the main contributors there, so I'm shamelessly adding him on this bug as a sort of ping. :-)

I'm not sure about switching to uca-default, because it treats C/Č, D/Đ, etc. as equivalent, which they are not. At least with the current sorting they are sorted separately, but in the wrong order.

@jhsoby-WMNO: Yes, I believe it would be possible for us to create our own collation similar to BashkirUppercaseCollation.php.

The PECL intl extension has not been updated because it has been integrated into PHP core, and all development is happening there. The PECL one is there only to enable building it with PHP versions from before the merge, which by now should be mostly dead.

The intl extension uses the CLDR dataset from libicu, so if the result is not correct, we should check which libicu it was built against and whether there is a more recent one we could use. If the ICU dataset is broken (which is possible; it has happened in the past) or we can't use a proper one, then we could use our own.

@Smalyshev: Last I checked, the intl extension we're using in production is based on ICU 52.1, which doesn't include Northern Sami (se). It is, however, included in current versions of libicu. Could we swap out our intl extension for a newer build? If not, I'm fine with creating a custom collation class.

FYI, if we end up creating our own custom collation class, it should be based on https://ssl.icu-project.org/trac/browser/icu/tags/release-58-1/source/data/coll/se.txt.

Debian stretch has libicu 57, which also seems to have se, and stretch's intl extension is using it. So if we migrate there (T174431), it should be improved.

Change 396007 had a related patch set uploaded (by Jon Harald Søby; owner: Jon Harald Søby):
[mediawiki/core@master] Add custom collation for Northern Sami

https://gerrit.wikimedia.org/r/396007

I uploaded a patch (as you can see) that uses the same system as BashkirUppercaseCollation.php. I haven't been able to test it locally – I've tried and failed to set up MediaWiki on localhost – so if any of you could test it I'd be very grateful.

The patch uses the letters from se.txt from T181503#3817996, but without secondary-level letters (so it only uses letters with a single ">" before them).
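For readers who haven't looked at BashkirUppercaseCollation.php, the general idea of an explicit-alphabet collation can be sketched roughly as below. This is a Python illustration, not the MediaWiki code, and the alphabet is an assumption: it contains only basic Latin A-Z plus the seven letters listed earlier in this task, not the full letter list the patch takes from se.txt. The names `ALPHABET`, `sort_key` and `heading` are invented for the example.

```python
import string

# Assumption: each Northern Sami letter is placed right after its base letter.
EXTRA_AFTER = {"A": "Á", "C": "Č", "D": "Đ", "N": "Ŋ", "S": "Š", "T": "Ŧ", "Z": "Ž"}

ALPHABET = []
for letter in string.ascii_uppercase:
    ALPHABET.append(letter)
    if letter in EXTRA_AFTER:
        ALPHABET.append(EXTRA_AFTER[letter])

# Position of every known letter in the custom alphabet
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def sort_key(title):
    """Build a sort key from the uppercased title, one entry per character."""
    key = []
    for ch in title.upper():
        if ch in RANK:
            key.append((0, RANK[ch]))   # known letter: use alphabet position
        else:
            key.append((1, ord(ch)))    # anything else: after the alphabet, by code point
    return key


def heading(title):
    """Heading is simply the uppercased first character of the title."""
    return title[:1].upper()


letters = ["Ž", "Z", "Ŧ", "T", "Š", "S", "Ŋ", "N", "Đ", "D", "Č", "C", "Á", "A"]
print(sorted(letters, key=sort_key))
```

Every letter in the list is a primary letter with its own heading here, which is why this kind of collation cannot express the secondary-level letters from se.txt.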

@jhsoby-WMNO Thanks for the patch! I tested locally by creating the pages with the same titles as the test category (https://se.wikipedia.org/wiki/Kategoriija:Sámegielaid_alfabehta) and changing the collation of my test wiki. It looks fine.

I compared your patch to the data in https://ssl.icu-project.org/trac/browser/icu/tags/release-58-1/source/data/coll/se.txt and it looks like the letter 'Ǥ' (G with stroke) is missing from the patch. Is this intentional? (According to https://en.wikipedia.org/wiki/G_with_stroke it "has been used to write Northern Sami (in an old orthography)").

(Added 'Ǥ' after 'Ǧ' to the patch, per IRC conversation with Jon.)

For reference, we are probably planning to upgrade to libicu 57 sometime in the next several months. See T177498.

Ü and Ű are also missing, although I have no idea if they're actually useful.

Those are secondary differences though, as indicated by <<:

	                "&y<<ü"
	                "<<<Ü<<ű"
	                "<<<Ű"

This means that the characters 'Y', 'Ü', 'Ű' should be considered the same when sorting, except as a tiebreaker if the words would otherwise be considered identical. They would all be placed under the heading for 'Y'. But our CustomUppercaseCollation has no support for secondary differences, only primary :(
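To make the difference concrete, here is a rough Python sketch of a two-level sort key next to a primary-only one. It models only the y/ü/ű rule quoted above, ignores the tertiary (case) level by lowercasing, and uses made-up sample strings; none of this is ICU or MediaWiki code.

```python
# Assumption: only the single rule "&y<<ü<<<Ü<<ű<<<Ű" is modelled.
PRIMARY = {"ü": "y", "ű": "y"}          # same primary weight as y
SECONDARY = {"y": 0, "ü": 1, "ű": 2}    # tiebreaker order: y, then ü, then ű


def two_level_key(word):
    """Primary key first; the secondary key only matters on a primary tie."""
    lower = word.lower()
    primary = [PRIMARY.get(ch, ch) for ch in lower]
    secondary = [SECONDARY.get(ch, 0) for ch in lower]
    return (primary, secondary)


def primary_only_key(word):
    """Every distinct letter is its own primary letter (no secondary level)."""
    return list(word.lower())


words = ["üa", "yb", "ya", "űa"]       # hypothetical sample strings
print(sorted(words, key=two_level_key))     # ü and ű interleave with y
print(sorted(words, key=primary_only_key))  # ü and ű sort after all of y
```

With the two-level key, "üa" and "űa" sort between "ya" and "yb" and would sit under the Y heading; with a primary-only key they end up after everything starting with y.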

@matmarex: I'm glad you remember how to read these files! I always forget.

To be honest I'm not sure what the rest of the syntax, other than <, << and <<<, means ;)

Change 396007 merged by jenkins-bot:
[mediawiki/core@master] Add custom collation for Northern Sami

https://gerrit.wikimedia.org/r/396007

Thanks for the help everyone! If I understand ReleaseTaggerBot correctly, this will be deployed to WMF wikis on December 14, so after that InitialiseSettings can be changed to set $wgCategoryCollation for sewiki to uppercase-se, yeah? If so I can add that to the SWAT immediately after the upgrade, unless that's too soon?

Change 396381 had a related patch set uploaded (by Jon Harald Søby; owner: Jon Harald Søby):
[operations/mediawiki-config@master] Set category collation for sewiki

https://gerrit.wikimedia.org/r/396381

Change 398388 had a related patch set uploaded (by Bartosz Dziewoński; owner: Jon Harald Søby):
[mediawiki/core@wmf/1.31.0-wmf.11] Add custom collation for Northern Sami

https://gerrit.wikimedia.org/r/398388

Change 396381 merged by jenkins-bot:
[operations/mediawiki-config@master] Set category collation for sewiki

https://gerrit.wikimedia.org/r/396381

Mentioned in SAL (#wikimedia-operations) [2017-12-15T00:14:13Z] <RoanKattouw> updateCollation.php finished on sewiki (T181503)

Change 398388 merged by jenkins-bot:
[mediawiki/core@wmf/1.31.0-wmf.11] Add custom collation for Northern Sami

https://gerrit.wikimedia.org/r/398388

jhsoby-WMNO closed this task as Resolved. · Dec 15 2017, 12:35 AM
jhsoby-WMNO removed a project: Patch-For-Review.
jhsoby-WMNO moved this task from In progress to Done on the WMNO-Sami board.