Switch German Wikipedia to uca-de category collation
Closed, Declined (Public)

Description

Per the TCB survey, we should switch the German Wikipedia to uca-de category collation once T58041 is resolved.

This will be done by changing the $wgCategoryCollation variable to 'uca-de' in InitialiseSettings.php and running the maintenance script to rebuild the sort keys.

Event Timeline

kaldari moved this task from New & TBD Tickets to Bug backlog on the Community-Tech board.

@kaldari T125774 took only 3 hours, so waiting for T58041 to be resolved would cost us more time managing this task than simply launching the script and coming back a few hours later to check that everything worked.

@Dereckson: True, but I'm less worried about our time than about the fact that community members are more likely to notice inconsistent sorting if it takes several hours to complete. The old script re-sorts everything at once rather than one category at a time, so there will be inconsistent sorting all over the place until it has finished running. Personally, I would prefer to wait for T58041, but others may agree with you.

Perhaps we could mitigate this drawback by deploying the patch during an evening SWAT window, at 1 am in Germany, to minimize the impact on readers.

Looking at https://de.wikipedia.org/wiki/Wikipedia_Diskussion:Projektneuheiten/Archiv/2014#CategoryCollation, especially the statement "The rule for generating our sortkeys for numbers with a variable amount of digits needs to be adjusted, and all existing tools need to be updated" (translated), it seems to me that some effort from the German community is needed after we run the script. If that is true, we should not run the script without first announcing it on German Wikipedia and giving them some time to prepare.
@Bmueller and @thiemowmde probably know more about the reasoning behind it.

The German community came up with a rule for DEFAULTSORT keys, described in https://de.wikipedia.org/wiki/Hilfe:Kategorien#8._Regel:_Ziffern_am_Lemma-Anfang, that goes like this:

  • #:1
  • #:2
  • #::10
  • #::::2016

This uses the fact that # (U+0023) comes before the digits and letters, and : (U+003A) comes after the digits but before the letters.

With UCA (the German article is much, much longer, by the way) that : will move before the digits, turning this upside down partially (4-digit sort keys first, in numeric order, then 3-digit sort keys, and so on). Possible ways to deal with this:

  • See if it's possible to configure UCA in a way that : stays where it is.
  • Announce this. Change it. Sure, all these numeric sort keys must be changed then, but a lot can be removed completely. In my humble opinion (and most agree with this) UCA is much, much better than what we currently have and totally worth the trouble.
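
The effect described above can be illustrated with a minimal sketch. It does not call ICU or MediaWiki; `punctuation_first_key` is a hypothetical, simplified stand-in for UCA that only models "punctuation sorts before digits, digits before letters". The key `#:::100` is an extra example added here to show the 3-digit case.

```python
def codepoint_key(sortkey):
    """Compare sort keys by raw code point (the current behaviour)."""
    return [ord(ch) for ch in sortkey]


def punctuation_first_key(sortkey):
    """Simplified UCA-like model (an assumption, not real ICU):
    punctuation < digits < letters, code point order within each class."""
    def rank(ch):
        if ch.isdigit():
            char_class = 1
        elif ch.isalpha():
            char_class = 2
        else:
            char_class = 0  # '#', ':' and other punctuation
        return (char_class, ord(ch))
    return [rank(ch) for ch in sortkey]


keys = ["#::10", "#:2", "#::::2016", "#:::100", "#:1"]

old_order = sorted(keys, key=codepoint_key)
new_order = sorted(keys, key=punctuation_first_key)

print(old_order)  # ':' after digits: shorter numbers come first
print(new_order)  # ':' before digits: longer numbers come first
```

Under code point order the colon padding pushes longer numbers to the end, which is what the rule relies on. Once `:` sorts before the digits, the same padding pulls longer numbers to the front.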

See if it's possible to configure UCA in a way that : stays where it is.

I tried pretty much every possible collation setting option, but : still comes before the digits in all cases.

Yeah, we'll need to have a community discussion before we run this on German WP. I think the best conclusion would be to remove the # and : defaultsorts, so that people don't have to worry about adding and fixing them all the time. But it's up to them whether they want to do that or not.

@Bmueller said that she/TCB can run that conversation.

The option you are looking for (in https://ssl.icu-project.org/icu-bin/collation.html) is probably numeric=on and alternate=shifted. However, I'm not sure whether those can be enabled from PHP, so this might be a moot point.

Actually, you can trigger that option from PHP. So the only downside is that all punctuation would essentially be ignored (or, alternatively, used only as a tie-breaker). It's unclear whether that would be a negative in practice.

@Bmueller: Any update from the German community? We're ready when they are.

Since an exclamation mark at the start of a sortkey is often used to place lists at the beginning of a category, ignoring punctuation would be a negative. So it's not possible to use alternate=shifted, though it definitely is an interesting option otherwise.
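
To make the objection concrete, here is a rough sketch of why ignoring punctuation breaks the leading-`!` convention. It is a simplified model rather than ICU: `non_ignorable_key` and `shifted_like_key` are hypothetical helpers, the page titles are invented, and the tie-breaker level that real alternate=shifted collation applies to ignored characters is not modelled.

```python
import unicodedata


def is_variable(ch):
    """Treat punctuation and separators as 'variable' characters
    (a simplification of the UCA notion)."""
    return unicodedata.category(ch)[0] in ("P", "Z")


def non_ignorable_key(sortkey):
    """Punctuation counts and sorts before digits and letters."""
    def rank(ch):
        if is_variable(ch):
            return (0, ch)
        if ch.isdigit():
            return (1, ch)
        return (2, ch.casefold())
    return [rank(ch) for ch in sortkey]


def shifted_like_key(sortkey):
    """Punctuation and spaces are dropped before comparing."""
    return "".join(ch for ch in sortkey if not is_variable(ch)).casefold()


# Hypothetical sort keys; the leading '!' is meant to pin the list to the top.
sortkeys = ["Berlin", "!Liste der Städte", "Aachen"]

with_punctuation = sorted(sortkeys, key=non_ignorable_key)
without_punctuation = sorted(sortkeys, key=shifted_like_key)

print(with_punctuation)     # the '!' entry stays first
print(without_punctuation)  # the '!' entry is filed under 'L'
```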

Change 301550 had a related patch set uploaded (by Raimond Spekking):
Labs: Set CategoryCollation for dewiki to 'uca-de-u-kn'

https://gerrit.wikimedia.org/r/301550

Here is the pre-test sorting for http://de.wikipedia.beta.wmflabs.org/wiki/Kategorie:UCA-Sortierung:

  • UCA-Sortierung-Hauptartikel

1

  • 1001 Nacht
  • 11 Freunde

2

  • 20.000 Meilen unter dem Meer

4

  • 4 Fäuste für ein Halleluja

5

  • 5 Freunde machen eine Entdeckung

9

  • 99 Luftballons

K

  • Katze 17
  • Katze 17b
  • Katze 5
  • Katze mausi
  • Katze muddy
  • Katze Muezzin
  • Katze Muffin
  • Katze muhse
  • Katze Muse
  • Katze myde
  • Katze müde
  • Katze mühsam

U

  • Untersuchungsausschuss

Ü

  • Überschuldung

Change 301550 merged by jenkins-bot:
Labs: Set CategoryCollation for dewiki to 'uca-de-u-kn'

https://gerrit.wikimedia.org/r/301550

Mentioned in SAL [2016-08-01T23:05:16Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings-labs.php: Labs: Set CategoryCollation for dewiki to 'uca-de-u-kn' (T128806) (duration: 00m 38s)

Here is the post-test sorting at http://de.wikipedia.beta.wmflabs.org/wiki/Kategorie:UCA-Sortierung:

  • UCA-Sortierung-Hauptartikel

0–9

  • 4 Fäuste für ein Halleluja
  • 5 Freunde machen eine Entdeckung
  • 11 Freunde
  • 20.000 Meilen unter dem Meer
  • 99 Luftballons
  • 1001 Nacht

K

  • Katze 5
  • Katze 17
  • Katze 17b
  • Katze mausi
  • Katze muddy
  • Katze müde
  • Katze Muezzin
  • Katze Muffin
  • Katze mühsam
  • Katze muhse
  • Katze Muse
  • Katze myde

U

  • Überschuldung
  • Untersuchungsausschuss
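
The difference between the two listings can be approximated in a short sketch. This is not the ICU collator behind 'uca-de-u-kn'. It is an assumed model in which the old order compares upper-cased titles by code point, and the new order ignores case and diacritics and compares digit runs by numeric value. The helper names are hypothetical, and the main article entry is left out because its sort key is not shown in this task.

```python
import re
import unicodedata


def uppercase_key(title):
    """Approximation of the pre-test order: upper-case, then code points."""
    return title.upper()


def numeric_uca_like_key(title):
    """Approximation of the post-test order: strip diacritics, ignore case,
    and compare runs of digits as integers (digits before letters)."""
    decomposed = unicodedata.normalize("NFD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    parts = [p for p in re.split(r"(\d+)", base.casefold()) if p]
    return [(0, int(p)) if p.isdigit() else (1, p) for p in parts]


titles = [
    "1001 Nacht", "11 Freunde", "20.000 Meilen unter dem Meer",
    "4 Fäuste für ein Halleluja", "5 Freunde machen eine Entdeckung",
    "99 Luftballons", "Katze 17", "Katze 17b", "Katze 5", "Katze mausi",
    "Katze muddy", "Katze Muezzin", "Katze Muffin", "Katze muhse",
    "Katze Muse", "Katze myde", "Katze müde", "Katze mühsam",
    "Untersuchungsausschuss", "Überschuldung",
]

before = sorted(titles, key=uppercase_key)
after = sorted(titles, key=numeric_uca_like_key)

for title in after:
    print(title)
```

For these test titles the model reproduces both listings, but a real collator has further tie-breaking levels (accents, case) that this sketch ignores.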

Is there something blocking this task? The actual configuration change (whether for dewiki only, or for all German projects) would be trivial to prepare if wanted.

@TTO: We just need to get the go-ahead from the German community. I believe @Bmueller was going to start a community discussion about it there, but I'm not sure if she ever got a chance to do that. Basically we just need a public discussion there that endorses the idea before we flip the switch.

I checked and indeed, this is still missing a final announcement at https://de.wikipedia.org/wiki/Hilfe_Diskussion:Kategorien, https://de.wikipedia.org/wiki/Wikipedia_Diskussion:WikiProjekt_Kategorien, and a few other places including the German village pump ("Kurier"). Several discussions with individual German Wikipedians already happened (including users subscribed here). They have been able to test at https://de.wikipedia.beta.wmflabs.org/wiki/Kategorie:UCA-Sortierung. From what I see the mood is very positive, but it should not be changed without said announcement. @Bmueller?

Hi, there already was a public discussion with positive result conducted by dewiki folks (@kaldari, we were talking about that some weeks ago). But people wanted to do some data analysis and check special cases etc. before switching, and this to do list is still open (see here: https://de.wikipedia.org/wiki/Wikipedia:Bots/Anfragen#Sortierschl.C3.BCssel_vereinfachen). They'll poke you once all points from the list are covered and once they're ready to switch.

Thanks for the update @Bmueller. That looks like a long checklist :P

Kaldari, thank you for keeping an eye on this issue and your kind check.
@all:

The migration process is under way at German WP, but at a snail's pace.

  • At the moment, a dump analysis is looking for special-character sort orders currently in use that would be affected, and for other special cases such as aircraft models.
  • Then bot runs are to be prepared, including for the numeric sort key change that is known to be coming; otherwise the old and new modes would not match and would confuse users.
  • When we are ready to migrate, we will arrange a date for the configuration change, run the bot the same day, and make a public announcement some days ahead.

Another issue is that the people working on biographies are going to drop DEFAULTSORT from articles, with the sort key now generated by a template.

  • Some 750,000 DEFAULTSORT entries may then be removed over the coming years.
  • To avoid confusing authors, there should be only one switch covering all sort-order issues, and just one new set of editing rules to adopt.

Entirely new documentation and guideline pages have already been written; only 15% of the traditional ASCII rules will be needed in the future.

I do not expect any progress in 2016; I will check in again around the new year.

@PerfektesChaos and @Bmueller: Where is this at, by the way? It has been half a year since the last check-in.

Urbanecm subscribed.

Mass-declining of all old "Blocked on community consensus" site request tasks. If this is still wanted, please make sure community consensus was reached and if so, please re-open this task and link to the discussion.