
Test numeric sorting on Swedish Wikipedia
Closed, Resolved · Public · 1 Estimated Story Points

Description

Before we deploy numeric sorting on English Wikipedia, we should test it on a real wiki with actual categories and defaultsort keys. Johan has offered to start a proposal on Swedish Wikipedia to turn on numeric sorting so that it can be more thoroughly tested.

The actual change will be switching svwiki's collation from uca-sv to uca-sv-u-kn.
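To make the expected effect concrete, here is a rough Python sketch of what the `kn` (numeric ordering) option changes. It is only a simplified model, not ICU's implementation: the hypothetical `numeric_key` helper compares digit runs by value and everything else as casefolded text, and it ignores the Swedish tailoring entirely. The sample titles are made up for illustration.

```python
import re

def numeric_key(title):
    """Simplified stand-in for numeric collation (hypothetical helper, not ICU)."""
    key = []
    # re.split with a capturing group puts the digit runs at odd indices.
    for i, part in enumerate(re.split(r"(\d+)", title)):
        if i % 2 == 1:
            key.append((0, int(part), ""))        # digit run: compare by value
        elif part:
            key.append((1, 0, part.casefold()))   # text: compare as a string
    return key

titles = ["10 cc", "9 to 5", "100 Years", "2 Unlimited"]

plain = sorted(titles)                     # character-by-character order
numeric = sorted(titles, key=numeric_key)  # digit runs ordered by value

print(plain)    # ['10 cc', '100 Years', '2 Unlimited', '9 to 5']
print(numeric)  # ['2 Unlimited', '9 to 5', '10 cc', '100 Years']
```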

Event Timeline

Restricted Application added a subscriber: Aklapper.

Will start a discussion on Monday.

Will the number of bot-created articles be a problem? They're categorized as well, after all.

It's apparently not, so the question has been asked.

Discussion on Swedish Wikipedia.

Question that came up: uca-sv-u-kn takes Swedish digit grouping and decimal comma into account, right? Have we specifically tested it?

(Swedish digit grouping differs from English. 100,000.00 would usually be written 100 000,00 in Swedish, occasionally 100.000,00, as opposed to 100,000.00 in English.)

Do we have a nifty solution for articles that have already solved this problem using DEFAULTSORT, e.g. {{STANDARDSORTERING:0011 Freunde}}? Maybe running a script to find articles where the defaultsort key starts with 0/00/000 but otherwise matches the article name? Or will it even affect things at all? The numerical value is still the same.

Or where the defaultsort key adds zeros somewhere else, not necessarily at the beginning of the article name.
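One way such a check could look, as a minimal Python sketch: strip the leading zeros from every digit run in both the title and the sort key, and flag the page if the two then match while the raw strings differ. The function names and sample pairs are hypothetical. A real script would also need to read titles and sort keys from a dump or the API, which is not shown here.

```python
import re

def strip_zero_padding(text):
    """Remove leading zeros from each digit run, keeping at least one digit."""
    # (?<!\d) makes sure we only touch zeros at the start of a digit run,
    # so "1001" is left alone while "0011" becomes "11".
    return re.sub(r"(?<!\d)0+(?=\d)", "", text)

def differs_only_by_zero_padding(title, sortkey):
    """True if the sort key is the title with extra zero padding (or vice versa)."""
    return title != sortkey and strip_zero_padding(title) == strip_zero_padding(sortkey)

# Hypothetical (title, sort key) pairs.
pairs = [
    ("11 Freunde", "0011 Freunde"),    # zeros at the start
    ("Apollo 11", "Apollo 0011"),      # zeros elsewhere in the name
    ("11 Freunde", "Elva Freunde"),    # a genuinely different key
    ("1001 nights", "101 nights"),     # inner zeros must not be stripped
]

flagged = [(t, k) for t, k in pairs if differs_only_by_zero_padding(t, k)]
print(flagged)
```

Pages flagged this way are candidates for review only. Because the numeric value of a zero-padded key is unchanged, such keys may well keep sorting correctly without any edit.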

@Johan: Any time a number is separated with a non-numeric character, UCA collation will treat the separator as a string and the separated pieces as separate numbers. This is true in English, Swedish, and all UCA collations. Separators are considered ambiguous characters and thus not treated as parts of numbers. This can be worked around (in the case of large integers) by removing the separators in DEFAULTSORT keys. So for example, if you had an article entitled "9,999 bottles of beer", you could add a DEFAULTSORT key of "9999 bottles of beer".

@Johan: I'm not aware of any workaround for decimals though.
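The splitting behaviour described above can be illustrated with a short Python sketch. It only models the idea that each digit run is compared as its own number; it is not ICU code. `suggest_sortkey` is a hypothetical helper that drops a space, period, or comma sitting between a digit and a group of exactly three digits. That assumption suits large integers, but it would also mangle a decimal such as 3,141 or merge two adjacent numbers, so its output would need human review.

```python
import re

def digit_runs(title):
    """Return the numbers a numeric collation would see, one per digit run."""
    return [int(run) for run in re.findall(r"\d+", title)]

def suggest_sortkey(title):
    """Hypothetical helper: remove grouping separators inside large integers."""
    # Match one separator preceded by a digit and followed by exactly 3 digits.
    return re.sub(r"(?<=\d)[ \u00a0.,](?=\d{3}(?!\d))", "", title)

# "9,999" is seen as two numbers, 9 and 999, with a comma in between.
print(digit_runs("9,999 bottles of beer"))   # [9, 999]

# A million is seen as 1, 0, 0 and so sorts before two thousand (2, 0).
print(digit_runs("1,000,000") < digit_runs("2,000"))   # True

# Decimals have the same problem: 3,14 lands after 3,2 because 14 > 2.
print(digit_runs("3,14") > digit_runs("3,2"))   # True

# Workaround for large integers: a DEFAULTSORT key without separators.
print(suggest_sortkey("9,999 bottles of beer"))   # 9999 bottles of beer
print(suggest_sortkey("100 000 kronor"))          # 100000 kronor
```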

@Johan: Let's try to wrap up that discussion next week if possible.

Change 304262 had a related patch set uploaded (by Kaldari):
Updating $tailoringFirstLetters for Swedish

https://gerrit.wikimedia.org/r/304262

Change 304262 abandoned by Kaldari:
Updating $tailoringFirstLetters for Swedish

Per https://ssl.icu-project.org/trac/browser/icu/trunk/source/data/coll/sv.txt

Reason:
Already in the list :P

https://gerrit.wikimedia.org/r/304262

Posted in the discussion that I'm reading consensus as "the change is fine as long as it doesn't sabotage anything that's working right now". Giving folks a chance to protest, but I don't see any reason anyone would if they haven't so far.

No new posts for a couple of days.

No further protests. I'd say we can carefully go ahead.

Change 306216 had a related patch set uploaded (by Kaldari):
Switching Swedish Wikipedia to uca-sv-u-kn collation

https://gerrit.wikimedia.org/r/306216

@Johan: Deployment is scheduled for 11 a.m. to noon Pacific time today.

OK, I'll mention it on the Village Pump.

kaldari set the point value for this task to 1. (Aug 23 2016, 5:09 PM)

Change 306216 merged by jenkins-bot:
Switching Swedish Wikipedia to uca-sv-u-kn collation

https://gerrit.wikimedia.org/r/306216

Mentioned in SAL [2016-08-23T18:12:13Z] <thcipriani@tin> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:306216|Switching Swedish Wikipedia to uca-sv-u-kn collation (T142113)]] (duration: 00m 58s)

Deployed to Swedish Wikipedia:
https://sv.wikipedia.org/wiki/Kategori:Musikgrupper_med_syskon

Seems to be working well so far. The updateCollation.php script is still running and will probably take a few hours to finish.

No problems reported so far, except that it doesn't work well with separators. But we knew that.