
sorting order in categories for sv.wikisource
Closed, ResolvedPublic

Description

Author: la.vallen

Description:
It looks like you have solved bug 45446 for sv.wikipedia.

On sv.wikisource, we would like to have the same feature, but with a small change, if possible. We are mainly dealing with older texts, so an older sorting order is more appropriate. The difference from Wikipedia should be that the letter 'W' is regarded as another way of writing the letter 'V'; the two should have the same priority in the sorting order. 'Wallenberg' should be listed before 'Vennerström' under the label 'V'.

Community talk: "sv.wikisource:Wikisource:Mötesplatsen#Sorteringsordning i kategorier"
Two users have agreed so far, which is all of the active sysops.

See the article "W" on sv.wikipedia for details of why we want this solution. The letter W officially became a part of the Swedish alphabet as late as 2006. On Wikisource, we are dealing mainly with texts from the 19th century.


Version: wmf-deployment
Severity: enhancement

Details

Reference
bz46058

Event Timeline

bzimport raised the priority of this task from to Normal. Nov 22 2014, 1:13 AM
bzimport set Reference to bz46058.
bzimport added a subscriber: Unknown Object (MLST).
bzimport created this task. Mar 13 2013, 6:47 AM

I'll have to look into it; not sure if it's possible, and if it is, I'll have to figure out how to configure it.

Also, there is a patch underway to make it possible to set different collations on a per-category basis; see bug 44667 (it's a part of that branch). You might want to wait until it's available, to be able to set V=W for Swedish-language categories, and "normal" sorting for the rest.

ICU seems to support making custom collations at run time (based on the docs). However, PHP's intl extension doesn't seem to expose this. So we are probably left with hacking it on top (i.e. turning w into v before feeding it to ICU), or somehow getting upstream to make a historical sv collation (they already have two sv collations, reformed and normal); however, I imagine that would be a difficult process. I suppose a third option is getting PHP upstream to expose the custom collation making methods.
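A minimal sketch of the "hack on top" option, in Python for illustration only. `base_sort_key` is a hypothetical stand-in for whatever collator produces the real sort key (ICU, in the setup discussed here); only the W-to-V folding step is the point.

```python
def base_sort_key(text: str) -> str:
    # Hypothetical stand-in for the real collator (assumption: the actual
    # setup would ask ICU for a sort key here instead).
    return text.casefold()


# Translation table that rewrites W/w as V/v before collation.
_FOLD_W = str.maketrans({"w": "v", "W": "V"})


def folded_sort_key(text: str) -> str:
    """Fold W into V, then hand the result to the underlying collator."""
    return base_sort_key(text.translate(_FOLD_W))


names = ["Vennerström", "Wallenberg", "Vallenberg", "Xerxes"]
sorted_names = sorted(names, key=folded_sort_key)
print(sorted_names)
```

With this approach "Wallenberg" and "Vallenberg" produce the same key, so their relative order depends only on the input order. That is the limitation of folding: V and W become fully identical rather than tied on one level only.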

To clarify, when you say sort the same way, do you mean totally identical, or just primary identical? The sort algorithm has three levels. We check the primary level first (i.e. different letters: A vs B). If there is a tie on that level for all letters, we move on to check accents (roughly). If there is a tie again, we move on to checking case distinction. In your case it sounds like you would want V and W to be the same on the primary level but different on the secondary level. Is that the case? (This might be a moot point, since the hack-on-top solution would only allow making them identical.)
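The three-level comparison described above can be sketched roughly as follows. This is a toy model in Python, not ICU's actual algorithm: the weights in `char_weights` are invented for illustration, with W given the same primary weight as V and a different secondary weight.

```python
def char_weights(ch: str) -> tuple:
    """Return toy (primary, secondary, tertiary) weights for one character."""
    lower = ch.lower()
    tertiary = 0 if ch == lower else 1  # lowercase sorts before uppercase
    if lower == "w":
        # Assumption for this sketch: w shares v's primary weight and
        # differs only on the secondary level.
        return (ord("v"), 1, tertiary)
    return (ord(lower), 0, tertiary)


def sort_key(text: str) -> tuple:
    """Compare all primary weights first, then all secondary, then tertiary."""
    weights = [char_weights(c) for c in text]
    return (
        tuple(w[0] for w in weights),
        tuple(w[1] for w in weights),
        tuple(w[2] for w in weights),
    )


names = ["Wallenberg", "Vennerström", "Vallenberg", "vallenberg"]
ordered = sorted(names, key=sort_key)
print(ordered)
```

Because the whole primary level is compared before any secondary weight is looked at, "Wallenberg" still lands before "Vennerström", while "Vallenberg" is placed ahead of "Wallenberg" by the secondary tie-break.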

-shell. Needs new code written before shell can do anything.

la.vallen wrote:

(In reply to comment #3)

ICU seems to support making custom collations at run time (based on the docs). However, PHP's intl extension doesn't seem to expose this. So we are probably left with hacking it on top (i.e. turning w into v before feeding it to ICU), or somehow getting upstream to make a historical sv collation (they already have two sv collations, reformed and normal); however, I imagine that would be a difficult process. I suppose a third option is getting PHP upstream to expose the custom collation making methods.

To clarify, when you say sort the same way, do you mean totally identical, or just primary identical? The sort algorithm has three levels. We check the primary level first (i.e. different letters: A vs B). If there is a tie on that level for all letters, we move on to check accents (roughly). If there is a tie again, we move on to checking case distinction. In your case it sounds like you would want V and W to be the same on the primary level but different on the secondary level. Is that the case? (This might be a moot point, since the hack-on-top solution would only allow making them identical.)

If there are two pages with the defaultsort "Vallenberg" and "Wallenberg", I think "V" should be sorted before "W" as if "W" was a diacritic of "V", but it is not critical. Earlier "W" was just regarded as another way of writing "V". The letter "W" was almost only used in names and foreign words. Then somebody invented 'World Wide Web', and you know the rest of the story better than me...

la.vallen wrote:

(In reply to comment #2)

Also, there is a patch underway to make it possible to set different collations on a per-category basis; see bug 44667 (it's a part of that branch). You might want to wait until it's available, to be able to set V=W for Swedish-language categories, and "normal" sorting for the rest.

It could be a good idea, but it is not essential. I am thinking of, for example, V=W for content namespaces, but normal Swedish settings for other namespaces, like User: and Project:.

la.vallen wrote:

Can you meanwhile fix the ABC...ZÄÅÖ problem (to ÅÄÖ) like you did for sv.wikipedia?

(In reply to comment #7)

Can you meanwhile fix the ABC...ZÄÅÖ problem (to ÅÄÖ) like you did for sv.wikipedia?

If you mean in exactly the same way as for sv.wikipedia (including the accented letters behavior change), then yes, but this is currently blocked by bug 46036.

(In reply to comment #3)

ICU seems to support making custom collations at run time (based on the docs). However, PHP's intl extension doesn't seem to expose this. So we are probably left with hacking it on top (i.e. turning w into v before feeding it to ICU), or somehow getting upstream to make a historical sv collation (they already have two sv collations, reformed and normal); however, I imagine that would be a difficult process. I suppose a third option is getting PHP upstream to expose the custom collation making methods.

To clarify, when you say sort the same way, do you mean totally identical, or just primary identical? The sort algorithm has three levels. We check the primary level first (i.e. different letters: A vs B). If there is a tie on that level for all letters, we move on to check accents (roughly). If there is a tie again, we move on to checking case distinction. In your case it sounds like you would want V and W to be the same on the primary level but different on the secondary level. Is that the case? (This might be a moot point, since the hack-on-top solution would only allow making them identical.)

I'm sorry, I made a mistake looking at the available collations. intl supports a "standard" collation (vs "reformed" which is what sv.wikipedia is using). The standard collation has rules:

"&D<<đ<<<Đ<<ð<<<Ð"
"&t<<<þ/h"
"&T<<<Þ/H"
"&v<<<V<<w<<<W"
"&Y<<ü<<<Ü<<ű<<<Ű"
"&[before 1]ǀ<å<<<Å<ä<<<Ä<<æ<<<Æ<<ę<<<Ę<ö<<<Ö<<ø<<<Ø<<ő<<<Ő<<œ<<<Œ<"
"<ô<<<Ô"

This means that V would differ from W only at the secondary level, which is what you want. (The collation can be triggered with the locale name sv@collation=standard. In theory, I thought sv-u-co-standard should also trigger it, but it doesn't seem to...)
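To make the rule notation concrete, here is a rough Python sketch that reads the simplest form of such a rule, using "&v<<<V<<w<<<W" from the list above. It is not ICU's parser: `parse_rule` is a hypothetical helper that only understands a reset (&) followed by <, << and <<< steps, and does not handle expansions such as "/h" or "[before 1]".

```python
import re


def parse_rule(rule: str) -> dict:
    """Map each item in a simple tailoring rule to toy (primary, secondary, tertiary) offsets."""
    match = re.match(r"&([^<]+)(.*)", rule)
    if match is None:
        raise ValueError("rule must start with a reset, e.g. '&v'")
    reset, rest = match.groups()
    primary = secondary = tertiary = 0
    offsets = {reset: (primary, secondary, tertiary)}
    for op, item in re.findall(r"(<{1,3})([^<]+)", rest):
        if op == "<":      # primary difference: a different letter
            primary, secondary, tertiary = primary + 1, 0, 0
        elif op == "<<":   # secondary difference: accent-like
            secondary, tertiary = secondary + 1, 0
        else:              # "<<<": tertiary difference, case-like
            tertiary += 1
        offsets[item] = (primary, secondary, tertiary)
    return offsets


offsets = parse_rule("&v<<<V<<w<<<W")
for item, levels in offsets.items():
    print(item, levels)
```

All four items keep the same primary offset, so v, V, w and W group under one letter; w and W differ from v and V only from the secondary level down.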

For the record, the change in MW (it would still need a WMF config change to enable): Gerrit change 55498

Gerrit change #55498 has been merged, so this should now be possible to do.

Change 75351 had a related patch set uploaded by Reedy:
Category sorting order for sv.wikisource

https://gerrit.wikimedia.org/r/75351

Change 75351 merged by jenkins-bot:
Category sorting order for sv.wikisource

https://gerrit.wikimedia.org/r/75351

Reedy added a comment. Jul 23 2013, 6:11 PM

reedy@tin:/a/common/php-1.22wmf11$ mwscript maintenance/updateCollation.php --wiki=svwikisource --previous-collation=uppercase
Fixing collation for 82831 rows.
Selecting next 10000 rows... processing...10000 done.
Selecting next 10000 rows... processing...20000 done.
Selecting next 10000 rows... processing...30000 done.
Selecting next 10000 rows... processing...40000 done.
Selecting next 10000 rows... processing...50000 done.
Selecting next 10000 rows... processing...60000 done.
Selecting next 10000 rows... processing...70000 done.
Selecting next 10000 rows... processing...80000 done.
Selecting next 10000 rows... processing...82831 done.
82831 rows processed