Natural number sorting in category listings
Closed, Resolved · Public · 5 Story Points

Description

Author: michael

Description:
Just like in a book index, category listings should sort numbers by their value, not just as dumb strings of characters.

Example: http://en.wikipedia.org/wiki/Category:Antonov, partial listing:

...
Antonov An-2
Antonov An-218
Antonov An-22
Antonov An-225
Antonov An-24
Antonov An-26
Antonov An-28
Antonov An-3
...

Of course, it should be

...
Antonov An-2
Antonov An-3
Antonov An-22
Antonov An-24
Antonov An-26
Antonov An-28
Antonov An-218
Antonov An-225
...
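
To make the requested ordering concrete, here is a minimal Python sketch of a "natural" sort key. It is only an illustration of the idea, not MediaWiki code; the helper name `natural_key` is made up for this example. It splits each title into text and digit runs and compares the digit runs by numeric value.

```python
import re


def natural_key(title: str) -> list:
    """Build a sort key in which runs of digits compare by numeric value.

    re.split with a capturing group always alternates text, digits, text, ...
    so even positions hold strings and odd positions hold integers, which
    keeps list comparison well defined.
    """
    parts = re.split(r"(\d+)", title)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


titles = [
    "Antonov An-2",
    "Antonov An-218",
    "Antonov An-22",
    "Antonov An-225",
    "Antonov An-24",
    "Antonov An-26",
    "Antonov An-28",
    "Antonov An-3",
]

# Plain string sorting reproduces the order complained about above;
# sorting with natural_key gives the order the report asks for.
string_order = sorted(titles)
natural_order = sorted(titles, key=natural_key)
```

With this key, `natural_order` lists An-2, An-3, An-22, An-24, An-26, An-28, An-218, An-225.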

Details

Reference
bz6948

Related Objects

Arbnos added a subscriber: Arbnos. Apr 13 2016, 11:51 AM

Posted an inquiry on English Wiktionary to see which style of numerical sorting they would prefer: https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2016/March#How_should_numbers_be_sorted_on_Wiktionary.3F

It seems to me that if Wiktionary also prefers natural number sorting, we may just want to switch it on for all wikis rather than overloading $wgCategoryCollation with an extra option. We would still need to put it behind a feature flag though (at least for testing).

Note that the controversy at Wiktionary from years back (so quite likely nobody cares anymore) was not about the numerical sorting, but about the alphabetical UCA sorting that is a prerequisite for the numerical sorting. Specifically, I think they wanted code point order so as not to favour one language over another in their multilingual wiki.

matej_suchanek removed a subscriber: wikibugs-l-list.
Automatik added a subscriber: Automatik. Edited May 10 2016, 9:30 AM

Specifically, I think they wanted code point order so as not to favour one language over another in their multilingual wiki.

That's not quite right. What we need on Wiktionary is in fact the ability to specify a collation per category (see T30397), so that ä can be sorted as a in French but not in Swedish, where it is a full letter at the end of the alphabet.

Ah ok, that makes sense. Per-category collation is more work than the numerical stuff, and (given the small number of people working on collations) it probably will not happen until well after the numerical stuff is ready to go. That should be kept in mind when discussing this feature with the Wiktionaries.

@Bawolff: Where'd you get the suffix "-u-kn"? Is that following some existing convention?

Testing locally gives some weird results with the headers:

1

1 Partridge

5

5 Gold Rings

9

9 Ladies Dancing
10 Lords a Leaping
50 Cent
99 Luftballons
100 meter dash
101 Dalmations
10000 lakes

A

Alvin and the Chipmunks
Ärsenik
Aztec

Basically, every number over 9 gets put under the 9 header. This kind of makes sense in order to have them all sorted correctly, but it would be better if we could somehow put all the titles that start with numbers under a single header, like #.

(...) it would be better if we could somehow put all the titles that start with numbers under a single header, like #.

Please don't. Don't use (a fortiori, don't hardcode) any character that is valid for use in a sortkey.

If anything, have the header use a Mediawiki:category-section-numbers message.

Change 299108 had a related patch set uploaded (by Kaldari):
[WIP] Support for numeric collation

https://gerrit.wikimedia.org/r/299108

kaldari added subscribers: siebrand, Nikerabbit. Edited Jul 15 2016, 4:20 AM

To solve the problem above (T8948#2465078), we'll need to modify IcuCollation::getFirstLetter() to look for "-u-kn" (or whatever the special suffix is) and in the cases where the first letter is a number, return '#' instead (or something like '0–9').
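
As a rough illustration of the proposal above (a Python sketch, not the actual IcuCollation::getFirstLetter() code), the first-letter function can collapse every digit-initial sortkey into one header. The names `first_letter`, `group_by_header` and `NUMBERS_HEADER` are hypothetical, and '#' is used only because it is the candidate under discussion.

```python
from itertools import groupby

NUMBERS_HEADER = "#"  # hypothetical placeholder; the thread also suggests '0–9'


def first_letter(sortkey: str) -> str:
    """Return the section header for a sortkey.

    Any sortkey that starts with a decimal digit goes under one shared header
    instead of under its own leading digit.
    """
    first = sortkey[:1]
    if first.isdecimal():
        return NUMBERS_HEADER
    return first.upper()


def group_by_header(sorted_titles):
    """Group titles (already in collation order) into (header, titles) pairs."""
    return [(header, list(group)) for header, group in groupby(sorted_titles, key=first_letter)]


# Titles are assumed to be in numeric collation order already.
listing = [
    "1 Partridge",
    "5 Gold Rings",
    "9 Ladies Dancing",
    "10 Lords a Leaping",
    "50 Cent",
    "Alvin and the Chipmunks",
    "Aztec",
]
sections = group_by_header(listing)
```

Here `sections` contains one '#' section holding the five numeric titles, followed by an 'A' section, rather than separate 1, 5 and 9 sections.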

@Nikerabbit, @siebrand: Is '#' a good international symbol for 'all numbers' or will we need to localize this as suggested by Danny_B?

Per my understanding you are not actually storing the symbol in the collation field, so allowing localisation of the symbol should be trivial. That said '#' is called the number sign so I expect it to work well for most languages. But on the other hand I would not be surprised if languages such as Chinese for example would have a different way of indicating numbers.

I should have written that the message should contain e.g. the word "Numbers", not a symbol. The point is that # is typically used in sortkeys to take a particular page out of the default page-name sorting (no matter the collation), as # is not a valid character in a page title.

One typical use case I've seen (this is an artificially created example, based on what I remember, to illustrate the principle) is putting the navbox templates aside:

Category:American rappers

#
Template:American rappers

2
2Pac

5
50 Cent

E
Eminem

N
Notorious B.I.G

(...)

So with the numerical sort it should be rather like:

Category:American rappers

#
Template:American rappers

Numbers
2Pac
50 Cent

E
Eminem

N
Notorious B.I.G

(...)

TheDJ added a subscriber: TheDJ. Jul 15 2016, 11:43 AM

Yes, # is already used as a special case by the community indeed.

https://en.wikipedia.org/wiki/Category:All_articles_needing_additional_references
and
https://en.wikipedia.org/wiki/List_of_7th_Heaven_characters

Shows one such usage. There are multiple conventions here actually, * is also often used for this purpose as is the space character.

I don't want to break your current ideas, but what about something like logarithmic scale?
1
1
2
9
10
10
12
21
100
202
303
909
1000
9982
A
Alabama
...

I agree with Danny that it would be better to make it localisable. From what I recall, all Polish paper encyclopaedias I've seen use "0–9" for this kind of heading. An English dictionary I have uses "Numbers that are entries" for the heading.

'#' is not widely used as "number sign" in Polish (the character itself is usually called "crossbars", "grid" or "fence": https://pl.wikipedia.org/wiki/Kratka_(symbol)). I think it would be understood by most Poles (especially if it would only have numbers underneath it, it would be pretty obvious), but it would be a mark of imperfectly localised software ;)

For the record, even if we used '#' as the heading (as first-letter), this would not cause major problems. The order is determined previously, so in the worst case, the entries with '#' as sortkey would appear first under the '#' heading, followed by the numeric entries. I agree this would be also imperfect.

To solve the problem above (T8948#2465078), we'll need to modify IcuCollation::getFirstLetter() to look for "-u-kn" (or whatever the special suffix is) and in the cases where the first letter is a number, return '#' instead (or something like '0–9').

Yeah. If we wanted a localised message, I think we should return some unique value here and handle the localisation in CategoryViewer. It currently does things like $wgContLang->convert( $this->collation->getFirstLetter( $sortkey ) ) and that might not behave correctly when user language is different from content language.
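
A small Python sketch of that split, under stated assumptions: the collation layer returns an opaque sentinel for numeric entries, and the viewer layer turns it into a localised heading in the user's language. All names here (`NUMERIC_SENTINEL`, `MESSAGES`, `first_letter_or_sentinel`, `header_text`) are invented for illustration and are not MediaWiki APIs; the sample headings are taken from suggestions in this thread.

```python
# Hypothetical sentinel that can never be a real first letter.
NUMERIC_SENTINEL = "\x00numeric"

# Illustrative message table keyed by user language (not real MediaWiki messages).
MESSAGES = {
    "en": "Numbers",
    "pl": "0–9",
}


def first_letter_or_sentinel(sortkey: str) -> str:
    """Collation layer: return a language-neutral marker for numeric entries."""
    first = sortkey[:1]
    return NUMERIC_SENTINEL if first.isdecimal() else first.upper()


def header_text(first_letter_value: str, user_lang: str) -> str:
    """Viewer layer: localise the sentinel, pass real letters through unchanged."""
    if first_letter_value == NUMERIC_SENTINEL:
        return MESSAGES.get(user_lang, MESSAGES["en"])
    return first_letter_value


polish_heading = header_text(first_letter_or_sentinel("50 Cent"), "pl")
letter_heading = header_text(first_letter_or_sentinel("Eminem"), "pl")
```

Keeping the sentinel out of any language conversion step is the point of the design: only the viewer decides how the heading is displayed.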

I don't want to break your current ideas, but what about something like logarithmic scale?

I'm not sure if it would work in general. It might be better than a single heading for many cases, but it would be very weird for e.g. lists of years.

'#' isn't ideal even in English; its use to designate "number" is an American English thing, and very uncommon in British English until quite recently (due to the gentle push of American English imperialism). Internationalising the displayed string would be better if possible.

I don't want to break your current ideas, but what about something like logarithmic scale?

I'm not sure if it would work in general. It might be better than a single heading for many cases, but it would be very weird for e.g. lists of years.

Yeah, you're right.

@Bawolff: Where'd you get the suffix "-u-kn"? Is that following some existing convention?

It's from the Unicode collation standard.

I don't want to break your current ideas, but what about something like logarithmic scale?

I have no idea if this is the approach we want to take, but from a technical perspective this would work fine. However, if a title used digit separators (e.g. 100,000 with a comma in it), the sorting might not be right (I'm unsure).
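
The concern about digit separators can be illustrated with a naive digit-run key in Python. This is only a sketch with a made-up helper (`natural_key`) and made-up titles; whether ICU's numeric collation behaves the same way is not established in this thread.

```python
import re


def natural_key(title: str) -> list:
    """Naive natural key: digit runs become integers, other text is case-folded."""
    parts = re.split(r"(\d+)", title)
    return [int(p) if i % 2 else p.casefold() for i, p in enumerate(parts)]


# Hypothetical titles that use a comma as a thousands separator.
titles = ["20,000 leagues", "1,000,000 dollars"]

# The comma ends the first digit run, so only "1" is compared with "20".
ordered = sorted(titles, key=natural_key)
```

With this key, "1,000,000 dollars" sorts before "20,000 leagues", because the separator splits each number into several independent digit runs.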

Dvorapa added a comment. Edited Jul 15 2016, 6:22 PM

Well, then users should remember to add sortkeys ([[Category:People with more than 1,000 USD|1000]]; [[Category:People with more than 1,000,000 USD|1000000]])

I thought the whole point of this task was to eliminate the need for sortkeys :P Personally, I think just having a single header for all numbers is the simplest and most intuitive solution. We just have to figure out the right header.

Sortkeys are used for lots of things other than numbers.

I tend to agree that a single header is probably the best solution.

I thought the whole point of this task was to eliminate the need for sortkeys :P

The idea is to minimize the number of sortkeys needed, and to make them sane where they are necessary. Currently you'd have to write [[Category:People with more than 1,000 USD|0001000]] and [[Category:People with more than 1,000,000 USD|1000000]] respectively (assuming I counted correctly and nobody has even more money). So even if you need a sortkey, being able to use plain numbers would be a big improvement.
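
The difference between padded and plain numeric sortkeys can be shown with a short Python sketch. The 50,000 entry is invented here so that the failure of plain string ordering becomes visible; the other two values come from the example above.

```python
# Amounts used as sortkeys; 50_000 is a hypothetical extra entry.
amounts = [1_000_000, 1_000, 50_000]

# Plain numbers compared as strings: '1000000' lands before '50000'.
plain_keys = sorted(str(a) for a in amounts)

# Zero-padding every key to the same width fixes string ordering,
# but the width has to be chosen in advance for the largest value.
width = max(len(str(a)) for a in amounts)
padded_keys = sorted(str(a).zfill(width) for a in amounts)

# With numeric comparison, plain keys are enough.
numeric_keys = sorted((str(a) for a in amounts), key=int)
```

The padded keys only work as long as no larger value ever shows up, which is the maintenance burden that plain numeric sortkeys would remove.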

Restricted Application added a subscriber: Luke081515. Jul 16 2016, 12:29 PM
kaldari added a comment. Edited Jul 18 2016, 9:39 PM

BTW, I've confirmed that the NUMERIC_COLLATION flag affects Arabic numerals, Eastern Arabic numerals, and Bengali numerals (at the least). It doesn't affect Japanese numerals though. This is probably because Japanese numerals are often used as parts of words, whereas strictly numerical uses are typically written with Arabic numerals, e.g. 3人 (three people).
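
A related observation, offered as a possible contributing factor rather than something confirmed in this thread: Unicode classifies the first three kinds of numerals as decimal digits (general category Nd), while the kanji numerals are ordinary letters (category Lo) that merely carry a numeric value. The Python sketch below only inspects these Unicode properties; it does not call ICU.

```python
import unicodedata

# The digit "three" in each script mentioned above.
samples = {
    "Arabic": "3",
    "Eastern Arabic": "\u0663",
    "Bengali": "\u09e9",
    "Japanese (kanji)": "\u4e09",
}

# Map each script name to (general category, is it a decimal digit?).
properties = {
    name: (unicodedata.category(ch), ch.isdecimal())
    for name, ch in samples.items()
}

for name, (category, is_decimal) in properties.items():
    print(f"{name}: category={category}, decimal digit={is_decimal}")
```

The first three characters report category Nd and count as decimal digits; the kanji character reports Lo and does not, even though `str.isnumeric()` is true for it.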

@Bawolff: I have an implementation written at https://gerrit.wikimedia.org/r/#/c/299108/, but it doesn't seem to work right. When a new article that starts with a number is added to a category, it just gets sorted to the end regardless of where it falls numerically. If I run updateCollation.php, though, it gets sorted correctly. For some reason...

Collation::singleton()->getSortKey( $title->getCategorySortkey( $prefix ) );

... gives a different result from LinksUpdate.php than it does from updateCollation.php for the same page.

Also, I've noticed that when I add a single article to a category and run updateCollation.php, multiple sortkeys get changed (which doesn't happen when LinksUpdate.php runs). Any idea what's going wrong here?

Personally, I think just having a single header for all numbers is the simplest and most intuitive solution. We just have to figure out the right header.

I found three books on my shelf with an index having both first-letter-headlines and entries starting with a number. All three books decided to just use a blank headline for those entries.

So it doesn't seem unreasonable to use a message for the headline that is empty by default but can be configured by wiki sysops to show whatever heading the wiki wants.

@Bawolff: I have an implementation written at https://gerrit.wikimedia.org/r/#/c/299108/, but it doesn't seem to work right. When a new article that starts with a number is added to a category, it just gets sorted to the end regardless of where it falls numerically. If I run updateCollation.php, though, it gets sorted correctly. For some reason...

Collation::singleton()->getSortKey( $title->getCategorySortkey( $prefix ) );

... gives a different result from LinksUpdate.php than it does from updateCollation.php for the same page.

Also, I've noticed that when I add a single article to a category and run updateCollation.php, multiple sortkeys get changed (which doesn't happen when LinksUpdate.php runs). Any idea what's going wrong here?

I see nothing in the patch that could cause this. Make sure that you're using the same version of PHP/HHVM for the website (Special:Version), command line (php --version), and I guess the job queue (not sure how to check that).

I found three books on my shelf with an index having both first-letter-headlines and entries starting with a number. All three books decided to just use a blank headline for those entries.

So using a message for the headline, that is empty by default, but can be configured by wiki sysops to use any desired heading the wiki wants to use doesn't seem unreasonable.

We allow users to set a blank headline by using a space for the DEFAULTSORT, though :/

@matmarex: Thanks for the suggestion! I switched over to Vagrant, which has the versions in sync and everything seems to work great. I'm going to remove the [WIP] tag from the patch.

kaldari set the point value for this task to 5.
kaldari claimed this task.
kaldari moved this task from Ready to Needs Review/Feedback on the Community-Tech-Sprint board.

We allow users to set a blank headline by using a space for the DEFAULTSORT, though :/

Yes, but I don't think that this really prevents using a blank heading for numbers as well. Look at https://de.wikipedia.org/wiki/Kategorie:Roman,_Epik and use your developer tools to replace the # heading by a blank one. I think this looks acceptable as default.

So the numbered ones will be merged with the main articles, am I right? Many wikis have their main articles under a blank heading, and this change could completely break their rules. I'm for # and for the possibility to translate it. The logarithmic scale could also do, but if we want only one heading, this could be the best choice.

kaldari added a comment. Edited Jul 22 2016, 5:48 PM

The number articles would not be merged with the main articles. Articles that use punctuation for their sort keys (including spaces) will still appear before any articles that start with numbers. Even if you changed the header for number articles to be a space (since you'll be able to override the header locally), it would still appear as a separate list, and articles with other types of punctuation (like ?, !, @, #) would appear between them.

kaldari closed this task as "Resolved". Jul 27 2016, 10:52 AM

Patch is merged. Will test on test wiki this week.

Change 299108 merged by jenkins-bot:
Adding support for numeric collation when using UCA collations

https://gerrit.wikimedia.org/r/299108

Change 301380 had a related patch set uploaded (by Bartosz Dziewoński):
Implement NumericUppercaseCollation

https://gerrit.wikimedia.org/r/301380

NumericUppercaseCollation could be an alternative solution to UCA collations – it behaves like the current default 'uppercase' collation (so there is no correct sorting for letters other than A-Z and a-z), with a clever trick for numeric ordering.

I think the UCA numeric collation Kaldari implemented is a much better solution. But in cases where we can't use that one (e.g. because ICU has no support for the given language), where 'uppercase' is otherwise satisfactory, and where the wiki really wants numeric ordering, this one could be useful.
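
The "clever trick" is not spelled out in this thread. One common way to get numeric ordering out of plain string comparison, shown here purely as an assumption about the general technique and not as the contents of the patch, is to prefix each digit run with its length. The function name `length_prefixed_key` is hypothetical.

```python
import re


def length_prefixed_key(title: str) -> str:
    """Uppercase the title and prefix each ASCII digit run with its length.

    A longer number gets a larger two-digit prefix, so plain string comparison
    orders the numbers by value. This sketch breaks for runs over 99 digits.
    """

    def prefix_length(match: re.Match) -> str:
        digits = match.group(0).lstrip("0") or "0"  # ignore leading zeros
        return f"{len(digits):02d}{digits}"

    return re.sub(r"[0-9]+", prefix_length, title.upper())


titles = [
    "Antonov An-2",
    "Antonov An-218",
    "Antonov An-22",
    "Antonov An-225",
    "Antonov An-24",
    "Antonov An-26",
    "Antonov An-28",
    "Antonov An-3",
]
ordered = sorted(titles, key=length_prefixed_key)
```

For example, "An-2" becomes "AN-012" and "An-218" becomes "AN-03218", so the two-digit length decides the order before the digits themselves are compared.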

Change 303001 had a related patch set uploaded (by Kaldari):
Adding release notes about the addition of numeric sorting support

https://gerrit.wikimedia.org/r/303001

Change 303001 merged by jenkins-bot:
Adding release notes about the addition of numeric sorting support

https://gerrit.wikimedia.org/r/303001

Change 301380 merged by jenkins-bot:
Implement NumericUppercaseCollation

https://gerrit.wikimedia.org/r/301380

Johan added a comment. Sep 22 2016, 8:20 PM

Wikimedia wikis that want this can now request it. To do so:

  1. Please start a community discussion – RfC, vote, or however your wiki normally decides these things – to make sure there’s support for it.
  2. Once you’re sure it has support, post on User:DannyH (WMF)’s talk page on Meta with a link to the discussion where you took the decision.

(Translatable instructions.)

Why are we not using the regular process of creating Phabricator tickets like for all other site requests?

@Nikerabbit: Not everyone has Phabricator accounts (or knows how to use Phabricator). We'll still be filing Phabricator tickets for each site request though (on their behalf) so that there's a paper trail.

Why is natural number sorting not working for files on Commons?

TheDJ added a comment. Sun, Jan 29, 7:44 PM

@Dvorapa Did Commons ever request natural number sorting?

@TheDJ Maybe not, maybe should

It was proposed there twice, but both times Fae objected to the idea ("a solution looking for a problem") and everyone else said "meh". As soon as the Commons community asks for the feature, we'll be happy to turn it on there. (original discussion, second discussion).