
Investigation: Numerical sorting in categories
Closed, Resolved | Public | 3 Estimated Story Points

Description

Investigate what would be involved in implementing numerical sorting in categories, which is likely to be in the Wishlist Survey top 10.

The main ticket: T8948: Natural number sorting in category listings

It's been hanging around since 2006, and there might be a good reason for that.

If this requires adding an extra column to an important table with millions of rows, it might not be feasible to fix.

Also see T32673: Implement central locale-specific, or tailored, sorting framework (tracking)

Event Timeline

DannyH raised the priority of this task from to Needs Triage.
DannyH updated the task description.
DannyH added a project: Community-Tech.
DannyH moved this task to Needs Discussion on the Community-Tech board.
DannyH subscribed.
Niharika set Security to None.
DannyH triaged this task as Medium priority. Dec 10 2015, 6:25 PM

A few notes:

Wikis must use the ICU collation library in order to have numeric collation. Some are already using it. The English wiki didn't want to; we need to find out why.

For wikis that are already using the ICU library, only its configuration (in Collation.php) needs to change. See T8948#1919137.

The existing config variable for turning on ICU collation is $wgCategoryCollation.

For non-ICU wikis, MediaWiki converts the page title to uppercase and then stores it as binary UTF-8. The sorting is based on that stored key.
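To illustrate why that scheme does not sort numbers naturally, here is a minimal Python sketch of the idea. It is not MediaWiki code: the function name `uppercase_sort_key` is hypothetical, and Python's `str.upper()` stands in for whatever uppercasing MediaWiki actually applies.

```python
def uppercase_sort_key(title: str) -> bytes:
    """Hypothetical stand-in for the non-ICU sort key:
    uppercase the title, then compare it as raw UTF-8 bytes."""
    return title.upper().encode("utf-8")


titles = ["Apollo 10", "Apollo 2", "apollo 1", "Äpfel"]

# Byte-wise comparison looks at one character at a time, so "10" sorts
# before "2" (because "1" < "2"), and "Ä" (a multi-byte UTF-8 sequence)
# sorts after every ASCII letter.
sorted_titles = sorted(titles, key=uppercase_sort_key)
print(sorted_titles)
```

With these inputs, "Apollo 10" ends up ahead of "Apollo 2", which is the behaviour the main ticket asks to change.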

Here's a short preliminary investigation:

Most of what we need here is covered by Bartosz's summary here: https://meta.wikimedia.org/wiki/Community_Tech/Numerical_sorting_in_categories/Notes#Message_from_Bartosz.2C_Dec_22

The ICU library does support numerical sorting, as can be tested at https://ssl.icu-project.org/icu-bin/collation.html (turn numeric sorting on, enter different inputs, and click Sort).
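For readers who cannot reach the demo page, the effect of numeric sorting can be approximated in plain Python. This is only a sketch of the general "natural sort" idea, not ICU's algorithm or API: the helper `numeric_sort_key` is hypothetical, handles ASCII digits only, and ignores all locale-specific rules that ICU would apply.

```python
import re


def numeric_sort_key(title: str) -> list:
    """Hypothetical natural-sort key: split the uppercased title into
    alternating text and digit runs, and compare digit runs as integers."""
    parts = re.split(r"([0-9]+)", title.upper())
    # re.split with one capture group puts text at even indexes and
    # digit runs at odd indexes, so like is always compared with like.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


titles = ["Apollo 10", "Apollo 2", "Apollo 1"]
naturally_sorted = sorted(titles, key=numeric_sort_key)
print(naturally_sorted)
```

Here "Apollo 2" sorts before "Apollo 10" because the digit runs are compared as the numbers 2 and 10 rather than character by character.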

Bawolff's comment here: T8948#1919137 describes how this can possibly be implemented.

We'd probably need to tweak the implementation to work with different languages, which might be a bigger problem. RTL languages also need to be kept in mind.

A note from @matmarex on the Meta talk page, about supporting the umlauts:

"This should be entirely possible today (without any further development work), by setting $wgCategoryCollation to uca-de. The only problem could be the updateCollation script taking too long, but dewiki categorylinks table has only ~10M rows too."

Here are the next steps:

  • Improve performance of updateCollation script (T58041) - This is still moving along. There is a patch (https://gerrit.wikimedia.org/r/#/c/272416/), but it requires further testing and investigation, which @jcrespo is working on. There's not a lot we can do on this particular part to help other than shepherding the patch.
  • Figure out if it’s necessary to overload $wgCategoryCollation with an option for numerical sorting. It's possible that all WMF wikis, even Wiktionary, will want to use natural sorting, in which case, we may be able to do something relatively simple, rather than creating a whole new set of collation options. See T8948#2076236.
  • Fix the numerical header bug (T128483).
  • Test the ICU numerical sorting feature on test wiki.
  • Test the ICU numerical sorting feature on a real wiki.
  • Change the default setting for $wgCategoryCollation to use numerical sorting.
kaldari moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.

Wikimedia wikis that want this can now request it. To do so:

  1. Please start a community discussion – RfC, vote, or however your wiki normally decides these things – to make sure there’s support for it.
  2. Once you’re sure it has support, post on User:DannyH (WMF)’s talk page on Meta with a link to the discussion where you took the decision.

(Translatable instructions.)