Bugs with numerical sorting on Bengali
Closed, Resolved | Public | 0.5 Estimated Story Points

Description

On Bengali WP, Category:Years still has "0" and "1":

https://bn.wikipedia.org/wiki/%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80:%E0%A6%AC%E0%A6%9B%E0%A6%B0

There are some other odd things about that category that may be happening because they're using leading 0s in sortkeys (0922, etc.). Hopefully it'll look less weird when that fix goes out. Still, 0 and 1 aren't right.

On bn.wikisource, there's 0-9 as a heading, rather than the Bengali numerals ০-৯:

https://bn.wikisource.org/wiki/%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80:%E0%A6%AC%E0%A6%9B%E0%A6%B0_%E0%A6%85%E0%A6%A8%E0%A7%81%E0%A6%AF%E0%A6%BE%E0%A6%AF%E0%A6%BC%E0%A7%80_%E0%A6%9C%E0%A6%A8%E0%A7%8D%E0%A6%AE

Reported by Bodhisattwa:
https://meta.wikimedia.org/wiki/User_talk:DannyH_(WMF)#Enable_numerical_sorting_on_bn.40wikipedia_and_bn.40wikisource

Event Timeline

@Bodhisattwa, @DannyH: Most of these issues arise because Bengali is not supported by MediaWiki's language-specific collations, so we had to switch Bengali Wikipedia and Wikisource to the generic numeric sorting, which isn't aware of Bengali numerals. To fix these problems, we will need to add support for Bengali to the IcuCollation class, switch the Bengali wikis from numeric to uca-bn-u-kn collation, and regenerate their sort keys.
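
To illustrate why a numeric sort that only knows ASCII digits goes wrong here, the following is a minimal Python sketch. It is not MediaWiki's code (which is PHP); the names `ASCII_NUMBER`, `titles`, `recognized`, and `code_point_order` are made up for this example, and the assumption is simply that the sorter looks for runs of 0-9 and treats everything else as text.

```python
import re

# ASCII-only digit matcher. Note that Python's own \d would also match
# Bengali digits, so the character class is spelled out explicitly here.
ASCII_NUMBER = re.compile(r"[0-9]+")

titles = ["১০", "৯", "২"]  # Bengali numerals for 10, 9 and 2

# None of the titles is recognized as containing a number.
recognized = [t for t in titles if ASCII_NUMBER.search(t)]

# So they fall back to plain code point comparison, like any other text.
code_point_order = sorted(titles)

print(recognized)        # []
print(code_point_order)  # ['১০', '২', '৯'], i.e. 10, 2, 9
```

Because ১ (U+09E7) has a lower code point than ২ (U+09E8), "১০" lands before "২", which is the same kind of misordering a text sort gives for "10" and "2".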

Change 318666 had a related patch set uploaded (by Brian Wolff):
Make NumericUppercaseCollation use localized digit transforms

https://gerrit.wikimedia.org/r/318666

Change 318666 merged by jenkins-bot:
Make NumericUppercaseCollation use localized digit transforms

https://gerrit.wikimedia.org/r/318666

@Bodhisattwa: We now have two solutions to offer you:

  1. Localized digits are now properly supported by the "numeric" collation, which the Bengali wikis are already using. Once the new code is deployed next week, we can simply regenerate the sort keys and the Bengali digits should now sort correctly.
  2. We have added support for Bengali to MediaWiki's UCA collation class. Thus we could also switch the Bengali wikis to use the "uca-bn-u-kn" collation, which is the Bengali version of the Unicode Collation Algorithm with numeric sorting.

Which of these would you prefer? Personally, I would recommend the UCA collation as it should keep you from having to use custom DEFAULTSORT keys in more cases, but if you're happy with the existing sorting, you're welcome to keep it.
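
As a rough illustration of the first option, here is a minimal Python sketch of a numeric sort key that applies a localized digit transform before comparing. It is only an approximation of the idea under stated assumptions, not the actual NumericUppercaseCollation code: `BENGALI_TO_ASCII` and `numeric_sort_key` are hypothetical names, and uppercasing and other details of the real collation are left out.

```python
import re

# Map the Bengali digits ০-৯ onto ASCII 0-9 (hypothetical helper table).
BENGALI_TO_ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

# The capture group makes re.split keep the digit runs in the result.
_DIGIT_RUNS = re.compile(r"([0-9]+)")


def numeric_sort_key(title):
    """Build a key in which digit runs compare by numeric value."""
    normalized = title.translate(BENGALI_TO_ASCII)
    key = []
    # With one capture group, odd indices are digit runs, even ones are text.
    for index, chunk in enumerate(_DIGIT_RUNS.split(normalized)):
        if not chunk:
            continue  # skip empty text chunks around a leading/trailing number
        if index % 2 == 1:
            key.append((0, int(chunk), ""))  # numbers sort before text
        else:
            key.append((1, 0, chunk))
    return key


titles = ["১০", "৯", "২", "0922"]
print(sorted(titles, key=numeric_sort_key))  # ['২', '৯', '১০', '0922']
```

In this sketch a sortkey with a leading zero such as 0922 simply compares as 922, and ASCII and Bengali digits are treated as the same numbers.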

@kaldari, I hope the issue with ড, ঢ, য, ড়, ঢ়, য় as discussed in T148885 will not be a problem with the second option.

@Bodhisattwa: Looks like those may still be a problem. See T148885#2798077. I guess we'll just stick with numeric collation for now.

After the new code is deployed to the Bengali wikis next week, we'll rerun the updateCollation script which should fix the problem.

kaldari set the point value for this task to 0.5. (Nov 17 2016, 6:13 PM)
kaldari moved this task from Needs Discussion to Up Next (May 20-June 3) on the Community-Tech board.

Scripts run. Should be good now.