Ask Wiktionaries if they want natural number sorting in categories
Closed, ResolvedPublic

Description

Do Wiktionaries want natural number sorting in categories? The response from Wikipedians so far has been very positive, but given that Wiktionaries are dictionaries, they might have a different opinion. We should ask them and find out, and make sure the answers don't differ depending on whether they use a non-Latin script.

English Wiktionary has already been asked.

Event Timeline

Johan created this task.Mar 1 2016, 7:06 PM
Restricted Application added subscribers: StudiesWorld, Aklapper.Mar 1 2016, 7:06 PM
Johan triaged this task as Normal priority.Mar 1 2016, 8:10 PM
Johan moved this task from Untriaged to CL, QA, Data analysis backlog on the Community-Tech board.
Johan added a comment.Mar 2 2016, 2:28 PM

I've also asked German and Swedish Wiktionary. Looking at how a few Wiktionaries with non-Latin scripts handle things now.

Johan updated the task description.Mar 2 2016, 11:58 PM
Johan added a comment.Mar 3 2016, 10:18 PM

So far, we've got only one reply on German Wikipedia and one on Swedish, but both are in favour of numerical sorting.

On English Wiktionary, 2 editors have endorsed natural number sorting and none have endorsed lexicographic number sorting.

Johan added a comment.EditedMar 3 2016, 11:41 PM

So at least Germanic languages written with the Latin alphabet seem to prefer numerical sorting.

Numerical collation can be tested at https://ssl.icu-project.org/icu-bin/collation.html. Just be sure to turn "numeric" on in the settings.
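
For anyone who wants to see the difference without opening the demo, here is a minimal Python sketch of lexicographic versus natural number sorting. It is only an approximation for illustration, not ICU's algorithm; the `natural_key` helper and the page titles are made up for this example.

```python
import re

def natural_key(title):
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd positions are always digit runs; compare those as integers.
    parts = re.split(r"(\d+)", title)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Hypothetical page titles, not taken from any real category.
titles = ["Chapter 10", "Chapter 9", "Chapter 100", "Chapter 11"]

lexicographic = sorted(titles)            # plain character-by-character order
natural = sorted(titles, key=natural_key)  # digit runs compared by value

print(lexicographic)  # ['Chapter 10', 'Chapter 100', 'Chapter 11', 'Chapter 9']
print(natural)        # ['Chapter 9', 'Chapter 10', 'Chapter 11', 'Chapter 100']
```

A real collation also handles case, accents and script order, which this sketch ignores.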

Johan added a comment.Mar 8 2016, 1:13 PM

Do we have any plans to change how numerical sorting is handled for numeral systems other than Arabic numerals?

@Johan: I was hoping that we could switch all the collations to numerical sorting to make it simple. I have no idea, however, how or if that actually affects non-Arabic numerals.

I just tested and confirmed that the numerical sorting option in ICU has no effect on sorting Japanese numerals (which are the only non-Arabic numerals I'm personally familiar with). I'm guessing that means that numerical sorting only affects Arabic numerals.

It seems to affect Eastern Arabic numerals in my testing. Try this:

٠
١
١١
۹
۹۹
١٠٠

0
1
11
9
99
100

I know nothing of Japanese numerals, but from clicking the links to Wiktionary at https://en.wikipedia.org/wiki/Japanese_numerals#Basic_numbering_in_Japanese, it seems the characters have additional meanings, so I guess ICU can't guess that they're used as merely numbers? Or maybe because they're derived from Chinese characters and Chinese collations are a major mess? No idea.
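
One possible explanation, offered as a guess rather than a confirmed reading of ICU's behaviour: as far as I understand, the numeric collation setting is specified in terms of Unicode decimal digits (General_Category Nd), and the kanji numerals are not in that category. The Python sketch below only inspects the Unicode properties with the standard `unicodedata` module; the `describe` helper is hypothetical and says nothing about ICU itself.

```python
import unicodedata

def describe(ch):
    # category: Unicode General_Category ("Nd" means decimal digit)
    # decimal: digit value, or None if the character is not a decimal digit
    # numeric: numeric value if Unicode assigns one, otherwise None
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )

# ASCII 1, an Arabic-Indic digit, an Extended Arabic-Indic digit, a kanji numeral
samples = ["1", "\u0661", "\u06f9", "\u4e00"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), describe(ch))
```

The two Eastern Arabic digits report category `Nd` with a decimal value, while the kanji reports `Lo` (a letter) with no decimal value, which would be consistent with the test results above.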

@matmarex: You're totally right. Thanks for checking that! Next question: which wikis actually use Eastern Arabic numerals (if any)? I poked around on Arabic Wikipedia, but almost everything I found was in "regular" Arabic numerals. For example, https://ar.wikipedia.org/wiki/9_%D9%85%D8%A7%D8%B1%D8%B3 (which has a redirect from https://ar.wikipedia.org/w/index.php?title=%DB%B9_%D9%85%D8%A7%D8%B1%D8%B3&redirect=no). Maybe @Moushira would be a good person to ask :)

Try https://ckb.wikipedia.org/, it's my go-to example ;)

(Side note: MediaWiki supports using different numerals in the interface, but Arabic Wikipedia actually has this feature disabled: https://github.com/wikimedia/operations-mediawiki-config/blob/a711cd58/wmf-config/InitialiseSettings.php#L4375, and it seems they consistently use Western Arabic numerals (the "regular" 0-9) everywhere; I don't know enough about Arabic to tell why :). There are a couple more languages which have different digit transforms defined: https://github.com/wikimedia/mediawiki/search?q=digitTransformTable, and presumably most Wikipedias in these languages use the same digits in articles.)

Johan added a comment.Mar 11 2016, 2:44 AM

A few more replies on German and Swedish Wikipedia now, all in favour of numerical sorting. I've tried to come up with examples of where it would be problematic, but the only potential problem I've come up with is Chinese/Japanese numerals (because of how characters are sorted), and if they're not affected, then that should be less of a problem. I've asked them on their "here we reply to questions in English" pages – both of which seem completely deserted – but no replies so far.

Johan added a comment.Mar 11 2016, 3:22 PM

Questions for non-Latin communities: Arabic, Japanese, Mandarin Chinese. Not sure any of them will reply; they don't have a lot of activity in those places.

Moushira edited subscribers, added: Meno25, OsamaK; removed: StudiesWorld.Mar 11 2016, 4:11 PM
Moushira added a subscriber: Zack.
Johan moved this task from Backlog to Do now on the User-Johan board.Mar 22 2016, 6:11 PM
Johan closed this task as Resolved.Apr 2 2016, 8:29 AM
Johan moved this task from Do now to Archive on the User-Johan board.

@Johan, @kaldari, sorry to jump into a task that's marked resolved, but what's the current thinking on Eastern Arabic numerals? From playing around with the collation demo site, it looks like the standard numerical algorithm works correctly, with @matmarex's example producing the following output:

٠
0
١
1
۹
9
١١
11
۹۹
99
١٠٠
100
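
The same interleaving can be reproduced with a small Python sketch, assuming each entry is compared purely by its numeric value. This is not ICU: Python's `int()` happens to accept any Unicode decimal digits, and the order within each equal-valued pair below comes only from Python's stable sort and the input order, not from any collation rule.

```python
# Entries in the order they were given in the earlier example.
entries = ["٠", "١", "١١", "۹", "۹۹", "١٠٠", "0", "1", "11", "9", "99", "100"]

by_code_point = sorted(entries)       # all ASCII digits sort before Eastern Arabic ones
by_value = sorted(entries, key=int)   # both digit systems interleaved by value

for entry in by_value:
    print(entry)
```

Sorting by code point instead keeps the two digit systems in separate blocks, each in lexicographic order.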

I'm not a native Arabic speaker, but I'm fairly proficient, and I'd expect that most Arabic speakers are familiar with both systems (with the standard Arabic numerals being somewhat more common in modern writing—that's why the Arabic Wikipedia uses them). So while they'd be surprised to see one publication using both systems, they'd see 100 and ١٠٠ as the same number and I expect they wouldn't be surprised to see them sorted together in pure numerical order.

If you got confirmation from a native speaker that it's wrong, ignore me, but I wouldn't assume that it's wrong just because it sorts both systems together.