Sorting in categories on cswiki
Closed, Resolved (Public)

Description

On cswiki, articles in categories are sorted using the uca-cs (uca-cs-u-kn) Unicode collation.
We want to get a list of the actual order of symbols/letters for our language.

Context: Our wiki has some categorization rules, but they broke when the collation was converted from ASCII to Unicode (I know that was done a while ago, but nobody cared until now). We want to update and improve the rules, but we don't know the order.

(I asked the same question on Meta's Tech Forum before)

Event Timeline

Quiddity subscribed.

I've edited the description, hopefully making it a bit clearer, plus linking to older/related tasks.

We want to get a list of the actual order of symbols/letters for our language.

Unicode includes more than 128,000 characters, so that would be a very long list. The Czech collation is basically the standard UCA collation with the following differences:
https://ssl.icu-project.org/trac/browser/icu/tags/latest/source/data/coll/cs.txt
The best way to see how common Czech characters are actually sorted is to look at a large category on Czech Wikipedia, or to test lists of common characters at https://ssl.icu-project.org/icu-bin/collation.html with the collation set to "cs". I tested with the standard Czech alphabet and got:

  • A
  • Á
  • B
  • C
  • Č
  • D
  • Ď
  • E
  • É
  • Ě
  • F
  • G
  • H
  • Ch
  • I
  • Í
  • J
  • K
  • L
  • M
  • N
  • Ň
  • O
  • Ó
  • P
  • Q
  • R
  • Ř
  • S
  • Š
  • T
  • Ť
  • U
  • Ú
  • Ů
  • V
  • W
  • X
  • Y
  • Ý
  • Z
  • Ž

This is the same as the order given at https://en.wikipedia.org/wiki/Czech_orthography.

@kaldari Yes, it would be a long list, but we want to know which characters sort before the alphabet and which sort after it, so a full-length list would be really useful.

Your alphabet is correct, but it does not match the uca-cs collation, where some characters (such as Ú and Ů) are merged into their counterparts without diacritics (U), as you can see in T135846 (that issue was rejected because the Czech Language Institute recommends this merging).

@Dvorapa: OK, so you're actually interested in which letters are separate in the collation. I assumed from the title you were asking about sort order. The first letter tailorings for Czech are:

'cs' => [ "Č", "Ch", "Ř", "Š", "Ž" ],

This means that Czech uses the same alphabet as English, but also adds "Č", "Ch", "Ř", "Š", and "Ž" as separate letters. Are there any other specific ranges you are interested in? For example, Greek letters or Cyrillic letters? It isn't going to be practical to post the entire character set.
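To make the effect of that tailoring concrete, here is a minimal sketch of how a title might be assigned to a first-letter heading. It is only an illustration using the Python standard library, not the code MediaWiki or ICU actually runs; the names `CS_EXTRA_HEADINGS` and `first_letter_heading` are made up for this example, and the assumption that every other accented letter folds into its base letter is taken from the discussion above.

```python
import unicodedata

# Letters listed in the 'cs' tailoring as separate first-letter headings,
# in addition to the basic Latin A-Z. (Hypothetical constant name.)
CS_EXTRA_HEADINGS = ["Č", "Ch", "Ř", "Š", "Ž"]


def first_letter_heading(title: str) -> str:
    """Return the heading a title would be filed under (illustrative only)."""
    # Work on the composed form so that "C" + combining caron is seen as "Č".
    title = unicodedata.normalize("NFC", title)
    if not title:
        return ""
    # The digraph "Ch" counts as one letter, so check it before single letters.
    if title[:2].upper() == "CH":
        return "Ch"
    first = title[0].upper()
    if first in CS_EXTRA_HEADINGS:
        return first
    # Assumption: any other accented Latin letter is folded into its base
    # letter (for example Ú and Ů end up under U).
    base = unicodedata.normalize("NFD", first)[0]
    return base if "A" <= base <= "Z" else first


for title in ["Úvod", "Ůůů", "Chrudim", "Čáslav", "Ďáblice", "Brno"]:
    print(title, "->", first_letter_heading(title))
```

Under these assumptions, titles starting with Ú or Ů land under U, while Č and Ch keep their own headings, which matches the merging described earlier in this thread.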

No, we know our alphabet and its collation; we just want to know which symbols sort before the alphabet and which sort after it. The entire character set in uca-cs order would be perfect.

For example: until now, we have sorted maintenance categories to the end of a category with the "-" sort key. Since the move from ASCII to Unicode collation, "-" no longer works and sorts at the beginning.
For example: until now, we have sorted portals to the beginning of a category with the "π" sort key. Since the move from ASCII to Unicode collation, "π" no longer works and sorts at the end.
There are many more examples. We want to update these rules, but first we need to know what the options are so that we can discuss them.

All basic punctuation and the digits are sorted before the Czech alphabet:

  • <space>
  • _
  • -
  • ,
  • ;
  • !
  • ?
  • .
  • '
  • "
  • (
  • )
  • [
  • ]
  • {
  • }
  • @
  • *
  • /
  • &
  • #
  • %
  • `
  • ^
  • +
  • <
  • =
  • >
  • |
  • ~
  • $
  • 0
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • A
  • Á
  • B
  • C
  • Č

etc.

I can't list the entire Unicode character set here, but you can find out the sort order for any characters you want by pasting them into https://ssl.icu-project.org/icu-bin/collation.html and clicking the "sort" button.
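If it helps when discussing candidate sort keys, the order posted in this thread can be put into a small lookup table. The sketch below is only that: it hard-codes the symbols listed above (punctuation, digits, and the Czech alphabet from the earlier comment) and knows nothing about any other character, so it is not a substitute for the ICU demo page. The names `RANK`, `order_symbols`, and `sorts_before_alphabet` are invented for this example.

```python
# Order as posted in this thread: punctuation, then digits, then letters.
PUNCTUATION = list(" _-,;!?.'\"()[]{}@*/&#%`^+<=>|~$")
DIGITS = list("0123456789")
ALPHABET = [
    "A", "Á", "B", "C", "Č", "D", "Ď", "E", "É", "Ě", "F", "G", "H", "Ch",
    "I", "Í", "J", "K", "L", "M", "N", "Ň", "O", "Ó", "P", "Q", "R", "Ř",
    "S", "Š", "T", "Ť", "U", "Ú", "Ů", "V", "W", "X", "Y", "Ý", "Z", "Ž",
]

# Position of each listed symbol in the posted order.
RANK = {symbol: i for i, symbol in enumerate(PUNCTUATION + DIGITS + ALPHABET)}


def order_symbols(symbols):
    """Sort symbols by their position in the order posted in this thread."""
    unknown = [s for s in symbols if s not in RANK]
    if unknown:
        # Anything not listed here has to be checked with the ICU demo instead.
        raise KeyError(f"not covered by the posted list: {unknown}")
    return sorted(symbols, key=RANK.__getitem__)


def sorts_before_alphabet(symbol):
    """True if the symbol was listed ahead of the letter A."""
    return RANK[symbol] < RANK["A"]


print(order_symbols(["A", "-", "5", "$", "Ch"]))
print(sorts_before_alphabet("-"))
```

Characters outside the posted list, such as "π", raise a `KeyError` here on purpose, because their position was not given in this thread.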

kaldari claimed this task.