
Changing the alphabetical sorting (collation) @ ba.wikipedia.org
Closed, Resolved, Public

Description

Hello. The Bashkir Wikipedia uses incorrect alphabetical sorting. The project community supported the change on this page: https://ba.wikipedia.org/wiki/Википедия:Тауыш_биреүҙәр/Башҡорт_алфавиты_буйынса_категориялаштырыу
The correct sorting of the Bashkir (Cyrillic) alphabet is as follows: А Б В Г Ғ Д Ҙ Е Ё Ж З И Й К Ҡ Л М Н Ң О Ө П Р С Ҫ Т У Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы Ь Э Ә Ю Я
This also applies to other projects in the Bashkir language. Thanks.


See also: Upstream ticket http://unicode.org/cldr/trac/ticket/10195

Event Timeline

I agree this must be done, and should also be submitted upstream to CLDR.

The suggested order is probably OK, but I cross-checked it with the Academic Dictionary of the Bashkir Language (Ufa 2011):

20170413_155551.jpg (3×4 px, 3 MB)

... and I have a few questions:

  • Should Й be between И and К?
  • It looks like Ң and Ҫ can never be at the beginning of a word. It's probably not an issue, and they should be sorted as the initial task description suggests, but I'm checking just in case :)
  • Ъ, Ы, and Ь are not mentioned at all. Judging by the image in the dictionary and by Wikipedia articles such as Ыҡ, Ы can even appear at the beginning of a word. I guess they should be between Щ and Э. Again, just checking.

Amire80, thanks for your comment and your attention.
Answers from a native Bashkir speaker (User:З. ӘЙЛЕ):
“1. Й: Yes, it should, because some names begin with the letter Й.
2. Yes, you're right.
3. Yes, there are a few words beginning with Ы, and even a couple of surnames. If we need to set the sorting strictly according to the Bashkir alphabet, then let's do it. Surely, some letters will be “empty”, but I think that's okay.”

OK, this clarifies everything I need to know, and I support doing this of course, but I wonder how exactly. Maybe @Nemo_bis, @matmarex, or @Nikerabbit will know the precise process:

  • Eventually this should be in ICU. But how does it get there? If I submit this as a ticket to CLDR, will it eventually get also to ICU?
  • Until it gets into ICU, can we change it locally in MediaWiki?

I couldn't find the answers in mediawiki.org. It should be documented in https://www.mediawiki.org/wiki/Localisation .

Amire80 renamed this task from Changing the alphabetical sorting @ ba.wikipedia.org to Changing the alphabetical sorting (collation) @ ba.wikipedia.org. Apr 14 2017, 9:57 AM

(Added the word "collation" to the task description so it would be easier to find.)

I'm not really that familiar with how the process works. I only worked on implementing MediaWiki support for languages that were already supported by ICU.

Looking at commenters on a similar task T48235: Kurdish Wikipedia: Alphabetical order in the categories (collation), @Bawolff or @kaldari might be able to help.

We can't easily add collations unsupported by ICU in MediaWiki. The best thing we could do is to use a collation for a similar language instead, if the alphabet is close enough.

Thanks @matmarex. I see that the Kurdish issue, which you cited, involves a CLDR ticket as I thought, so I submitted one for Bashkir: http://unicode.org/cldr/trac/ticket/10195

Suggestions about anything else that can be done to make this faster or better are welcome :)

As far as I remember, there was a problem with sorting of the letter Ё in Russian Wikipedia, but it was a long time ago and I don't know how this problem was fixed.

Yes, getting it into CLDR is the first step. Be prepared for it to take a very long time before the change reaches Wikimedia.

For the CLDR ticket, it may help to clarify whether there are any secondary differences in the sorting, i.e. things that only come into play in the event of a tie, e.g. diacritics. It would also be good to specify which things in the ordering are only tertiary differences (e.g. case differences). Specifying the proposed sort order in the LDML tailoring notation would probably be ideal (http://www.unicode.org/reports/tr35/tr35-collation.html#Rules ). [I'm guessing here. I'm not associated with CLDR, so I really have no idea what they actually want. This is just a guess.]

We can try to hack something together in the meantime for MediaWiki (unfortunately the PHP bindings to libicu don't support tailoring, so the easiest way of doing that is impossible, but we can still replace letters and things).
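The "replace letters" idea can be illustrated with a rough Python sketch. This is not the code from the MediaWiki patch; it only shows the general technique of turning each letter into its rank in the requested alphabet so that an ordinary comparison yields Bashkir order. The names BASHKIR_ALPHABET, RANK and bashkir_sort_key are hypothetical, and the sample titles are made up for illustration.

```python
# Alphabet order copied from the task description.
BASHKIR_ALPHABET = "А Б В Г Ғ Д Ҙ Е Ё Ж З И Й К Ҡ Л М Н Ң О Ө П Р С Ҫ Т У Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы Ь Э Ә Ю Я".split()

# Hypothetical lookup table: letter -> position in the Bashkir alphabet.
RANK = {letter: index for index, letter in enumerate(BASHKIR_ALPHABET)}


def bashkir_sort_key(text: str) -> list:
    """Build a sort key by replacing each letter with its alphabet rank.

    Case is folded with upper(), similar in spirit to an "uppercase"
    collation. Characters outside the alphabet (digits, spaces, Latin
    letters) keep their code point and sort before all Bashkir letters;
    that placement is an assumption of this sketch.
    """
    key = []
    for ch in text.upper():
        if ch in RANK:
            key.append((1, RANK[ch]))
        else:
            key.append((0, ord(ch)))
    return key


# Sample strings, chosen only to show the difference from code point order.
titles = ["Яңы", "Әҙәбиәт", "Эш"]
by_codepoint = sorted(titles)
by_alphabet = sorted(titles, key=bashkir_sort_key)
print(by_codepoint)
print(by_alphabet)
```

With plain code point sorting, Ә lands after Я because it lives in a later Unicode block; with the rank-based key it sorts between Э and Ю as requested in the task.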

I couldn't find the answers in mediawiki.org. It should be documented in https://www.mediawiki.org/wiki/Localisation .

Tips to get a locale added to CLDR are available at https://translatewiki.net/wiki/CLDR .

Change 350792 had a related patch set uploaded (by Brian Wolff; owner: Brian Wolff):
[mediawiki/core@master] Add collation for Bashkir (ba)

https://gerrit.wikimedia.org/r/350792

Change 350792 merged by jenkins-bot:
[mediawiki/core@master] Add collation for Bashkir (ba)

https://gerrit.wikimedia.org/r/350792

Now that we've written the workaround, we need a configuration change to set $wgCategoryCollation = 'uppercase-ba' for the Bashkir Wikipedia, and confirmation from the community that it behaves like they expect.

Change 353099 had a related patch set uploaded (by Amire80; owner: Amire80):
[operations/mediawiki-config@master] Set collation for Bashkir wikis to uppercase-ba

https://gerrit.wikimedia.org/r/353099

Change 353099 merged by jenkins-bot:
[operations/mediawiki-config@master] Set collation for Bashkir wikis to uppercase-ba

https://gerrit.wikimedia.org/r/353099

Mentioned in SAL (#wikimedia-operations) [2017-06-12T13:13:50Z] <hashar@tin> Synchronized wmf-config/InitialiseSettings.php: Set collation for Bashkir wikis to uppercase-ba - T162823 (duration: 00m 41s)

Amire80 claimed this task.
Amire80 removed a project: Patch-For-Review.

Done, deployed, and tested!

Thanks to @Bawolff for making this. It appears to work well, and it may become a pioneering solution that can be applied to more languages.

hashar subscribed.

So fawikibooks is all fine:

[bawikibooks]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation;
+--------------+----------+
| cl_collation | count(*) |
+--------------+----------+
| uppercase-ba |        4 |
+--------------+----------+

But fawiki has 9 left over links:

[bawiki]> SELECT cl_collation,count(*) FROM categorylinks GROUP BY cl_collation;
+--------------+----------+
| cl_collation | count(*) |
+--------------+----------+
| uppercase    |        9 |
| uppercase-ba |   366532 |
+--------------+----------+
$ mwscript updateCollation.php --wiki=bawiki
Fixing collation for 9 rows.          <------ There are still nine there
Selecting next 100 rows... processing...0 done.
0 rows processed
[bawiki]> SELECT * FROM categorylinks WHERE cl_collation='uppercase' \G
*************************** 1. row ***************************
          cl_from: 122083
            cl_to: Бейеклеге_буйынса_тау_түбәләре
       cl_sortkey: 0
БЕЙЕКЛЕГЕ 1000 МЕТРҒА САҠЛЫ ТҮБӘЛӘР
     cl_timestamp: 2017-06-06 12:32:27
cl_sortkey_prefix: 0
     cl_collation: uppercase
          cl_type: subcat
*************************** 2. row ***************************
          cl_from: 122082
            cl_to: Венгрия_географияһы
       cl_sortkey: ВЕНГРИЯНЫҢ ТАУҘАРЫ
     cl_timestamp: 2017-06-06 12:29:50
cl_sortkey_prefix: 
     cl_collation: uppercase
          cl_type: subcat
*************************** 3. row ***************************
          cl_from: 122082
            cl_to: Википедия:Ссылка_на_категорию_Викисклада_отсутствует_в_Викиданных
       cl_sortkey: ВЕНГРИЯНЫҢ ТАУҘАРЫ
     cl_timestamp: 2017-06-06 12:29:50
cl_sortkey_prefix: 
     cl_collation: uppercase
          cl_type: subcat
*************************** 4. row ***************************
          cl_from: 122082
            cl_to: Википедия:Статьи_с_переопределением_значения_из_Викиданных
       cl_sortkey: ВЕНГРИЯНЫҢ ТАУҘАРЫ
     cl_timestamp: 2017-06-06 12:29:50
cl_sortkey_prefix: 
     cl_collation: uppercase
          cl_type: subcat
*************************** 5. row ***************************
          cl_from: 122082
            cl_to: Европа_тауҙары
       cl_sortkey: ВЕНГРИЯ
ВЕНГРИЯНЫҢ ТАУҘАРЫ
     cl_timestamp: 2017-06-06 12:29:50
cl_sortkey_prefix: Венгрия
     cl_collation: uppercase
          cl_type: subcat
*************************** 6. row ***************************
          cl_from: 122082
            cl_to: Илдәр_буйынса_тауҙар
       cl_sortkey: ВЕНГРИЯ
ВЕНГРИЯНЫҢ ТАУҘАРЫ
     cl_timestamp: 2017-06-06 12:29:50
cl_sortkey_prefix: Венгрия
     cl_collation: uppercase
          cl_type: subcat
*************************** 7. row ***************************
          cl_from: 122102
            cl_to: Йылдар_буйынса_ҡалыптар
       cl_sortkey: ГРЕЦИЯ
ДЕСЯТИЛЕТИЯ В ГРЕЦИИ
     cl_timestamp: 2017-06-07 05:25:39
cl_sortkey_prefix: Греция
     cl_collation: uppercase
          cl_type: page
*************************** 8. row ***************************
          cl_from: 122102
            cl_to: Навигация_ҡалыптары:Греция
       cl_sortkey: ДЕСЯТИЛЕТИЯ В ГРЕЦИИ
     cl_timestamp: 2017-06-07 05:25:39
cl_sortkey_prefix: 
     cl_collation: uppercase
          cl_type: page
*************************** 9. row ***************************
          cl_from: 122102
            cl_to: Навигация_ҡалыптары:Категориялар_өсөн
       cl_sortkey: ДЕСЯТИЛЕТИЯ В ГРЕЦИИ
     cl_timestamp: 2017-06-07 05:25:39
cl_sortkey_prefix: 
     cl_collation: uppercase
          cl_type: page
9 rows in set (0.00 sec)

[bawiki]>

But fawiki has 9 left over links:

I guess you meant bawiki :)

9 rows in set (0.00 sec)

[bawiki]>

Mmmm... is there anything left to do? Does it have to remain open?

Why wasn't it completely done the first time?

I would like someone to figure out why there are 9 entries in categorylinks still having cl_collation='uppercase' (instead of uppercase-ba). The breakdown is:

[bawiki]> SELECT cl_type,cl_collation,count(*) FROM categorylinks GROUP BY cl_type,cl_collation;
+---------+--------------+----------+
| cl_type | cl_collation | count(*) |
+---------+--------------+----------+
| page    | uppercase    |        3 |
| page    | uppercase-ba |   304451 |
| subcat  | uppercase    |        6 |
| subcat  | uppercase-ba |    59154 |
| file    | uppercase-ba |     2944 |
+---------+--------------+----------+
5 rows in set (0.24 sec)
[bawiki]>

Is this query from the labs replica or the actual db? In the past, labs replicas have had replication issues related to DELETEs on the categorylinks table that cause it to retain old rows that aren't really there.

The pages that cl_from points to in those rows are missing (i.e. no page table entry), e.g. https://ba.wikipedia.org/w/index.php?curid=122102&uselang=en gives a badtitle error.

Category pages do an inner join on the page table, so these entries won't be shown in categories.

That is from the production database. Since the pages no longer exist (thank you for the check), I guess they are just artifact entries and we don't have any system to garbage collect them.
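To make the reasoning about the leftover rows concrete, here is a small sketch using an in-memory SQLite database. The schema is heavily simplified (only page_id, page_title, cl_from, cl_to and cl_collation), the sample rows are invented, and the queries are illustrations rather than the ones MediaWiki actually runs. It shows why an inner join hides categorylinks rows whose page no longer exists, and how a left join could list such orphans.

```python
import sqlite3

# Simplified stand-in for the two tables; not the real MediaWiki schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_collation TEXT);
    INSERT INTO page VALUES (1, 'Existing_page');
    -- Row whose page still exists.
    INSERT INTO categorylinks VALUES (1, 'Some_category', 'uppercase-ba');
    -- Orphaned row: no page row with this page_id.
    INSERT INTO categorylinks VALUES (122102, 'Some_category', 'uppercase');
    """
)

# Inner join, as a category listing would do: the orphan is not returned.
visible = conn.execute(
    "SELECT cl_from FROM categorylinks "
    "JOIN page ON page_id = cl_from "
    "WHERE cl_to = 'Some_category'"
).fetchall()

# Left join with a NULL check: returns only rows without a matching page.
orphans = conn.execute(
    "SELECT cl_from, cl_collation FROM categorylinks "
    "LEFT JOIN page ON page_id = cl_from "
    "WHERE page_id IS NULL"
).fetchall()

print(visible)
print(orphans)
```

A query of the second kind would be one way to confirm that all nine remaining 'uppercase' rows are orphans before deciding whether any cleanup is needed.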