Add support for Bengali to IcuCollation class
Closed, Resolved (Public)

Description

Support for Bengali is available in the ICU libraries in use on Wikimedia servers. However, we don't yet support Bengali in MediaWiki's IcuCollation class.

The collation data for Bengali can be found at https://ssl.icu-project.org/trac/browser/icu/trunk/source/data/coll/bn.txt.

Event Timeline

The collation data for Bengali appears to be pretty complicated. This may require some help from @Bawolff.

kaldari removed the point value for this task. (Oct 25 2016, 9:07 PM)

Looking at this some more, we think that this should be done by a volunteer who knows Bengali. It's too easy to make mistakes if you don't actually know the language and the writing system.

@Aklapper Do we have a tag for tickets that could be picked up by volunteers?

Niharika subscribed.

Yep. :)

@DannyH, I can help with the writing system.

@Bodhisattwa: Basically, we need to figure out how to translate the data at https://ssl.icu-project.org/trac/browser/icu/trunk/source/data/coll/bn.txt into data for IcuCollation.

One complication is that Bengali apparently has two different collation systems: "standard" and "traditional". Do you know what the difference is between these and which one would be more appropriate for Bengali Wikipedia and Wikisource?

@kaldari, both in Bengali WP and WS, the category contents are sorted alphabetically which is absolutely ok. So, the traditional system looks more appropriate than the standard system as it is based on alphabetical sorting.

The Wikipedia article states:

It seems likely that standardization of the alphabet will be greatly influenced by the need to typeset it on computers. The large alphabet can be represented, with a great deal of ingenuity, within the ASCII character set, omitting certain irregular conjuncts. Work has been underway since around 2001 to develop Unicode fonts, and it seems likely that it will split into two variants, traditional and modern

So presumably that means that there are two ways of encoding Bengali in Unicode: a modern "way" with fewer symbols, and a traditional way that preserves the large number of traditional symbols but is annoying to type out.

Edit: I don't have any evidence this is really true, so it quite likely isn't

Looking at the collation files, the difference between "traditional" (প্রথাগত সজ্জাক্রম) and "standard" (আদর্শ বাছাই বিন্যাস) seems to be mostly about how the character ্ (U+09CD BENGALI SIGN VIRAMA) is sorted when it is combined with various other characters.

There are also some differences at the tertiary level with regard to the sorting of ় (U+09BC BENGALI SIGN NUKTA).

The "Traditional" collation is apparently taken from "Biswas: Samsad Bengali-English Dictionary ISBN:8186806865"

Anyway, for the purposes of the first-letters array, I think we need the following to be treated as first letters:

  • For the standard collation:
    • ং (U+0982 BENGALI SIGN ANUSVARA)
    • ঃ (U+0983 BENGALI SIGN VISARGA)
    • ঁ (U+0981 BENGALI SIGN CANDRABINDU)
  • For the traditional collation (I'm not as sure about this, but I'm pretty sure)
    • 37 characters (first three are same as the standard collation): ং ঃ ঁ ক্ খ্ গ্ ঘ্ ঙ্ চ্ ছ্ জ্ ঝ্ ঞ্ ট্ ঠ্ ড্ ঢ্ ণ্ ৎ থ্ দ্ ধ্ ন্ প্ ফ্ ব্ ভ্ ম্ য্ র্ ৰ্ ল্ ৱ্ শ্ ষ্ স্ হ্
    • It would appear that the pre-existing headers for ক, খ, etc get removed because MW thinks they are a "duplicate prefix". e.g. ক is an expansion of ক্ so it gets removed in favour of the base character ক. Why/how this is an expansion, I have no idea.

@Bodhisattwa So keep in mind that the bn.txt file linked above doesn't include all the sorting rules, only the ones that differ from the CLDR rules for all languages. So both are actually alphabetical; a large number of rules are simply omitted (for both). The main difference between traditional and standard is how ঙ্ (for example) is handled. In traditional, ঙ্ is considered a separate letter from ঙ that gets its own header in category pages, and the two are sorted as if they were totally separate letters. In standard, ঙ and ্ are considered separate letters, and ঙ্ is sorted as if it's the letter ঙ followed by the letter ্ (so ঙ্ would go under the header for ঙ in categories).

The symbol ় is also treated slightly differently in some combinations in the traditional collation. (In both collations, however, the ় symbol is only used as a tie-breaker if two words would otherwise sort the same; it's just used as a tie-breaker in a slightly different way in the traditional collation.)

You can test the difference between the two collations at https://ssl.icu-project.org/icu-bin/collation.html (in the drop-down in the upper-left corner, select bn for the standard collation and bn-u-co-trad for "traditional" order).

Change 318260 had a related patch set uploaded (by Brian Wolff):
Add first letter data for bn collation (Standard and Traditional)

https://gerrit.wikimedia.org/r/318260

or if ঙ and ্ are considered separate letters, and ঙ্ is sorted as if it's the letter ঙ followed by the letter ্ (so ঙ্ would go under the header for ঙ in categories and not have its own header).

Yes, this is the option we need. (Traditional)

Thanks for this link. I have tested both systems. The traditional one is the collation system we need.

I think I got confused and misunderstood how this worked. Some of what I described about the "traditional" order method might not be right.

For example, traditional sorts the following list like so:

ক্

  • ক্

ঘ্

  • ঘ্
  • ঘঙ
  • ঘচ
  • ঘ্ঙ
  • ঘ্চ

Here ঘ্ comes before ঘ when it's by itself, but not when there are other letters in the word. In essence, ঘ্ is the base letter, and ঘ is sorted as if it were ঘ্ followed by a letter that sorts very early in the list. This was not what I expected from just reading the bn.txt file. Also, I was wrong about having two sections for ঘ্ and ঘ: in traditional, both are grouped under ঘ্ (in standard, both are grouped under ঘ).
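As a rough illustration of the behaviour described above (not ICU's actual algorithm), the traditional order can be mimicked with a toy sort key in Python. The function `toy_traditional_key`, the `INHERENT` weight, and the use of code points as weights are all assumptions made for this sketch; it only handles plain consonants and the virama.

```python
VIRAMA = "\u09CD"   # ্
INHERENT = 0        # hypothetical weight for the implicit "very early" element

GHA, NGA, CA = "\u0998", "\u0999", "\u099A"   # ঘ, ঙ, চ


def toy_traditional_key(word):
    """Toy key: consonant+virama is the bare base letter; a plain consonant
    is the base letter followed by an element that sorts very early."""
    key = []
    i = 0
    while i < len(word):
        key.append(ord(word[i]))              # weight of the consonant (assumed: code point)
        if i + 1 < len(word) and word[i + 1] == VIRAMA:
            i += 2                            # consume the virama, add nothing extra
        else:
            key.append(INHERENT)              # plain consonant gets the early element
            i += 1
    return key


words = [GHA + VIRAMA + CA, GHA + CA, GHA + VIRAMA, GHA + NGA, GHA + VIRAMA + NGA]
ordered = sorted(words, key=toy_traditional_key)
```

Under these assumptions, `ordered` comes out in the same order as the list above: the bare ঘ্ first, then the words starting with ঘ, then the words starting with ঘ্.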

@Bawolff: Thanks for diving into this! I have to admit it's a bit over my head.

Add 'ড়্‌', 'ঢ়্‌', 'য়্‌' after 'হ্‌' here
i.e. the sequence will be 'শ্', 'ষ্', 'স্', 'হ্', 'ড়্‌', 'ঢ়্‌', 'য়্‌'

Those three letters appear to differ only at the tertiary level (secondary in the standard collation) from the existing letters in the list, ড্, ঢ্ and য্. That means they are treated the same for sorting, unless there is a tie, in which case the difference is used as a tie-breaker.

This would cause things to be messed up if those letters were headers, since pages in sorted order would interleave between the two.

For example, consider the page names: ড়্‌A ড়্‌M ড়্‌Z ড্B ড্Q ড্A. (I'm using Latin letters for the second letter just because I know the alphabetical order of the Latin alphabet. The same principle would apply to pages with entirely Bengali letters.) The algorithm (both in standard and traditional mode) sorts this as: ড্A ড়্‌A ড্B ড়্‌M ড্Q ড়্‌Z. If we had both ড্ and ড়্‌ as headers, then this would make a category like:

ড্

  • ড্A

ড়্‌

  • ড়্‌A

ড্

  • ড্B

ড়্‌

  • ড়্‌M

ড্

  • ড্Q

ড়্‌

  • ড়্‌Z

Which is presumably not what is wanted.
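The interleaving can be reproduced with a small Python sketch. This is a toy model, not ICU or MediaWiki code: `toy_sort_key` simply ignores the nukta at the first level and uses it only to break ties, and the names `category_sections` and `header_both` are made up for the illustration. The toy titles also leave out the zero-width non-joiner that appears in the page names above.

```python
from itertools import groupby

NUKTA = "\u09BC"              # ়
DDA = "\u09A1\u09CD"          # ড্  (letter + virama)
RRA = "\u09A1\u09BC\u09CD"    # ড়্ (letter + nukta + virama)


def toy_sort_key(title):
    """Toy two-level key: the nukta is ignored first and only breaks ties."""
    return (title.replace(NUKTA, ""), NUKTA in title)


def header_both(title):
    """Header choice when ড্ and ড়্ are both allowed as section headers."""
    return RRA if title.startswith(RRA) else DDA


def category_sections(titles, header_for):
    """Sort the titles, then start a new section whenever the header changes."""
    ordered = sorted(titles, key=toy_sort_key)
    return [(h, list(group)) for h, group in groupby(ordered, key=header_for)]


titles = [RRA + "A", RRA + "M", RRA + "Z", DDA + "B", DDA + "Q", DDA + "A"]
two_headers = category_sections(titles, header_both)
one_header = category_sections(titles, lambda title: DDA)
```

With both headers allowed, `two_headers` has six alternating sections, one page each; with a single header, `one_header` is one section containing all six pages.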

Well, 'ড়্‌', 'ঢ়্‌', 'য়্‌' - these three letters are totally different from 'ড্', 'ঢ্‌', 'য্‌' and they have their own place in the Bengali alphabet table. For the above example, the category, ideally should look like this

  • ড্A
  • ড্B
  • ড্Q

  • ড়্‌A
  • ড়্‌M
  • ড়্‌Z

Huh.

Ok. I just tested this with actual MediaWiki, and what you want to happen actually does happen (with the traditional collation). I do not know why the output of https://ssl.icu-project.org/icu-bin/collation.html is different from actual MediaWiki. Maybe future versions of libicu will change their behaviour or something.

/me is rather confused.

Digging into this more:

bn.txt line 570 says:

&ড্<<<ড়্

Which I believe means that ড়্ should come after ড্ when comparing at the tertiary level (i.e. as a tie-breaker in case things are tied at the primary and secondary levels). This seems to have been present in bn.txt for the last 7 years.

This is consistent with how https://ssl.icu-project.org/icu-bin/collation.html behaves, and is not what @Bodhisattwa wants.

However, on my local test wiki (libicu 52.1), ড়্ seems to get expanded at the primary level as if it's ড্ followed by something with a sortkey of \x26\x24 (i.e. ড্ has a sort key \x26\x25\x13 and ড়্ has a sort key \x26\x25\x13\x24). MediaWiki would detect this as a prefixed sortkey, so it would not let you have both of those as category headers. I have no idea why it's being expanded like that. I wonder if this difference in behaviour is related to the Farsi weird sorting issue bug (T139110). Maybe there's just something messed up with the particular compile of libicu we all happen to be using, or something like that.
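To make the "prefixed sortkey" idea concrete, here is a minimal Python sketch using the two sort keys quoted above. It is not MediaWiki's code: the function name `drop_prefixed_headers` is hypothetical, and the choice to drop the letter with the longer key (treating it as the expansion) is an assumption of the sketch.

```python
def drop_prefixed_headers(sort_keys):
    """Keep only letters whose sort key does not extend another letter's key.

    sort_keys maps a candidate header letter to its sort key (bytes).
    """
    kept = {}
    for letter, key in sort_keys.items():
        is_expansion = any(
            key != other_key and key.startswith(other_key)
            for other_key in sort_keys.values()
        )
        if not is_expansion:
            kept[letter] = key
    return kept


sort_keys = {
    "\u09A1\u09CD": b"\x26\x25\x13",              # ড্, key as reported above
    "\u09A1\u09BC\u09CD": b"\x26\x25\x13\x24",    # ড়্, key as reported above
}
headers = drop_prefixed_headers(sort_keys)
```

Because the key for ড়্ starts with the whole key for ড্, only ড্ survives as a header in this sketch.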

My previous example appears to have worked locally mostly by random chance. If you take the pages ড্অ, ড়্অ, ড়্হ, and ড্হ - the sort order ends up being ড্অ, ড়্অ, ড্হ, and ড়্হ, which again alternates the first letter between ড্ and ড়্. Thus it does not provide a solution to this issue.

So anyways, as far as actually fixing this goes, there are two options:

  1. Decide that the way UCA sorts ড়্ as not really being a letter independent from ড্ is acceptable and thus switch bnwiki to the bn-u-kn@collation=traditional collation
  2. Modify the existing "generic" uppercase-based numeric sort algorithm to better support localized numbers (probably something we should do anyway). This would keep letter sorting the same as the old version, where it works on Unicode code points, with no fancy things like making ত্ ৎ sort as the same letter.
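For the second option, a minimal Python sketch of what "code point order for letters, numeric order for localized digits" could look like. This is only an illustration, not the change proposed in Gerrit; `numeric_sort_key` and the sample page titles are made up. It relies on Python's `\d` and `int()` accepting any Unicode decimal digit, including Bengali digits.

```python
import re

# In str patterns, \d matches any Unicode decimal digit, Bengali digits included.
DIGIT_RUN = re.compile(r"(\d+)")


def numeric_sort_key(title):
    """Uppercase the title, then compare digit runs by numeric value and
    everything else by code point."""
    parts = DIGIT_RUN.split(title.upper())
    # With one capture group, digit runs always land on the odd indexes.
    return [int(part) if i % 2 else part for i, part in enumerate(parts)]


titles = ["পাতা ১০", "পাতা ২", "পাতা ১"]   # hypothetical page titles: 10, 2, 1
by_code_point = sorted(titles)
by_number = sorted(titles, key=numeric_sort_key)
```

Plain code point order puts ১০ (10) before ২ (2); with the numeric key the pages come out as 1, 2, 10, while the letters are still compared by code point.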

@Bodhisattwa: Which solution would you prefer?

  1. Modify the existing "generic" uppercase-based numeric sort algorithm to better support localized numbers (probably something we should do anyway). This would keep letter sorting the same as the old version, where it works on Unicode code points, with no fancy things like making ত্ ৎ sort as the same letter.

Frankly speaking, the existing alphabetical sorting system in Bn WP/WS categories is absolutely fine and we are not having problems with it. We just have a problem with the numerical sorting system. So if you can just modify the existing system and make it work for numbers, that would be just fine.

Ok, work towards doing that is at https://gerrit.wikimedia.org/r/#/c/318666/ (Technically that's more T148873 than this bug).

@Bawolff, @kaldari @DannyH, any update regarding the numerical sorting issue?

Change 318260 merged by jenkins-bot:
Add first letter data for bn collation (Standard and Traditional)

https://gerrit.wikimedia.org/r/318260

kaldari claimed this task.
kaldari moved this task from Product backlog to Archive on the Community-Tech board.

@Bodhisattwa: If I use uca-bn collation I get:

  • ড্A
  • ড়্‌M
  • ডZ

If I use uca-bn@collation=traditional I get:

ড্

  • ডZ
  • ড্A
  • ড়্‌M

Reading through your earlier comments, it sounds like neither of those is correct. Is that accurate?

@kaldari, both are incorrect. It should be like,

  • ডZ
  • ড্A

  • ড়্‌M

The issue with ড, ঢ, ব, য and ড়, ঢ়, র, য় has been discussed in T7948.