
Tamil sort order
Closed, Resolved (Public)

Description

Author: sundarbecse

Description:
Showing Tamil consonant sequences

Reference page: https://ta.wikipedia.org/s/4om
On the page above, 'ஜ' follows 'ச'. Characters such as 'ஸ', 'ஷ', 'ஜ', and 'ஹ' are called Grantha characters; they are not part of the basic Tamil alphabet (see https://en.wikipedia.org/wiki/Tamil_script#Basic_consonants). By convention they are placed at the end, i.e. after 'ன'. The first column in the attached image shows the correct sequence. (Image source: Naga. Ilangovan)
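The difference between the current order and the conventional one can be sketched in a few lines of Python. This is only an illustration, not the ICU tailoring: the names `BASIC`, `GRANTHA` and `consonant_key` are made up for the example, and the relative order of the Grantha letters among themselves is an assumption (the report only says they come after 'ன').

```python
# Basic Tamil consonants in their traditional order (18 letters).
BASIC = "கஙசஞடணதநபமயரலவழளறன"
# Grantha consonants named in the report. Their order relative to each
# other is an assumption; the report only places them after 'ன'.
GRANTHA = "ஜஷஸஹ"

# Rank of each consonant in the conventional sequence.
ORDER = {ch: rank for rank, ch in enumerate(BASIC + GRANTHA)}


def consonant_key(ch):
    """Sort key for a single bare consonant (hypothetical helper)."""
    return ORDER[ch]


letters = ["ஜ", "ச", "ன", "க", "ஹ"]

# Plain code point order, which places 'ஜ' right after 'ச'.
by_codepoint = sorted(letters)
# Conventional order, which places the Grantha letters after 'ன'.
by_convention = sorted(letters, key=consonant_key)

print(by_codepoint)   # ['க', 'ச', 'ஜ', 'ன', 'ஹ']
print(by_convention)  # ['க', 'ச', 'ன', 'ஜ', 'ஹ']
```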


Version: unspecified
Severity: enhancement

Attached:

alphabetsequence.jpg (494×328 px, 29 KB)

To fix this, we'll need to change the collation to uca-ta and run the updateCollation.php script on all the Tamil wikis.

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 22 2014, 4:00 AM)
bzimport set Reference to bz73453.
bzimport added a subscriber: Unknown Object (MLST).

I think we need to add collation support for Tamil (not sure if we need upstream libicu stuff), and look at getting the category collation updated on tawiki.

ICU appears to support Tamil (http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/ta.txt), so we only need to add it to the list of supported collations and perhaps adjust first-letter generation. (And then confirm that it actually sorts the words correctly.)

sundarbecse wrote:

Thanks, Sam Reed and Bartosz Dziewoński. Yes, http://bugs.icu-project.org/trac/browser/icu/trunk/source/data/coll/ta.txt is correct for the consonant sequence. We just need to validate the overall sequence of vowels, consonants, and compounds.

nelango5 wrote:

The following two related issues should also be covered by this bug report.

  1. The letter ஃ should sort after all the vowels. Currently it sorts first: ஃ, அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ. The correct order is அ, ஆ, இ, ஈ, உ, ஊ, எ, ஏ, ஐ, ஒ, ஓ, ஔ, ஃ. When all the Tamil letters are sorted, ஃ should appear after ஔ and before க்.
  2. A pure consonant letter should sort before its compound forms. Sorting the letters (ம, ம், மா) should give (ம், ம, மா); the current order, (ம, மா, ம்), is wrong. The impact is easiest to see when sorting a few strings. Given the four strings (கணமொழி, கணமூலி, கணம்புல், கணம்), the current sort order is (கணமூலி, கணமொழி, கணம், கணம்புல்). This is wrong; the correct order is (கணம், கணம்புல், கணமூலி, கணமொழி). (All four strings are proper Tamil words according to the Tamil Lexicon at http://dsalsrv02.uchicago.edu/dictionaries/tamil-lex/.)
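The two rules above can be modelled with a small sort key, shown here as a rough Python sketch rather than as what ICU or MediaWiki actually does. The function `tamil_key` and all constants are hypothetical names for this example. It assumes that vowel signs follow the order of the corresponding vowels and that Grantha consonants come after 'ன'; anything outside the tables simply sorts after Tamil letters by code point.

```python
import unicodedata

# Independent vowels அ ஆ இ ஈ உ ஊ எ ஏ ஐ ஒ ஓ ஔ, written as escapes so that
# two-part letters are unambiguous.
VOWELS = "\u0b85\u0b86\u0b87\u0b88\u0b89\u0b8a\u0b8e\u0b8f\u0b90\u0b92\u0b93\u0b94"
AYTHAM = "\u0b83"  # ஃ, placed after the vowels and before the consonants
# Basic consonants, then Grantha consonants (Grantha order is an assumption).
CONSONANTS = "கஙசஞடணதநபமயரலவழளறன" + "ஜஷஸஹ"
PULLI = "\u0bcd"   # virama: marks a pure consonant such as ம்
# Vowel signs ா ி ீ ு ூ ெ ே ை ொ ோ ௌ, assumed to follow vowel order.
VOWEL_SIGNS = "\u0bbe\u0bbf\u0bc0\u0bc1\u0bc2\u0bc6\u0bc7\u0bc8\u0bca\u0bcb\u0bcc"

BASE_RANK = {ch: i for i, ch in enumerate(VOWELS + AYTHAM + CONSONANTS)}
# Mark rank: pulli (0) < no mark, i.e. inherent 'a' (1) < vowel signs (2..).
MARK_RANK = {PULLI: 0}
MARK_RANK.update({sign: i + 2 for i, sign in enumerate(VOWEL_SIGNS)})
NO_MARK = 1


def tamil_key(text):
    """Return a list of tuples usable as a sort key (illustrative only)."""
    text = unicodedata.normalize("NFC", text)  # compose two-part vowel signs
    key = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in BASE_RANK:
            mark = NO_MARK
            if i + 1 < len(text) and text[i + 1] in MARK_RANK:
                mark = MARK_RANK[text[i + 1]]
                i += 1  # the mark belongs to this letter
            key.append((0, BASE_RANK[ch], mark))
        else:
            # Unknown characters sort after Tamil letters, by code point.
            key.append((1, ord(ch), 0))
        i += 1
    return key


letters = sorted(["ம", "ம்", "மா"], key=tamil_key)
words = sorted(["கணமொழி", "கணமூலி", "கணம்புல்", "கணம்"], key=tamil_key)
print(letters)  # ['ம்', 'ம', 'மா']
print(words)    # ['கணம்', 'கணம்புல்', 'கணமூலி', 'கணமொழி']
```

Treating each consonant plus its mark as one unit is what lets ம் sort ahead of ம; a key built from raw code points cannot express that, because the pulli is a trailing combining character.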

sundarbecse wrote:

(In reply to elan from comment #4)

Yes, the above sequence is the correct order.

Change 290529 had a related patch set uploaded (by Kaldari):
Set Tamil projects to use uca-ta collation

https://gerrit.wikimedia.org/r/290529

Change 290529 merged by jenkins-bot:
Set Tamil projects to use uca-ta collation

https://gerrit.wikimedia.org/r/290529

Mentioned in SAL [2016-05-24T23:20:04Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Set Tamil projects to use uca-ta collation (T75453) (duration: 02m 18s)

Um, folks, you broke it :( All category pages are unviewable now.

pasted_file (946×1 px, 159 KB)

The exception is "MediaWiki does not support ICU locale "ta"", which it indeed doesn't. The 'uca-ta' collation first has to be defined in /includes/collation/IcuCollation.php in MediaWiki core, by adding an entry for 'ta' to the $tailoringFirstLetters variable. I mentioned this here some time ago (T75453#775155).

Also as a reminder for next time, immediately after changing the collation settings, you have to run the updateCollation.php script, or things will be broken.

Change 290685 had a related patch set uploaded (by Dereckson):
Add support for icu-ta collation

https://gerrit.wikimedia.org/r/290685

Change 290686 had a related patch set uploaded (by Dereckson):
Revert "Revert "Set Tamil projects to use uca-ta collation""

https://gerrit.wikimedia.org/r/290686

kaldari claimed this task.

Oops, looks like this isn't actually resolved.

Change 291380 had a related patch set uploaded (by Kaldari):
Adding Tamil to IcuCollation::$tailoringFirstLetters

https://gerrit.wikimedia.org/r/291380

Change 291380 abandoned by Kaldari:
Adding Tamil to IcuCollation::$tailoringFirstLetters

Reason:
Redundant to https://gerrit.wikimedia.org/r/#/c/290685/

https://gerrit.wikimedia.org/r/291380

Change 290685 merged by jenkins-bot:
Add support for icu-ta collation

https://gerrit.wikimedia.org/r/290685

Change 290686 merged by jenkins-bot:
Revert "Revert "Set Tamil projects to use uca-ta collation""

https://gerrit.wikimedia.org/r/290686

Mentioned in SAL [2016-06-09T23:57:31Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Set Tamil projects to use uca-ta collation II (T75453) (duration: 00m 25s)

23:57 logmsgbot: dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set Tamil projects to use uca-ta collation II (T75453) (duration: 00m 25s)
00:07 kaldari: ran mwscript maintenance/updateCollation.php --wiki=tawiktionary --force
00:12 kaldari: ran mwscript maintenance/updateCollation.php --wiki=tawikisource --force
00:31 kaldari: ran mwscript maintenance/updateCollation.php --wiki=tawikiquote --force
00:39 kaldari: ran mwscript maintenance/updateCollation.php --wiki=tawikinews --force
00:40 kaldari: ran mwscript maintenance/updateCollation.php --wiki=tawikibooks --force
00:56 kaldari: ran mwscript maintenance/updateCollation.php --wiki=tawiki --force
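
For anyone repeating this on another language's wikis, the per-wiki commands in the log above could be generated rather than typed by hand. The following Python sketch only builds and prints the command lines; the wiki list and the command shape are copied from the log, while the helper name `update_collation_command` is made up for the example. It does not run anything.

```python
# Tamil wikis, in the order they were processed in the log above.
TAMIL_WIKIS = [
    "tawiktionary",
    "tawikisource",
    "tawikiquote",
    "tawikinews",
    "tawikibooks",
    "tawiki",
]


def update_collation_command(wiki):
    """Build the updateCollation.php command line for one wiki.

    Hypothetical helper; the arguments mirror the log entries above.
    """
    return [
        "mwscript",
        "maintenance/updateCollation.php",
        f"--wiki={wiki}",
        "--force",
    ]


commands = [update_collation_command(wiki) for wiki in TAMIL_WIKIS]

# Print the commands for review instead of executing them.
for cmd in commands:
    print(" ".join(cmd))
```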