
Malayalam Font not rendering correctly in Wikidata (due to several ways to enter a Chillu character)
Open, Needs Triage · Public

Description

Problem: Malayalam characters like ർ, ൽ, ൻ are stored and displayed in two different forms, depending on how the user entered them.

Example: https://www.wikidata.org/w/index.php?title=Q13564588&diff=1295032834&oldid=1290915945. Please inspect the code and check for the changes.

Expected output: ഇന്ത്യൻ ചലചിത്ര അഭിനേതാവ്
Actual output: ഇന്ത്യന്‍ ചലചിത്ര അഭിനേതാവ് (I request to inspect the code here too).

I have run a bot on Schoolwiki (a project powered by MediaWiki) to purge the pages. An example edit is:
this

I also think that it has something to do with T33950: Update Malayalam fonts packages.

There are a minimum of 60,000 pages that contain this issue (https://quarry.wmflabs.org/query/47911).

Event Timeline

Reedy updated the task description.
Aklapper changed the task status from Open to Stalled. Nov 1 2020, 3:51 PM

@Adithyak1997: This was an actual edit, so I think this is not only about rendering but about people actually entering different letters. What is there to fix server-side?

Which input tools in which browser and browser version on which operating system have been used?

Malayalam has a class of characters called chillu: ൺ, ൻ, ർ, ൽ, ൾ.

Now the problem: in Unicode, these chillu can be encoded in two ways. The Malayalam Unicode table (enwiki) says that

  • ൺ -> 0d7a,
  • ൻ -> 0d7b,
  • ർ -> 0d7c,
  • ൽ -> 0d7d,
  • ൾ -> 0d7e,
  • ൿ -> 0d7f

are chillu. These are called atomic chillu.

But there is another way to create a chillu like

ന+്+zwj -> 0d28+0d4d+200D. That is [na ന]+[virāma ്]+[ZWJ].

The last code point is the Zero Width Joiner, or ZWJ (Unicode website).


So a single character can be represented in two different ways, that is, by two different code point sequences in Malayalam. We are trying to avoid that, because it causes big problems with search and hyperlinks.
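As a minimal illustration (Python, standard library only), the two encodings of ൻ are different code point sequences, and Unicode NFC normalization does not unify them, because the atomic chillu have no canonical decomposition. The helper name `codepoints` is made up for this example.

```python
import unicodedata

atomic = "\u0d7b"                # ൻ, atomic chillu
zwj_form = "\u0d28\u0d4d\u200d"  # ന + ് (virama) + ZWJ

def codepoints(s):
    """Return the code points of s as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

print(codepoints(atomic))    # ['U+0D7B']
print(codepoints(zwj_form))  # ['U+0D28', 'U+0D4D', 'U+200D']
print(atomic == zwj_form)    # False

# NFC leaves the ZWJ sequence unchanged, so the two forms still differ.
print(unicodedata.normalize("NFC", zwj_form) == atomic)  # False
```

This is why a plain string comparison, a search index, or a link target treats the two spellings as different text even though they look the same on screen.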

To resolve this problem, a fix called $wgFixMalayalamUnicode was introduced in MediaWiki. If it is set to true, every chillu created with ZWJ is replaced by the single-character chillu (atomic chillu) when a page is saved in MediaWiki.
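The kind of replacement described here can be sketched in Python as follows. This only illustrates the idea and is not MediaWiki's actual code or mapping table; the function name `to_atomic_chillu` and the `CHILLU` table are hypothetical and cover only the five chillu ൺ, ൻ, ർ, ൽ, ൾ.

```python
VIRAMA = "\u0d4d"  # ്
ZWJ = "\u200d"     # Zero Width Joiner

# Base consonant -> atomic chillu (illustrative table, not MediaWiki's)
CHILLU = {
    "\u0d23": "\u0d7a",  # ണ -> ൺ
    "\u0d28": "\u0d7b",  # ന -> ൻ
    "\u0d30": "\u0d7c",  # ര -> ർ
    "\u0d32": "\u0d7d",  # ല -> ൽ
    "\u0d33": "\u0d7e",  # ള -> ൾ
}

def to_atomic_chillu(text):
    """Replace each consonant + virama + ZWJ sequence with its atomic chillu."""
    for base, atomic in CHILLU.items():
        text = text.replace(base + VIRAMA + ZWJ, atomic)
    return text

# ഇന്ത്യൻ written with a ZWJ chillu at the end
zwj_word = "\u0d07\u0d28\u0d4d\u0d24\u0d4d\u0d2f\u0d28\u0d4d\u200d"
atomic_word = "\u0d07\u0d28\u0d4d\u0d24\u0d4d\u0d2f\u0d7b"
print(to_atomic_chillu(zwj_word) == atomic_word)  # True
```

Text that already uses atomic chillu passes through unchanged, so applying the function more than once is harmless.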

This conversion happens on Malayalam Wikipedia but not on Wikidata, so Wikidata contains a mix of ZWJ chillu and atomic chillu.

What may be the solution

Enable $wgFixMalayalamUnicode in the MediaWiki used by Wikidata. (But $wgFixMalayalamUnicode is deprecated. The release note of 1.35 says it is true by default.)
Do a simple edit on all Wikidata pages that use ZWJ chillu.
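To find which labels such an edit would need to touch, the ZWJ chillu sequences can be detected with a regular expression. A minimal Python sketch, assuming only the five chillu ൺ, ൻ, ർ, ൽ, ൾ; the function name `has_zwj_chillu` and the sample labels are hypothetical, and fetching labels from Wikidata is not shown.

```python
import re

# Base consonant (ണ ന ര ല ള) + virama + ZWJ
ZWJ_CHILLU = re.compile("[\u0d23\u0d28\u0d30\u0d32\u0d33]\u0d4d\u200d")

def has_zwj_chillu(text):
    """True if text contains at least one ZWJ-encoded chillu."""
    return ZWJ_CHILLU.search(text) is not None

labels = [
    "\u0d07\u0d28\u0d4d\u0d24\u0d4d\u0d2f\u0d7b",             # ഇന്ത്യൻ, atomic chillu
    "\u0d07\u0d28\u0d4d\u0d24\u0d4d\u0d2f\u0d28\u0d4d\u200d",  # same word, ZWJ chillu
]
needs_fix = [label for label in labels if has_zwj_chillu(label)]
print(len(needs_fix))  # 1
```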

Currently Wikidata is not converting ZWJ chillu to atomic chillu. This may be because the default language of the MediaWiki used by Wikidata is not Malayalam. But we need $wgFixMalayalamUnicode set to true. So how do we resolve this problem?

There is also a setting called $wgAllUnicodeFixes, which applies the fixes for all languages. If it is set to true in the MediaWiki used by Wikidata, the problem may be solved. @santhosh can you look into this problem?

Aklapper renamed this task from Malayalam Font not rendering correctly in Wikidata to Malayalam Font not rendering correctly in Wikidata (due to several ways to enter a Chillu character). Nov 1 2020, 5:39 PM
Aklapper changed the task status from Stalled to Open.

The rendering issue can be solved by simply using the latest version of a good-quality font. There are many.

The dual encoding issue is quite complicated and has been debated a lot. In practice, editors should use an input method that is bug-free and produces atomic chillu. Edits that have already been made can be fixed using a bot. Any search, sort, or other processing of Malayalam text should be aware of this complex encoding model (not limited to chillu) and process the content accordingly rather than interpreting it as raw byte sequences. I do not think that a configuration or extra processing within a single software like MediaWiki can solve the complications.

"I do not think that a configuration or extra processing within a single software like MediaWiki can solve the complications."

Actually, this problem does not occur on Malayalam Wikipedia, where, as far as I know, the variable mentioned by Ranjithsiji is set to true. So I think that once it is enabled here, whichever chillu form the user enters will automatically be converted to the one supported by MediaWiki.