Unicode normalization "sorts" Hebrew/Arabic/Myanmar vowels wrongly
Open, Public

bzimport set Reference to bz2399.
Hippietrail created this task.Via LegacyJun 13 2005, 1:08 PM
Nahum added a comment.Via ConduitJul 28 2005, 10:53 AM

The bug, as I noticed it, is caused by the special characters used for vowels,
dagesh, right and left shin dots, etc. not being ordered properly by the wiki,
probably because they are not recognized as RTL.

Lots of free texts in Hebrew are quite ancient and depend on niqqud to be read
properly, so fixing this bug should be a high priority IMHO.

brion added a comment.Via ConduitJul 28 2005, 4:30 PM

Input text is checked for valid UTF-8 and normalized to Unicode Normalization Form C (canonical composed form).

Someone needs to provide:

  • Short and exact before and after examples
  • If possible, a comparison against other Unicode normalization implementations to show whether we're performing normalization incorrectly

If there is an error in my normalization implementation, and it can be narrowed down, I'd be happy to fix it.
If this is the result of the correct normalization algorithm, I'm not really sure what to do.
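
(For reference, the check described above can be sketched in Python with the standard unicodedata module; this is only an illustration of the idea, not MediaWiki's actual UtfNormal code.)

import unicodedata

def clean_input(raw: bytes) -> str:
    # Reject invalid UTF-8, then convert to Unicode Normalization Form C.
    text = raw.decode("utf-8")                 # raises UnicodeDecodeError on bad input
    return unicodedata.normalize("NFC", text)  # canonical composed form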

bzimport added a comment.Via ConduitJul 28 2005, 6:28 PM

dovijacobs wrote:

For a typical before-and-after example, see the following comparison of versions:

http://he.wikisource.org/w/index.php?title=%D7%90%D7%92%D7%A8%D7%AA_%D7%94%D7%A8%D7%9E%D7%91%22%D7%9F&diff=2794&oldid=1503

In that example, the only change actually made by the user was adding a category
at the end, but when the text was saved, the order of vowels was altered in most
of the words in the text.

If what Brion means is an example of a single word or something like that, it
will be hard to provide examples because only texts contributed until December
show "before" examples.

However, maybe this will help: When vowelized texts from word processors like
Word and Open Office are pasted into Wiki edit boxes, the vowels are
automatically changed to the wrong positions in the wiki coding.

bzimport added a comment.Via ConduitJul 28 2005, 6:36 PM

jeluf wrote:

Dovi, what browser are you using, and which version of it? Which operating system?

Looking at the diff that you provided, checking the first few lines, those look
OK to me.
All the letters are identical on the right and on the left.

bzimport added a comment.Via ConduitJul 28 2005, 6:41 PM

jeluf wrote:

Comparing with Brion's laptop (he uses MacOS 10.4, I use 10.3.9) the letters
differ between mine and his. There are dots in some of Brion's letters where I
don't see any.

brion added a comment.Via ConduitJul 28 2005, 6:44 PM

(I was testing in Safari and JeLuF in Firefox. They may render differently, or have been using different fonts...)

Yes, I would very much like to get individual words. You can copy them out of the Wikipedia pages if you like.

Very very helpful for each of these would be:

  • The 'before' formatting, saved in a UTF-8 text file (notepad on Windows XP is ok for this)
  • The 'after' formatting, saved in a UTF-8 text file
  • A detailed, close-up rendering of what it's supposed to look like (screen shot of 'before' correctly rendered, using a large enough font size I can tell the difference)
  • A detailed, close-up rendering of what it ends up looking like

If possible, a description of which bits have moved or changed and how this affects the reading of the text.

bzimport added a comment.Via ConduitJul 28 2005, 7:06 PM

eran_roz wrote:

a txt file in utf-8

Attached: bug2399

bzimport added a comment.Via ConduitJul 30 2005, 9:45 AM

sgb wrote:

I’m using IE 6 in Win2K Professional, and
I’ve been seeing this problem as well. Texts
that I created a year or so ago in Arabic
are fine, but if I now open and re-save them
(using all of the same software as before),
Arabic vowel pairs become reversed. I can
provide you here with some examples, one
with the vowels together, and another
separating the vowels with a tashdid
(baseline) ... then you can remove the
tashdid and bring the vowels together to see
what happens. (Tahoma would be a good font
to see this.)

  1. This pair is supposed to look like a little superscript w with an '''over'''line: سسّـَس سسَّس (if you get an '''under'''lined w, it’s reversed).

  2. This pair is supposed to look like a little superscript w with an '''under'''line: سسّـِس سسِّس (if the underline is below the entire '''word''' rather than below the little '''w''', it’s reversed).

  3. This pair is supposed to look like a little superscript w with a '''double over'''line: سسّـًا سسًّا (if you get a w with a double '''under'''line, it’s reversed).

  4. This pair is supposed to look like a little superscript w with a '''double under'''line: سسّـٍا سسٍّا (if the double underline is below the entire word rather than below the little w, it’s reversed).

  5. This pair is supposed to look like a little superscript w with a comma above it: سسّـُس سسُّس (if the comma is '''in''' the w rather than above it, it’s reversed).

  6. This pair is supposed to look like a little superscript w with a '''fancy''' comma above it: سسّـٌا سسٌّا (if the fancy comma is '''in''' the w rather than above it, it’s reversed).

As I am looking at this note '''before''' I
save it, everything on my screen appears
correct. After I save it, all six examples
will be reversed. You can insert spaces in
the examples to separate the vowels, and you
should find that they have become the
reverse order from the control examples with
tashdids (baselines) in them.

bzimport added a comment.Via ConduitJul 30 2005, 9:54 AM

sgb wrote:

I just now sent the above message (# 8)
concerning Arabic vowel pairs, and I see
that all of the vowel pairs are correct.
Clearly, the "bugzilla" software is
different from the "en.wiktionary.org"
software.

If you will copy my examples from the above
message into a Wiktionary page, you will see
how they become reversed.

brion added a comment.Via ConduitJul 30 2005, 8:02 PM

Here's the given string broken into groups of base and combining characters:

d7 91 U+05D1 HEBREW LETTER BET
d6 bc U+05BC HEBREW POINT DAGESH OR MAPIQ < in normalized string, this
d6 b7 U+05B7 HEBREW POINT PATAH < sequence is swapped

d7 99 U+05D9 HEBREW LETTER YOD
d6 b0 U+05B0 HEBREW POINT SHEVA

d7 91 U+05D1 HEBREW LETTER BET
d6 bc U+05BC HEBREW POINT DAGESH OR MAPIQ < in normalized string, this
d6 b7 U+05B7 HEBREW POINT PATAH < sequence is swapped

d7 a8 U+05E8 HEBREW LETTER RESH
d6 b0 U+05B0 HEBREW POINT SHEVA

d7 a1 U+05E1 HEBREW LETTER SAMEKH

The only change here in the normalized string is that the dagesh+patah
combining sequence is re-ordered into patah+dagesh.
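
(For reference, the swap can be reproduced with Python's unicodedata as an independent implementation: canonical ordering sorts a run of combining marks by combining class, and patah has a lower class than dagesh, so NFC places it first. A minimal sketch:)

import unicodedata

word = "\u05d1\u05bc\u05b7"   # bet, dagesh, patah (the first group above)
for ch in word:
    print("U+%04X %s: combining class %d"
          % (ord(ch), unicodedata.name(ch), unicodedata.combining(ch)))

# Canonical ordering sorts the marks by class, so NFC yields bet, patah, dagesh.
print(unicodedata.normalize("NFC", word) == "\u05d1\u05b7\u05bc")   # True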

I've tried displaying the before and after texts in Internet Explorer 6.0
(Windows XP), in Firefox Deer Park Alpha 2 (Mac OS X 10.4.2), and Safari 2.0
(Mac OS X 10.4.2). The two strings appear the same, even zoomed in, on IE/Win
and Firefox/Mac. In Safari the dots are slightly differently positioned.
I do not know whether this slight difference is relevant or 'real'.

Python program to confirm that another implementation gives the same results:

from unicodedata import normalize
before = u"\u05d1\u05bc\u05b7\u05d9\u05b0\u05d1\u05bc\u05b7\u05e8\u05b0\u05e1"
after = u"\u05d1\u05b7\u05bc\u05d9\u05b0\u05d1\u05b7\u05bc\u05e8\u05b0\u05e1"
coded = normalize("NFC", before)
if (coded == before) or (coded != after):
    print "something is broken"
else:
    print "as expected"
brion added a comment.Via ConduitJul 30 2005, 8:04 PM

Created attachment 754
Strings from attachment 1 displaying identically in IE 6.0 on Windows XP Professional SP2

Attached:

brion added a comment.Via ConduitJul 30 2005, 8:10 PM

Created attachment 755
Highlighted display difference in Safari on Mac OS X 10.4.2

The dots show slightly displaced in Safari 2.0 on Mac OS X 10.4.2 in the
normalized text.
Is that movement (from the black dot location to the red dot location)
significant?

They *do not* display differently in Firefox DeerPark alpha 2 on the same
machine.
Both string forms display identically on that browser and OS.

They *do not* display differently in Internet Explorer 6.0 on Windows XP
Professional SP2.
Both string forms display identically on that browser and OS.

Attached:

Nahum added a comment.Via ConduitAug 3 2005, 10:30 AM

The problem is only (I think) on Win 98 and XP prior to SP2.

bzimport added a comment.Via ConduitOct 7 2005, 10:06 AM

sgb wrote:

I’ve been requesting a fix for the incorrect
Arabic normalization (compound vowels) for
months, but Arabic still cannot be entered
and saved properly in en.wiktionary
articles, and I have never received a reply
to my requests. I don’t know if I haven’t
made myself clear, if no one has had the
time, or if no one thinks I know what I’m
talking about.

I use Firefox 1.0.7 and also IE 6 in Win2K
Pro. It makes no difference which browser I
use, I cannot save Arabic files correctly in
en.wiktionary...nor can anyone else,
apparently, because whenever somebody opens
an old Arabic article to make some small
change, the vowels become incorrectly
reversed upon saving.

I’ve been typesetting Arabic professionally
since the 1970’s and I know how it’s
supposed to be written. If you need
examples, either here or on en.wiktionary, I
can easily provide them.

In short, the current normalization produces
the wrong results with all compound vowels:
shadda+fatha, shadda+kasra, shadda+damma,
and shadda+fathatan, shadda+kasratan,
shadda+dammatan. In the following examples,
(A) = correct, and (X) = wrong:
(A) عصَّا ; (X) عصَّا
(A) عصِّا ; (X) عصِّا
(A) عصُّا ; (X) عصُّا
(A) عصًّا ; (X) عصًّا
(A) عصٍّا ; (X) عصٍّا
(A) عصٌّا ; (X) عصٌّا

Under the current normalization, if anyone
opens a page containing (A), it will become
(X) when he saves it (even if he makes no
changes). One example is
http://en.wiktionary.org/wiki/حسن , which
was written with all the correct vowels
prior to implementation of normalization
(and which appeared correctly), but has
since had to have some of its vowels removed
because of this serious problem.

I will be happy to explain further if anyone
needs clarification.

brion added a comment.Via ConduitOct 7 2005, 9:19 PM

What I need is a demonstration of incorrect normalization. This
is a Unicode standard and, as far as I have been able to test,
everything is running according to the standard.

Pretty much every current XML-based recommendation, file format
standard and protocol these days is recommending use of Unicode
normalization form C, which is what we're using. If this breaks
Arabic and Hebrew, then a lot of things are going to break in
the same way.

If there's a difference in rendering, is it:

  • A bug in the renderer?
  • Is this an operating system bug? (old versions of Windows)
  • Is this an application bug? (browser etc)
  • A bug in the normalization implementation?
  • A bug in the normalization rules that Unicode defines?
  • A bug in the Unicode data files?
  • A corrupt copy of the Unicode data files?

The impression I've been given is that it's a bug in old versions
of Windows and that things render correctly on Windows XP. Can
you confirm or refute this?

Can you make a clear, supportable claim that a particular normalized
character sequence is incorrectly formed? If so, how should it be
formed? Is the correct formation normalized or not? If not why not?
If so why isn't it what we get from normalizing the input?

Is there an automatic transformation we can do on output? If so what?
If there is, should we do so? What are the complications that can arise?

Or perhaps the error is in the arrangement of the original input?
Where does the input come from and what arranges it? Is it arranged
correctly? If not how should it be arranged? How can it be arranged?

Is there an automatic transformation we can do on input? If so what?
If there is, should we do so? What are the complications that can arise?

On these questions I've gotten a lot of nothing. The closest has been
an example of a string in 'before' and 'after' state, which appears to
render identically in Windows... so what's the problem?

Nahum added a comment.Via ConduitOct 8 2005, 3:42 PM

I can confirm that the bug has been fixed for Hebrew in Service Pack 2 of Windows XP,
but not in earlier versions. If this is the case for Arabic as well, which our
Arabic-reading members can check, then we should probably add to the main he.wiki
pages, and the equivalent Arabic ones, an explanation of the problem with a
recommendation to upgrade to said OS and service pack.

bzimport added a comment.Via ConduitOct 11 2005, 3:19 PM

iwasinnam wrote:

Correct rendering of the string "Bibi" with fixed-width font

Attached: Bibi_-_correct_rendering.bmp

bzimport added a comment.Via ConduitOct 11 2005, 3:21 PM

iwasinnam wrote:

Incorrect rendering of the string "Bibi" with fixed-width font

screenshot taken in wiki editor box after pressing 'Show Preview'

Attached: Bibi_-_incorrect_rendering.bmp

bzimport added a comment.Via ConduitOct 11 2005, 3:32 PM

iwasinnam wrote:

If indeed the Unicode normalization rules imply the switching of the DAGESH and
the PATAH (as demonstrated in comment #10), then I suppose it's a bug in the
renderer.
As for the way things _should_ be, it is completely insignificant to a user which
way the symbols are stored. In Hebrew (manual) writing it is completely
insignificant whether the DAGESH is written down before the PATAH or vice versa. When
typing text on a computer (at least in Windows), the text is displayed and stored
correctly only if the DAGESH is entered first. I don't have the tools here to examine
the way it is stored internally, but it is nevertheless rendered correctly every
time. This is not the case in the wiki. Once the procedure switches the two symbols,
the DAGESH is displayed _outside_ of the BET. An obvious misrendering (see
attachments id=978, id=979).
I have experienced this bug on Windows 2000 as well as Windows XP with IE 6.0.x.
I believe this should be considered a significant bug, as these are highly popular
environments. Moreover, Hebrew (and Arabic) are used mostly in scriptures, poetry
and transliteration of foreign words and names. Many wiki pages (especially in
Wikitext) contain such texts. The bug makes such text hard to read and is
_very_ apparent to any user who tries to read these texts (and very annoying for
me, as I am currently writing about China and constantly need to transliterate
Chinese names).

Nahum added a comment.Via ConduitOct 11 2005, 3:47 PM

(In reply to comment #19, by Ariel Steiner)
Ariel, did you experience this bug in Win XP with Service Pack 2? I use that and
see Hebrew with nikkud on wiki perfectly. Others have reported this bug to exist
in Win XP with SP1 but without SP2, so I assume it has been fixed in the latter
service pack.

bzimport added a comment.Via ConduitOct 15 2005, 7:21 AM

iwasinnam wrote:

I experienced the bug on both WinXP (no SP2) and Win2K, both with IE6 and Firefox
1.0.7. I don't see why a user should have to upgrade from Win2K (or Me) to WinXP SP2
just because of a nikkud problem.

bzimport added a comment.Via ConduitOct 16 2005, 8:44 AM

dovijacobs wrote:

I'd like to add to Ariel's comments that nikkud works perfectly fine in various
fonts and on all platforms in word processors: Word for Windows and Open Office.
Why should MediaWiki be any different? Don't the word processors also use Unicode?

Dovi

brion added a comment.Via ConduitOct 16 2005, 10:23 AM

Dovi, typical word processors probably aren't applying canonical normalization to
text.

Ok, spent some time googling around trying to find more background on this. Basically
there seem to be two distinct issues:

  1. The normalization rules order some nikkud combinations differently from what the font renderer in old versions of Windows expects. This is a bug in either Windows or the font. From all indications that have been given to me, this is fixed in the current version of Windows (XP Service Pack 2).

  2. In some rarer cases, appearing in at least Biblical Hebrew, actual semantic information may be lost by application of normalization. This is a bug in the Unicode standard, but it's already established. Some day they may figure out a proper workaround.

As for 1), my inclination is to recommend that you upgrade if it's bothering you.
Turning off normalization in general would open us up to various weird data
corruption, confusing hard-to-reach duplicate pages, easier malicious name spoofing,
etc. If Microsoft has already fixed the bug in their product, great. Use the fixed
version or try a competing OS.

It might be possible to add a postprocessing step to re-order output to what old
buggy versions of Windows expect, but this sounds error-prone.

As for 2), it's not clear to me whether this is just a phantom problem that _might_
break something or if it's actually breaking text. (Most stuff is probably affected
by problem 1.) There's not much we can do about this if it happens other than turning
off normalization (and all that entails).

Background links:
http://www.unicode.org/faq/normalization.html#8
http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html
http://lists.ibiblio.org/pipermail/biblical-languages/2003-July/000763.html

Hippietrail added a comment.Via ConduitOct 16 2005, 3:10 PM

Does anybody know if the Windows bugs were in the fonts, in
Uniscribe, or in both? Can the new Uniscribe handle the old
fonts for instance?

If all or part of the problem was with the fonts, then what
about 3rd party fonts not under Microsoft's control?

Also, has Microsoft issued any kind of fix for OSes other than
XP?

Has anybody tested this on any Unix or Linux platforms? How does
Pango handle this?

Without knowing the answers to all these questions, I would lean
to a user option to perform a post-normalization compatibility
re-ordering.

bzimport added a comment.Via ConduitOct 27 2005, 10:43 PM

gangleri wrote:

Hello!

[[en:Wikipedia_talk:Niqqud#Precombined_characters_-_NON-precombined_characters]]
relates some notes received from
http://mysite.verizon.net/jialpert/YidText/YiddishOnWeb.htm : Recommendations
for Displaying Yiddish Text on Web Pages.

Depending on platform, browser, characters (and fonts?), one may experience
some of the mentioned problems.

http://mysite.verizon.net/jialpert/YidText/YiddishOnWeb.htm suggests as an "output"
preference to use "precombined characters" and to "postpone" "NON-precombined
characters" for later days.

Consequences: Wikimedia projects should provide at least some notes about the
problem (affected platforms / browsers / what to do / how to configure / upgrade
to ...)

Regards Reinhardt [[user:gangleri]]

bzimport added a comment.Via ConduitNov 5 2005, 1:03 AM

gangleri wrote:

Please see also
bug 3885: title normalisation

bzimport added a comment.Via ConduitApr 7 2006, 6:21 PM

rotemliss wrote:

I've tried to check what causes the problem, and I've found it.

The problem is in UtfNormal::fastCombiningSort, in the file
phase3/includes/normal/UtfNormal.php. It re-orders the Nikud according to its
numbers in $utfCombiningClass (defined in
phase3/includes/normal/UtfNormalData.inc). This array, unserialized, is shown in
[[he:Project:ניקוד#איפיון הבאג]], in the <pre>. You can see that Dagesh is 21 and
Patah is 18, so they are re-ordered: instead of Dagesh+Patah, we get
Patah+Dagesh. But they SHOULD be first Dagesh, then Patah, because that's their
order, so it's a bug in MediaWiki that we re-order them. In WinXP SP2 they are
shown correctly, but only because of a *workaround* (it's not a bugfix there, only a
workaround for mistakes); their order is still wrong. Maybe in Vista they won't
use this workaround.

The question is, what does this function (UtfNormal::fastCombiningSort) do?
What is its purpose? Why should it sort the Nikud, or anything else? It is already
sorted correctly. How is it related to the normalization? Is there any documentation
about it?

You can just delete the Nikud from the $utfCombiningClass array if you want the
function to stop re-ordering it.

Changing the summary, because that's exactly the bug. Also changing the OS and
Hardware fields, because the bug is not only there: the final display problem is there,
but the underlying problem exists everywhere.

Thank you very much, and please answer my questions in the third paragraph, so
we will be able to fix that bug.
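
(For what it's worth, here is a rough Python sketch of what a canonical-ordering pass such as UtfNormal::fastCombiningSort is meant to do: a stable sort of each run of combining marks by combining class, as the Unicode normalization algorithm requires. This is an illustration, not the actual MediaWiki code.)

import unicodedata

def canonical_order(text: str) -> str:
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):            # non-zero class: reorderable mark
            run.append(ch)
        else:                                    # class 0: flush the run, keep the base
            out.extend(sorted(run, key=unicodedata.combining))   # stable sort by class
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Dagesh+Patah becomes Patah+Dagesh, exactly the swap described above.
print(canonical_order("\u05d1\u05bc\u05b7") == "\u05d1\u05b7\u05bc")   # True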

bzimport added a comment.Via ConduitApr 7 2006, 6:25 PM

rotemliss wrote:

(In reply to comment #27)

This array, unserialized, is shown in [[he:Project:ניקוד#איפיון הבאג]],
in the <pre>.

Now it's shown in [[User:Rotemliss/Nikud]].

brion added a comment.Via ConduitApr 7 2006, 7:13 PM

Rotem, this function implements a Unicode standard. The bug is in the standard.
Until some future version of Unicode "fixes" this, I'm just going to mark this
bug as LATER.

bzimport added a comment.Via ConduitMay 22 2006, 7:36 PM

iwasinnam wrote:

I for one totally support the suggested solution, namely "Remove the
normalization check" etc.
That would be ideal for the Hebrew Wikipedia since its guidelines strictly
forbid the use of nikkud (vowel markers) in its titles, i.e., there are no
composed letters in document titles. Separating the title and display title
would also be very convenient because it will allow easy searching on one hand
and the use of nikkud in the display title where appropriate.

bzimport added a comment.Via ConduitJan 23 2007, 10:33 PM

kenw wrote:

Incidentally, this is not a "bug" in the Unicode Standard, and won't be fixed
later in that standard. The entire issue of canonical ordering of "fixed
position" class combining marks for Hebrew has been debated extensively on the
Unicode forums, but the outcome isn't about to change, because of requirements
for stability of normalization.

The problem is in people's interpretation of the *intent* of canonical
ordering in the Unicode Standard. (See The Unicode Standard, 5.0. p.
115.) "The canonical order of character sequences does *not* imply any kind of
linguistic correctness or linguistic preference for ordering of combining
marks in sequences." In effect, the Unicode Standard is agnostic about the
input order or linguistically preferred order of dagesh+patah (or
patah+dagesh). What normalization (and canonical ordering) *do* imply,
however, is that the two sequences are to be interpreted as equivalent.

It sounds to me like Mediawiki is implementing Unicode normalization correctly.

The bug, if anything, is in the *rendering* of the sequences, as implied by
some of the earlier comments on this. dagesh+patah or patah+dagesh should
render identically -- there is no intent that they stack in some different way
dependent on their ordering when rendered. The original intent of the fixed
position combining classes in the standard was that they applied to combining
marks whose *positions were fixed* -- in other words, the dagesh goes where
the dagesh is supposed to go, and the patah goes where the patah is supposed
to go, regardless of which order they were entered or stored.

Also, it should be noted that the Unicode Standard does not impose any
requirement that Unicode text be stored in normalized form. Wikimedia is free
to normalize or not, depending on its needs and contexts. Normalization to NFC
in most contexts is probably a good idea, however, as it simplifies
comparisons, sorts, and searches. But as in this particular case for Hebrew,
you can run into issues in the display of normalized text, if your rendering
system and/or fonts are not quite up to snuff regarding the placement of
sequences of marks for pointed Hebrew text.

--Ken Whistler, Unicode 5.0 editor
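
(The equivalence described above is easy to check against an independent implementation, for example Python's unicodedata:)

import unicodedata

dagesh_first = "\u05d1\u05bc\u05b7"   # bet, dagesh, patah (as typed)
patah_first  = "\u05d1\u05b7\u05bc"   # bet, patah, dagesh (as normalized)

# Both orders are canonically equivalent: they normalize to the same string,
# so a conforming renderer must display them identically.
print(unicodedata.normalize("NFC", dagesh_first) ==
      unicodedata.normalize("NFC", patah_first))   # True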

bzimport added a comment.Via ConduitJun 18 2008, 5:51 AM

dovijacobs wrote:

Hebrew vowelization seems much improved in Firefox 3. It would be nice to know exactly what changed and how, and have these things documented in case there are future problems.

Firefox 3 seems to correctly represent the vowel order for webpages in general and Wikimedia pages in particular.

The only anomaly I nevertheless found is that pasting vowelized text into the edit page only shows partial vowelization. On the "saved" wiki page it appears correctly.

bzimport added a comment.Via ConduitJun 18 2008, 12:20 PM

rotemliss wrote:

(In reply to comment #33)

Hebrew vowelization seems much improved in Firefox 3. It would be nice to know
exactly what changed and how, and have these things documented in case there
are future problems.

Firefox 3 seems to correctly represent the vowel order for webpages in general
and Wikimedia pages in particular.

The only anomaly I nevertheless found is that pasting vowelized text into the
edit page only shows partial vowelization. On the "saved" wiki page it appears
correctly.

The bug of showing the Dagesh and other vowels in the wrong order usually depends on the operating system. For example, Windows XP (possibly only with Service Pack 2) displays it well, while older Windows systems don't.

However, Firefox 3.0 did fix some Hebrew vowel bugs, like the problem with Nikud in justified text (see https://bugzilla.mozilla.org/show_bug.cgi?id=60546 ).

Nikerabbit added a comment.Via ConduitJul 16 2008, 1:02 PM

*** Bug 14834 has been marked as a duplicate of this bug. ***

bzimport added a comment.Via ConduitJul 16 2008, 2:23 PM

ravi.chhabra wrote:

Since this bug also affects Myanmar in exactly the same way, could the title be appended with Myanmar as well? Normalization is not taking place the way it should. Here is the sort sequence as it should be, as specified in Unicode Technical Note #11.

Name Specification
Consonant [U+1000 .. U+102A, U+103F, U+104E]
Asat3 U+103A
Stacked U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y U+103B
Medial R U+103C
Medial W U+103D
Medial H U+103E
E vowel U+1031
Upper Vowel [U+102D, U+102E, U+1032]
Lower Vowel [U+102F, U+1030]
A Vowel [U+102B, U+102C]
Anusvara U+1036
Visible virama U+103A
Lower Dot U+1037
Visarga U+1038

I can provide more technical detail if needed. Hence U+1037 should always come after U+103A (even though U+103A is 'higher'). And U+1032 should come _before_ U+102F, U+1030, U+102B, U+102C and so on. I noticed that this bug is related more to Unicode normalization than to MediaWiki itself. But an important question I have is: *can* the Unicode normalization check be disabled for the Myanmar Wikipedia while we try to resolve it? That would be very helpful.

bzimport added a comment.Via ConduitJul 16 2008, 2:33 PM

ayg wrote:

(In reply to comment #36)

Since this bug also effects Myanmar exactly in the same way, could the title be
appended with Myanmar as well?

You can do things like that yourself here.

But an important
question I have is *can* Unicode Normalization Check be disabled for Myanmar
Wikipedia while we try to resolve it? Thanks, because that would be very
helpful?

See [[mw:Unicode normalization concerns]]. This is feasible. We could turn off normalization for article text and leave it for titles, which would allow DISPLAYTITLE to be used to work around ugly display in titles. However, it would require some work.

bzimport added a comment.Via ConduitJul 16 2008, 3:00 PM

ravi.chhabra wrote:

I would prefer normalization, as there are benefits from it, since it enforces a particular sequence. My question now is: what kind of data should I provide to Brion Vibber so that he can implement the normalization for Myanmar? Our case is quite different from Hebrew and is more straightforward. I believe UTN #11 v2 would be sufficient? It was updated recently for Unicode 5.1.

I would like to wait a while before actually thinking of disabling normalization for article text and using a workaround for titles. If it can be implemented we won't need to turn off normalization, and we would benefit from it. Thanks.

bzimport added a comment.Via ConduitJul 16 2008, 8:31 PM

ayg wrote:

It would almost certainly be a bad idea to use different normalization for a single wiki. This would create complications when trying to, for instance, import pages. If this is genuinely an issue for Myanmar, we should fix it in the core software for all MediaWiki wikis that contain any Myanmar text. Same for Hebrew and Arabic.

What exactly is the issue here? Some user agents render theoretically equivalent sequences of code points differently, so normalization changes display? Which user agents are these?

bzimport added a comment.Via ConduitJul 16 2008, 9:03 PM

ravi.chhabra wrote:

Relative Order (Normalization?) for Unicode 5.1 Myanmar

Attached:

bzimport added a comment.Via ConduitJul 16 2008, 9:04 PM

ravi.chhabra wrote:

Relative Order (Normalization?) for pre-Unicode 5.1/Myanmar

Attached:

bzimport added a comment.Via ConduitJul 16 2008, 9:44 PM

ravi.chhabra wrote:

I have attached two images. The first one shows the normalization sequence for 5.1, and the second one shows the normalization sequence for pre-Unicode 5.1. It is drastically different. Copies of both can be found here:
http://unicode.org/notes/tn11/myanmar_uni-v2.pdf
Page 4 for the latest, page 9 for the deprecated one.

The normalization done in MediaWiki seems to be for pre-5.1. I have added the pre-5.1 table here.
Name Specification
kinzi U+1004 U+1039
Consonant [U+1000 .. U+102A]
Stacked U+1039 [U+1000 .. U+1019, U+101C, U+101E, U+1020, U+1021]
Medial Y U+1039 U+101A
Medial R U+1039 U+101B
Medial W U+1039 U+101D
Medial H U+1039 U+101F
E vowel U+1031
Lower Vowel [U+102F, U+1030]
Upper Vowel [U+102D, U+102E, U+1032]
A Vowel U+102C
Anusvara U+1036
Visible virama U+1039 U+200C
Lower Dot U+1037
Visarga U+1038

Yes, normalization changes display. I have attached a JPEG file showing the resulting error here: https://bugzilla.wikimedia.org/show_bug.cgi?id=14834

bzimport added a comment.Via ConduitJul 16 2008, 10:26 PM

ayg wrote:

Contents of includes/normal/UtfNormalData.inc

As far as I can tell, MediaWiki is indeed using the 5.1 tables. I've attached the data used for normalization, which is generated by a script that downloads the appropriate files from http://www.unicode.org/Public/5.1.0/ucd/. If you can spot an error, please say what it is.

You might want to talk to Tim Starling, since as far as I can tell he's the one who wrote this.

Attached: utf8

bzimport added a comment.Via ConduitJul 19 2008, 3:48 PM

ravi.chhabra wrote:

U+1037 is int(7) and U+103A is int(9); this means that U+1037 will always be put first? This seems very similar to the patah-dagesh issue. :(

This is the relevant section of $utfCombiningClass:

["့"]=>
int(7)
["္"]=>
int(9)
["်"]=>
int(9)

The order given here does not seem to be the same as the order given in UTN #11. I guess this is a lesson not to take UTNs too seriously. I do like the sort order as it is in Wikipedia; it's just that it has problems with fonts. And I am a bit surprised that the data in the UCS does not match what was authored in the UTN. So as far as MediaWiki is concerned, it's just like the situation with Hebrew. We will now need to move over to the Unicode mailing list and ask what's going on. Simetrical, many thanks for clearing this one up for me. :)

As a side note, the developer of the Parabaik font gave me this link: http://ngwestar.googlepages.com/padaukvsmyanmar3
I noticed that the sequence was recently changed to the one mentioned there.
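
(The class values quoted above can be cross-checked against another implementation; a small sketch with Python's unicodedata, assuming a Python build with Unicode 5.1 data or later and using a made-up syllable KA + asat + dot below:)

import unicodedata

for cp in (0x1037, 0x103A):
    ch = chr(cp)
    print("U+%04X %s: class %d" % (cp, unicodedata.name(ch), unicodedata.combining(ch)))

# U+1037 has the lower class, so canonical ordering puts it before U+103A,
# even though UTN #11 asks for the visible virama (U+103A) to come first.
sample = "\u1000\u103A\u1037"
print([hex(ord(c)) for c in unicodedata.normalize("NFC", sample)])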

bzimport added a comment.Via ConduitJul 23 2008, 11:44 PM

ravi.chhabra wrote:

Found something that should not have been re-sequenced.

Input: U+101E U+1004 U+103A U+1039 U+1001 U+103B U+102C
Output: U+101E U+1001 U+103B U+102C U+1004 U+103A U+1039

The output is wrong because U+1004 is a consonant and U+1001 is also a consonant. Hence MediaWiki should not have swapped them, that is, if my understanding of Unicode normalization is correct. My understanding is that the sorting starts over whenever a new consonant starts, because that is the beginning of a new syllable cluster. No font will be able to render the output from MediaWiki.
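
(If the re-sequencing really crosses consonant boundaries, it cannot come from standard canonical ordering, which only permutes marks with a non-zero combining class and never moves base characters. A quick check of that input against Python's unicodedata:)

import unicodedata

inp = "\u101E\u1004\u103A\u1039\u1001\u103B\u102C"
out = unicodedata.normalize("NFC", inp)
print(out == inp)   # expected True: the consonants act as reordering boundaries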

bzimport added a comment.Via ConduitJul 24 2008, 12:34 AM

ayg wrote:

I suggest you e-mail Tim Starling.

bzimport added a comment.Via ConduitOct 23 2008, 2:28 AM

ravi.chhabra wrote:

I am adding here that the issue with Myanmar Unicode (Lower Dot and Visible Virama) is an issue that will be covered in the revision to UTN #11, as a foresight in the standards review process. Due to the stability criteria of UnicodeData.txt, there is nothing we can do about this. This is not a MediaWiki bug; since many people are now referencing this to point it out as a bug, I need to clarify that here. This sadly does mean that fonts and IMEs will need to be updated, and meanwhile MediaWiki 1.4 will have the problem mentioned here; the way to resolve this is simply to wait for updated fonts and IMEs. The advantages of turning off normalization far outweigh the disadvantages. If there are plans to adopt a less invasive normalization process, as mentioned in Normalization Concerns, then the issue can be resolved. The developers of fonts and IMEs have agreed to update, so those running MediaWiki installations might want to keep normalization on.

The second issue, with kinzi (comment #45), seems to be resolved now. Was MediaWiki updated between July and now?

bzimport added a comment.Via ConduitJan 6 2010, 12:20 PM

gangleri wrote:

FYI: https://bugzilla.wikimedia.org/show_activity.cgi?id=2399
I did not change priorities; I only added myself as CC.
It seems that the Priority field is gone.

Amire80 added a comment.Via ConduitMay 22 2011, 7:45 AM

Marking REOPENED. The standard has been updated since 2006. We discussed this at the Berlin Hackathon.

Amire80 added a comment.Via ConduitMay 26 2011, 5:47 PM

See another demonstration of this problem here:

http://en.wikisource.org/wiki/User:Amire80/Havrakha

brion added a comment.Via ConduitMay 26 2011, 5:54 PM

Assigning to me so we can look over the current state and see about fixing it up.

Verdy_p added a comment.Via ConduitSep 3 2011, 3:58 PM

Apparently, you have not implemented the contractions and expansions of the UCA.

Note that there has been NO change in Unicode 5.1 (or later) to normalization, which has been stable since at least Unicode 4.0.1.
The bugs above are most probably not related to normalization, if it is implemented correctly (and normalization is an easy problem that can be implemented very efficiently).

And the changes in the DUCET (or now the CLDR DUCET) do not affect how Hebrew, Arabic or Myanmar is sorted within the same script.

Then you should learn to separate the Unicode Normalization Algorithm (UNA), the Unicode Collation Algorithm (UCA), and the Unicode Bidi Algorithm (UBA), because the Bidi algorithm only affects the display, but definitely NOT the other two.

And the order produced by normalization is orthogonal to the order of collation weights generated by the UCA, even if normalization is assumed to be performed before computing collations (this is not a requirement; it just helps reduce the problem by making sure that canonically equivalent strings will collate the same).

Many posters above seem to be completely mixing up the problems!

Verdy_p added a comment.Via ConduitSep 3 2011, 4:00 PM

Note: for Thai, Lao and Tai Viet, normalization does not reorder the prepended vowels (and neither does the Bidi algorithm).

But such reordering is *required* when implementing the UCA, and there it takes the form of contractions and expansions, which are present in the DUCET for these scripts.

Verdy_p added a comment.Via ConduitSep 3 2011, 4:33 PM

Final note: it is highly recommended NOT to save texts with an implicit normalization, even if normalization is implemented correctly.

There are known defects (yes, bugs) in the renderers of browsers, which frequently do not implement normalization and are not able to sort, combine and position the diacritics correctly if they are not in a specific order, which is not the same as the normalized order.

There are also problems caused by incorrect assumptions made by writers (who have not understood when and where to insert CGJ to prevent the normalization from reordering some pairs of diacritics), and who have therefore written their texts in such a way that they "seem" to render correctly, but only on a buggy browser that does not perform normalization correctly and/or has strong limitations in its text renderer (unable to recognize strings that are canonically equivalent, because it expects only one order of successive diacritics in order to position them correctly).

This type of defect is typical of the "bug" described above about the normalized order of the DAGESH (a central point in the middle of a consonant letter, modifying it) or the SIN/SHIN DOTS (above the letter, on the left or right, also modifying the consonant) relative to the other Hebrew vowel diacritics: yes, normalization reorders the vowel diacritics before the diacritics that modify the consonant (this is the effect of an old assignment of their relative combining classes, in a completely illogical order of values, but this will NEVER be changed, as that would affect normalization).

But many renderers are not able to display correctly the strings that are encoded in normalized order (base consonant, vowel diacritic, sin dot or shin dot or dagesh). Instead they expect the string to be encoded as (base consonant, dagesh or sin dot or shin dot, vowel diacritic), even though it is canonically equivalent to the previous form and should display exactly the same! (Such rendering bugs were found in old versions of Windows with IE6 or earlier.)

For this reason, you should not, in MediaWiki, apply any implicit renormalization to edited text. If someone enters (base consonant, dagesh or sin dot or shin dot, vowel diacritic) in the wiki text, keep it unchanged and do not normalize it, as it will then display correctly both on the old buggy renderers and on newer ones.
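
(For reference, the CGJ technique mentioned above can be sketched with Python's unicodedata: U+034F COMBINING GRAPHEME JOINER has combining class 0, so it blocks canonical reordering across it. Whether inserting CGJ is appropriate for a given text is a separate editorial question.)

import unicodedata

plain    = "\u05d1\u05bc\u05b7"          # bet, dagesh, patah
with_cgj = "\u05d1\u05bc\u034f\u05b7"    # bet, dagesh, CGJ, patah

print([hex(ord(c)) for c in unicodedata.normalize("NFC", plain)])     # patah moved before dagesh
print([hex(ord(c)) for c in unicodedata.normalize("NFC", with_cgj)])  # original order preserved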

Verdy_p added a comment.Via ConduitSep 3 2011, 4:37 PM

All my remarks in the previous message also apply to the Arabic diacritics.

For example, the assumptions made by Brion Vibber in his message #23 are completely wrong. He has not understood what normalization is, and that with conforming renderers normalization *must not* affect the rendering (if it does, this is due to bugs in the renderers, not bugs in the normalizer used by MediaWiki).

bzimport added a comment.Via ConduitSep 29 2011, 12:48 PM

merelogic wrote:

*** Bug 31183 has been marked as a duplicate of this bug. ***

kaldari added a comment.Via ConduitDec 8 2011, 9:39 PM

This should probably be reassigned to one of our localization engineers.

Matanya added a comment.Via ConduitJul 30 2012, 1:53 PM

Reassigned to Amir, as he is one of the localization engineers. This bug is still present, as can be seen at: https://en.wikisource.org/wiki/User:Amire80/Havrakha

bzimport added a comment.Via ConduitFeb 22 2014, 6:06 PM

dovijacobs wrote:

For an extremely clear description of the problem in Hebrew, see here (pp. 8 ff.):
http://www.sbl-site.org/Fonts/SBLHebrewUserManual1.5x.pdf

Aklapper added a comment.Via ConduitMay 18 2014, 9:42 AM

Amir: Do you (or the L10N team) plan to take a look at this at some point?
This ticket is in 14th place on the list of open tickets with the most votes...

Qgil added a subscriber: Qgil.Via WebDec 12 2014, 5:15 PM

... and one of the oldest open and assigned tasks.

Qgil placed this task up for grabs.Via WebJan 9 2015, 10:30 PM
Qgil added a subscriber: Language-Engineering.

Reassigned to Amir, as he is one of the localization engineers. This bug is still present, as can be seen at: https://en.wikisource.org/wiki/User:Amire80/Havrakha

@Amire80 didn't take this task himself, so I placed it up for grabs. CCing Language-Engineering instead.

Liuxinyu970226 removed a subscriber: Liuxinyu970226.Via WebFeb 27 2015, 9:55 AM
Krinkle added a project: utfnormal.Via WebApr 6 2015, 12:42 PM
Krinkle set Security to None.
Aklapper added a project: RTL.Via WebJun 2 2015, 2:20 PM
Restricted Application added a project: I18n.Via HeraldJun 2 2015, 2:20 PM
