
Arabic and western text on the same line causes incorrect interleaving
Closed, Invalid

Description

Author: joona.palaste

Description:
In the English Wikipedia article on Ruhollah Khomeini, having his name in the
native Arabic script (right-to-left) inside the normal English (left-to-right)
text of the article causes incorrect interleaving.
I am using Mozilla Firefox 2.0 on Fedora Core 5 Linux, and on my browser the
first two lines of the text look something like this:

"Grand Ayatollah Seyyed Ruhollah Mosavi Khomeini (listen (Persian pronunciation)
(help·info)) (Persian: [Arabic text]
[Arabic text] Rūḥollāh Mūsavī Khomeynī Arabic: 17) ([Arabic text] May 1900¹ - 3
June 1989) was a..."

I've placed "[Arabic text]" where it displays Arabic text so that this bug
report itself does not depend on the settings of the browser but illustrates the
issue as I see it. The problem is plainly visible: it is supposed to say that
Khomeini was born on 17 May 1900, but part of his Arabic name appears between
the day "17" and the month "May 1900". In the wiki markup source, everything
looks OK: the Arabic text is correctly interleaved with the western text.

Is this a bug with MediaWiki or with my browser?


Version: unspecified
Severity: normal
OS: Linux
Platform: PC
URL: http://en.wikipedia.org/wiki/Ruhollah_Khomeini

Details

Reference
bz9011

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 9:35 PM)
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz9011.
bzimport added a subscriber: Unknown Object (MLST).

joona.palaste wrote:

A detail screenshot of the rendered article text illustrating the problem.

Attached:

khomeini.png (200×640 px, 45 KB)

This can be fixed by surrounding the text with &rlm; (right-to-left mark,
U+200F) and &lrm; (left-to-right mark, U+200E), for example
&rlm;Rūḥollāh Mūsavī Khomeynī&lrm;. I suppose this could be templated as
{{rtl|Rūḥollāh Mūsavī Khomeynī}}, if that template doesn't already exist.
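
As a rough illustration of the wrapping suggested above, here is a minimal Python sketch. The helper names `wrap_with_marks` and `to_entities` are hypothetical and chosen only for this example; they are not part of MediaWiki or of any existing template. The sketch assumes the two marks meant are U+200F (RLM) and U+200E (LRM).

```python
# Hypothetical helpers illustrating the "surround with marks" workaround.
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, written &rlm; in HTML/wikitext
LRM = "\u200e"  # LEFT-TO-RIGHT MARK, written &lrm; in HTML/wikitext


def wrap_with_marks(text):
    """Return text preceded by an RLM and followed by an LRM."""
    return f"{RLM}{text}{LRM}"


def to_entities(text):
    """Make the invisible marks visible by spelling them as HTML entities."""
    return text.replace(RLM, "&rlm;").replace(LRM, "&lrm;")


if __name__ == "__main__":
    wrapped = wrap_with_marks("Rūḥollāh Mūsavī Khomeynī")
    print(to_entities(wrapped))
```

Because the marks are invisible, spelling them as entities in the page source (as `to_entities` does) makes the workaround easier to review and maintain than pasting the raw characters.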

See the previous bug 8996 about similar behaviour on special pages. I think a
server-side fix would be applicable to all instances of the direction override
problem.

*** This bug has been marked as a duplicate of 8996 ***

rotemliss wrote:

Bug 8996 is about a completely different problem, talking about a different kind
of direction marks. It's not a duplicate.

ayg wrote:

It's impossible to get correct directionality information from plain Unicode
text. Consider:

The Hebrew letter "aleph" is א, ב is "bet".

Note that aleph is א, bet is ב, and the logical order (as I typed it and as it
was encoded) has the א before the ב. The comma and space fall between two RTL
characters, so they're treated as RTL embedded in LTR. But semantically, the
comma is part of the LTR phrase (delimiting two LTR phrases, which happen to end
or begin with RTL characters) and should be treated as LTR text.

But consider this, which is syntactically identical:

Exodus 1:2 reads, in the original Hebrew: "ראובן, שמעון, לוי, ויהודה".

Here the behavior is correct, because in this context, the commas delimit RTL
phrases (or words), not LTR phrases. But there's no possible way either
MediaWiki or the browser could know that. The Unicode directionality algorithm
tries to do the impossible, and consequently fails. The only way to avoid this
problem is to add semantic information on how you want the directionality to go,
using Unicode directionality marks:

The Hebrew letter "aleph" is א‎, ב is "bet".
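
To make the effect of the invisible mark in the last example easier to see, here is a small Python sketch. It is a deliberately simplified model of how neutral characters (the comma and space) take their direction from the surrounding strong characters; it is not a full implementation of the Unicode bidirectional algorithm (it ignores numbers, embeddings, and overrides), and the function name `resolve_neutrals` is made up for this illustration. Hebrew letters are written as escapes so the source itself is not reordered by an editor.

```python
import unicodedata

LRM = "\u200e"    # LEFT-TO-RIGHT MARK: invisible, but strongly left-to-right
ALEPH = "\u05d0"  # Hebrew letter alef
BET = "\u05d1"    # Hebrew letter bet

# Map strong bidi classes to a direction; everything else is treated as neutral
# (a simplification made for this sketch).
STRONG = {"L": "L", "R": "R", "AL": "R"}


def resolve_neutrals(text, base="L"):
    """Give each character a direction: strong characters keep their own;
    neutrals take the direction of the nearest strong characters on both
    sides if those agree, otherwise the paragraph (base) direction."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    out = []
    for i, cls in enumerate(classes):
        if cls in STRONG:
            out.append(STRONG[cls])
            continue
        prev = next((STRONG[c] for c in reversed(classes[:i]) if c in STRONG), base)
        nxt = next((STRONG[c] for c in classes[i + 1:] if c in STRONG), base)
        out.append(prev if prev == nxt else base)
    return "".join(out)


if __name__ == "__main__":
    # Comma and space sit between two RTL letters, so they are pulled into the RTL run.
    print(resolve_neutrals(ALEPH + ", " + BET))        # RRRR
    # With an LRM after the first letter, the comma and space stay LTR.
    print(resolve_neutrals(ALEPH + LRM + ", " + BET))  # RLLLR
```

In this model the comma and space become right-to-left only because both strong neighbours are right-to-left; inserting the mark supplies the semantic hint that the plain text lacks.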