Normalization of Arabic presentation forms
Closed, Resolved (Public)

Assigned To

None

Authored By

	tstarling
	Mar 25 2007, 4:43 PM

Description

According to the Unicode FAQ:

Q. Is it necessary to use the presentation forms that are defined in Unicode?

A. No, it is not necessary to use those presentation forms. Those forms were
selected and identified in the early days of developing Unicode when
sophisticated rendering engines were not prevalent. A selected subset of the
presentation forms was included to provide users with a simple method to
generate them.

Q. Can one use the presentation forms in a data file?

A. It is strongly discouraged and not recommended because it does not
guarantee data integrity and interoperability. In the particular case of Arabic,
data files should include only the characters in the Arabic block, U+0600 to U+06FF.

Unidentified broken clients are inserting Arabic presentation forms into
articles on ar.wikipedia.org. This causes problems because some browsers do not
display these characters. I suggest we convert presentation forms to their
canonical equivalent during NFC normalisation on page save. For those rare cases
where isolated characters in specified forms are required, HTML character
entities can be used.
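A minimal sketch of the proposed conversion, written in Python rather than MediaWiki's own code. The function and constant names are hypothetical, and the two code point ranges (FB50-FDFF and FE80-FEFF) are an assumption about which characters count as presentation forms. Only characters in those ranges are folded through their compatibility mappings; the rest of the text gets ordinary NFC, so unrelated compatibility characters are left alone.

```python
import unicodedata

# Assumed ranges for Arabic presentation forms (hypothetical constant name).
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE80, 0xFEFF))


def is_presentation_form(ch: str) -> bool:
    """Return True if ch falls in one of the assumed presentation-form ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRESENTATION_RANGES)


def fold_presentation_forms(text: str) -> str:
    """Fold Arabic presentation forms to base letters, then apply NFC.

    NFKC is applied per character and only inside the assumed ranges, so
    characters such as the Latin "fi" ligature (U+FB01) are not touched.
    """
    folded = "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )
    return unicodedata.normalize("NFC", folded)
```

Code points in these ranges that have no decomposition mapping pass through unchanged, because NFKC has nothing to replace them with.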

Version: 1.10.x
Severity: normal

Details

Reference: bz9413

Related Objects

Mentioned In: T94826: Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular)

Event Timeline

• bzimport raised the priority of this task to Medium. Nov 21 2014, 9:40 PM

• bzimport added a project: MediaWiki-Page-editing.

• bzimport set Reference to bz9413.

• bzimport added a subscriber: Unknown Object (MLST).

tstarling created this task. Mar 25 2007, 4:43 PM

We already do NFC normalization on page save. Are you asking for additional
conversions? If so, can you specify?

Yes, additional conversions. The Arabic presentation forms (FB50-FDFF and
FE80-FEFF) should be converted to their equivalents in the Arabic block,
0600-06FF. The relevant mapping is given in the Decomposition_Mapping field of
UnicodeData.txt. For example:

FB51;ARABIC LETTER ALEF WASLA FINAL FORM;Lo;0;AL;<final> 0671;;;;N;;;;;

Because there is a formatting tag "<final>", this is a compatibility mapping
(part of NFKC), rather than a canonical mapping (part of NFC).
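The distinction can be checked with Python's standard `unicodedata` module, which exposes the Decomposition_Mapping field. This is only an illustrative check (the variable names are made up here), not the code used for the fix: a mapping that starts with a `<tag>` is a compatibility mapping, so NFC leaves the character alone while NFKC replaces it.

```python
import unicodedata

ch = "\uFB51"  # ARABIC LETTER ALEF WASLA FINAL FORM

# Decomposition_Mapping as recorded in UnicodeData.txt for this character.
mapping = unicodedata.decomposition(ch)

# A leading formatting tag such as "<final>" marks a compatibility mapping.
is_compatibility = mapping.startswith("<")

# NFC applies canonical mappings only; NFKC also applies compatibility ones.
nfc = unicodedata.normalize("NFC", ch)
nfkc = unicodedata.normalize("NFKC", ch)

print(mapping)                    # <final> 0671
print(is_compatibility)           # True
print(f"U+{ord(nfc):04X}")        # U+FB51 (unchanged by NFC)
print(f"U+{ord(nfkc):04X}")       # U+0671 (folded by NFKC)
```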

Fixed in r60599.

whym mentioned this in T94826: Don't crash when MediaWiki returns a page title different from the query because of normalization (Arabic and Malayalam normalization in particular). May 28 2016, 7:56 AM
