
Punjabi Gurmukhi nukta / bindi character NFC normalization should be turned off
Open, Needs Triage, Public

Description

The following nukta / bindi letters in the Gurmukhi Unicode block have precomposed code points. However, Unicode lists them as composition exclusions, so NFC normalization decomposes each of them into the "parent" character followed by the combining nukta / bindi character:

ਸ਼ ਖ਼ ਲ਼ ਗ਼ ਫ਼ ਜ਼
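
To illustrate the decomposition described above, here is a minimal Python sketch using the standard library's `unicodedata` module. It is only a demonstration and not the code path Wikimedia uses; the list `PRECOMPOSED` and the helper `describe` are names made up for this example.

```python
import unicodedata

# The six precomposed Gurmukhi letters listed above, written as code points
# so the example does not depend on how this page itself was normalized.
PRECOMPOSED = ["\u0A36", "\u0A59", "\u0A33", "\u0A5A", "\u0A5E", "\u0A5B"]

def describe(ch):
    """Return the code points of the NFC form of ch, e.g. 'U+0A38 U+0A3C'."""
    nfc = unicodedata.normalize("NFC", ch)
    return " ".join(f"U+{ord(c):04X}" for c in nfc)

for ch in PRECOMPOSED:
    # Each letter comes back as base letter + U+0A3C (GURMUKHI SIGN NUKTA).
    print(f"U+{ord(ch):04X} -> {describe(ch)}")
```

For each of the six letters, the NFC result is two code points long, with the combining nukta U+0A3C in second position.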

This is apparently for backwards-compatibility reasons, but in practice most websites and databases that use Punjabi Gurmukhi prefer the precomposed characters. This is understandable: each of these letters represents a single consonant, and it is quite annoying for users to have to press backspace twice to delete them when one press is enough for other letters. Keyboard layouts tend to use the precomposed characters.

Because those sites use precomposed characters in their URLs, many Punjabi websites and external identifiers cannot be linked from Wikimedia sites. For example, at https://www.wikidata.org/wiki/Lexeme:L697770 the Sri Granth ID link is broken even though the target page exists. Entering the escape sequence manually in the property does not work either. This is also a problem for the lexeme data itself, for reconciling against other databases, for transliteration to Shahmukhi (Perso-Arabic script), and for use with newer fonts, which tend to assume that people are using the preferred precomposed characters.
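
A rough sketch of why such links can break: the precomposed and decomposed forms percent-encode to different byte sequences, so a server that matches URLs byte-for-byte will treat them as different addresses. This assumes the external site does exact matching, which I have not verified for Sri Granth; the single letter below stands in for a real identifier.

```python
import unicodedata
from urllib.parse import quote

precomposed = "\u0A36"  # precomposed letter, as an external site might store it
normalized = unicodedata.normalize("NFC", precomposed)  # base letter + nukta

# Percent-encode both forms the way they would appear in a URL path.
print(quote(precomposed))  # %E0%A8%B6
print(quote(normalized))   # %E0%A8%B8%E0%A8%BC
```

The two strings render identically on screen, but the encoded URLs differ in both length and content.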

I am not sure where the most effective and least controversial place to change this is. Would Unicode ever change this? Could this be changed in the NFC normalization library itself, or should it be changed on a case-by-case basis for inputs in Wikimedia projects where an override would be particularly warranted? Maybe someone here knows.
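
To make the case-by-case option concrete, here is one possible shape for such an override, sketched in Python: normalize to NFC as usual, then map the six base + nukta sequences back to their precomposed letters. The function name `nfc_keep_gurmukhi_precomposed` and the `RECOMPOSE` table are hypothetical and do not correspond to any existing MediaWiki or Wikibase code.

```python
import unicodedata

NUKTA = "\u0A3C"  # GURMUKHI SIGN NUKTA

# Hypothetical override table: decomposed sequence -> precomposed letter.
RECOMPOSE = {
    "\u0A38" + NUKTA: "\u0A36",
    "\u0A16" + NUKTA: "\u0A59",
    "\u0A32" + NUKTA: "\u0A33",
    "\u0A17" + NUKTA: "\u0A5A",
    "\u0A2B" + NUKTA: "\u0A5E",
    "\u0A1C" + NUKTA: "\u0A5B",
}

def nfc_keep_gurmukhi_precomposed(text):
    """Apply NFC, then restore the six precomposed Gurmukhi nukta letters."""
    out = unicodedata.normalize("NFC", text)
    for sequence, letter in RECOMPOSE.items():
        out = out.replace(sequence, letter)
    return out
```

Note that the output of such a function is by definition no longer in NFC, so it would only be suitable for specific fields (for example external identifiers) where matching the other site's bytes matters more than having normalized text.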

Event Timeline

See also T206188, a similar task about a nukta letter error in Bengali.