
Some names like "Aachen" are sorted wrongly in Norwegian
Open, Needs Triage, Public

Description

In Norwegian the bigram "Aa" is usually sorted as "Å", but in some cases, like "Aachen", this is wrong. The solution proposed by Unicode is to use either U+00AD SOFT HYPHEN or U+034F COMBINING GRAPHEME JOINER. Because the former renders a visible hyphen in some cases where it should not, as in "Aachen", the latter is often preferred.
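As a minimal sketch of the COMBINING GRAPHEME JOINER approach, assuming Python and a hypothetical helper named `break_aa`, the joiner can be placed between the two letters so that "Aa" is no longer a contiguous pair. The code only shows the string manipulation; whether a particular collation implementation then sorts the word under "A" is not tested here.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

def break_aa(word: str) -> str:
    """Insert CGJ between a leading 'A' and 'a' (hypothetical helper)."""
    if word[:2].lower() == "aa":
        return word[0] + CGJ + word[1:]
    return word

marked = break_aa("Aachen")
print(unicodedata.name(marked[1]))  # COMBINING GRAPHEME JOINER
print(len("Aachen"), len(marked))   # 6 7
```

The joiner has no visible glyph of its own, which is why it is preferred here over the soft hyphen. Words where "Aa" really does stand for "Å" would be left unmarked.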

Thus, create an entity named "cgj" for COMBINING GRAPHEME JOINER.

For a description of the codepoint, see w:Combining Grapheme Joiner.

A more involved solution is given at T175802: Collation sequences, failed contraction matches in Norwegian.

Event Timeline

Change 476010 had a related patch set uploaded (by John Erling Blad; owner: John Erling Blad):
[mediawiki/core@master] Core: Added entity for "combining grapheme joiner"

https://gerrit.wikimedia.org/r/476010

Change 476010 abandoned by John Erling Blad:
Core: Added entity for "combining grapheme joiner"

Reason:
This bug needs a solution, and there is a solution, but I doubt any change is going to be accepted.

https://gerrit.wikimedia.org/r/476010

Seems like this is a non-existing problem, or rather "problem does not exist in my language".
Perhaps someone else can (and should) fix it.

I don't have any problem with using the Unicode codepoint. I just don't think we should invent a bogus entity name for it. You can use the codepoint as a hex or decimal character reference in wikitext. Perhaps even file a bug with the W3C/WHATWG to add the &cgj; entity upstream. Many places in MW assume that the MW entity names correspond to valid HTML entity names; I don't think wikitext should have its own nonstandard entities.
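For illustration, a small Python sketch (the helper name `char_refs` is hypothetical) that derives the hex and decimal character references for U+034F and checks that the standard library's HTML decoder maps both back to the same character:

```python
import html

def char_refs(ch: str) -> tuple[str, str]:
    """Return the hex and decimal character references for one character."""
    cp = ord(ch)
    return f"&#x{cp:X};", f"&#{cp};"

hex_ref, dec_ref = char_refs("\u034f")
print(hex_ref, dec_ref)  # &#x34F; &#847;

# Both numeric references decode to U+034F.
assert html.unescape(hex_ref) == html.unescape(dec_ref) == "\u034f"

# How the name might be written in wikitext source (illustrative only).
wikitext = f"A{hex_ref}achen"
print(wikitext)  # A&#x34F;achen
```

Either form avoids the need for a named entity, at the cost of being less readable in the page source.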