Some names like "Aachen" are sorted wrongly in Norwegian
Open, Needs Triage, Public


In Norwegian the bigram "Aa" is usually sorted as "Å", but in some cases, like "Aachen", this is wrong. The solution proposed by Unicode is to use either U+00AD SOFT HYPHEN or U+034F COMBINING GRAPHEME JOINER to break up the bigram. Because the former renders a visible hyphen in some places where it should not appear, as in Aachen, the latter is often preferred.

The proposal is therefore to create an "entity" &cgj; for COMBINING GRAPHEME JOINER.

For a description of the codepoint, see w:Combining Grapheme Joiner.
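As a rough illustration of the workaround described above, here is a minimal Python sketch that inserts U+034F between the leading "A" and "a" so that the two letters are no longer adjacent code points. The helper name break_aa is hypothetical and not part of MediaWiki. The sketch only shows the string manipulation; how a given collator then sorts the result is not demonstrated here and would need a collation library such as ICU.

```python
import unicodedata

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER


def break_aa(word: str) -> str:
    """Hypothetical helper: insert CGJ between a leading 'A' and 'a'."""
    if word[:2] == "Aa":
        return word[0] + CGJ + word[1:]
    return word


marked = break_aa("Aachen")

# The joiner is invisible when rendered, so list the code points instead.
print(unicodedata.name(CGJ))
print([f"U+{ord(c):04X}" for c in marked])

# CGJ has no decomposition and combining class 0, so NFC leaves it in place.
print(unicodedata.normalize("NFC", marked) == marked)
```

Names that really should sort under "Å" would simply be left without the joiner.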

A more involved solution is given at T175802: Collation sequences, failed contraction matches in Norwegian

jeblad created this task. Nov 27 2018, 12:47 PM
Restricted Application added subscribers: Danmichaelo, jhsoby, Aklapper. Nov 27 2018, 12:47 PM

Change 476010 had a related patch set uploaded (by John Erling Blad; owner: John Erling Blad):
[mediawiki/core@master] Core: Added entity for "combining grapheme joiner"

jeblad updated the task description. Nov 27 2018, 1:16 PM

Change 476010 abandoned by John Erling Blad:
Core: Added entity for "combining grapheme joiner"

This bug needs a solution, and there is a solution, but I doubt any change is going to be accepted.

It seems this is regarded as a nonexistent problem, or rather as a case of "problem does not exist in my language".
Perhaps someone else can (and should) fix it.

cscott added a subscriber: cscott. Nov 28 2018, 8:24 PM

I don't have any problem with using the Unicode codepoint. I just don't think we should invent a bogus entity name for it. You can use the codepoint as a hex or decimal character reference in wikitext. Perhaps even file a bug with the W3C/WHATWG to add the &cgj; entity upstream. Many places in MW assume that the MW entity names correspond to valid HTML entity names; I don't think wikitext should have its own nonstandard entities.
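To make the character-reference suggestion concrete, here is a small Python sketch. It uses Python's standard html module as a stand-in for an HTML-style entity decoder; it is an assumption that this mirrors how the references resolve in practice, and MediaWiki's own parser is not exercised here. The hex form &#x34F; and the decimal form &#847; both name U+034F, while cgj is not among the HTML5 named entities.

```python
import html
from html.entities import html5

# Hex and decimal numeric character references for U+034F.
hex_form = html.unescape("A&#x34F;achen")
dec_form = html.unescape("A&#847;achen")

print(hex_form == dec_form)        # both decode to the same string
print(hex_form == "A\u034fachen")  # CGJ sits between "A" and "a"

# "cgj" is not a standard HTML5 entity name, so it is left undecoded.
print("cgj;" in html5)
print(html.unescape("A&cgj;achen"))
```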

jeblad removed jeblad as the assignee of this task. Nov 29 2018, 7:23 PM