Collation sequences, failed contraction matches in Norwegian
Open, Needs Triage · Public

Description

This is a bit weird, but the problem exists in several languages, at least German, Dutch, Danish, Swedish, and Norwegian.

Given that I am a pink Norwegian Viking writing Norwegian
and want to categorize and sort "Ivar Aasen" properly
When I write {{DEFAULTSORT:Aasen, Ivar}}
Then I expect "Aa" to be sorted as "Å" (that is, it is a contraction match)

Now compare to

Given that I am a pink Norwegian Viking writing Norwegian
and want to categorize and sort "Aachen" properly
When I write {{DEFAULTSORT:Aachen}}
Then I expect "Aa" to not be sorted as "Å" (that is, it should not be a contraction match)

There are no simple rules to handle this, so I propose that we add a function CUSTOMSORT that injects U+034F COMBINING GRAPHEME JOINER into letter sequences that would otherwise result in a contraction match. I can't see a simple way to handle mixed matches, that is, a name with both "aa" and "å", but I can't see any reason for doing that either. There are, however, a few such names, like "Åsgaardsreien" (a mythological fable) and the family names Åsgaard, Årvaag and Ågaard. These should be sorted as contraction matches.
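A minimal sketch of what the proposed CUSTOMSORT could do to its argument, written in Python only for illustration. CUSTOMSORT does not exist yet, and the helper name `block_contractions` and its `contractions` parameter are hypothetical; the only thing taken from the proposal is the idea of inserting U+034F inside each sequence that would otherwise match.

```python
import re

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER


def block_contractions(sort_key, contractions=("aa",)):
    """Insert CGJ between the letters of every listed contraction.

    Hypothetical helper: the list of contractions is an assumption and
    would depend on the collation in use on the wiki.
    """
    for contraction in contractions:
        pattern = re.compile(re.escape(contraction), re.IGNORECASE)
        # "Aa" becomes "A" + CGJ + "a", which keeps the original casing.
        sort_key = pattern.sub(lambda m: CGJ.join(m.group(0)), sort_key)
    return sort_key


blocked = block_contractions("Aachen")
unchanged = block_contractions("Alta")
```

With something like this, {{CUSTOMSORT:Aachen}} would store a key where "Aa" no longer matches the contraction, while {{DEFAULTSORT:Aasen, Ivar}} would keep today's behaviour.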

Note the quote from http://www.unicode.org/reports/tr10/#Input_Matching

A sequence of characters which otherwise would result in a contraction match can be made to sort as separate characters by inserting, someplace within the sequence, a starter that maps to a completely ignorable collation element. By definition this creates a blocking context, even though the completely ignorable collation element would not otherwise affect the assigned collation weights. There are two characters, U+00AD SOFT HYPHEN and U+034F COMBINING GRAPHEME JOINER, that are particularly useful for this purpose. These can be used to separate sequences of characters that would normally be weighted as units, such as Slovak "ch" or Danish "aa".

It also points to http://www.unicode.org/reports/tr10/#Combining_Grapheme_Joiner which says U+034F COMBINING GRAPHEME JOINER should be used in these cases.
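The blocking behaviour described in the quote can be illustrated with a toy collation key. This is not the UCA algorithm and not what MediaWiki or ICU actually run; it is a simplified sketch that assumes a single contraction ("aa" weighted as "å"), a 29-letter Norwegian alphabet, and that U+00AD and U+034F are the only completely ignorable characters.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
WEIGHT = {ch: i for i, ch in enumerate(ALPHABET, start=1)}
IGNORABLE = {"\u00ad", "\u034f"}  # SOFT HYPHEN, COMBINING GRAPHEME JOINER


def toy_sort_key(text):
    """Toy primary-level key: "aa" is weighted as "å" unless it is interrupted."""
    s = text.lower()
    key = []
    i = 0
    while i < len(s):
        if s[i:i + 2] == "aa":
            # Contraction match: two letters become one collation element.
            key.append(WEIGHT["å"])
            i += 2
        elif s[i] in IGNORABLE:
            # Contributes no weight, but it has already broken up the "aa".
            i += 1
        else:
            key.append(WEIGHT.get(s[i], 0))  # unknown characters sort first
            i += 1
    return tuple(key)


plain = sorted(["Aasen, Ivar", "Alta", "Aachen"], key=toy_sort_key)
blocked = sorted(["Aasen, Ivar", "Alta", "A\u034fachen"], key=toy_sort_key)
```

In `plain`, "Aachen" lands among the "Å" entries after "Alta"; in `blocked`, the key with the joiner sorts under "A" before "Alta", while "Aasen, Ivar" still sorts under "Å".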

In some cases the contraction match hits strings like "temaavis". This is a compound of "tema" (theme) and "avis" (newspaper), and the "aa" should not give a contraction match. In those cases it should be valid to use U+00AD SOFT HYPHEN (&shy;), as in {{DEFAULTSORT:tema&shy;avis}}. The same sort key might also contain genuine matches, as in {{DEFAULTSORT:Aadalen (tema&shy;avis)}}. The example is constructed; I am not sure whether there are any real examples. Other such words are "madonna­avbildning", "fakta­ark", and "data­animert". A title in use is "meta­analyse", which will now be sorted as "metånalyse", which is wrong. It does not pose any real problem, as the invalid sorting only hits at the fourth letter.

An alternative is to have a kind of TITLELANG for the language of the title string, and then fall back to verbatim sorting for non-conforming languages, that is, languages that do not share a macrolanguage with the content language. The content language "nb" is part of "no", so a title language of "nn" should be handled as conforming. That could solve the sorting of "Amund Maarud" (https://no.wikipedia.org/wiki/Amund_Maarud) vs "Amin Maalouf" (https://no.wikipedia.org/wiki/Amin_Maalouf), where the latter surname is non-Norwegian, but it would not solve "Alta" vs "Aachen", where both names are valid in Norwegian.
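The TITLELANG idea could be sketched roughly as follows. Everything here is an assumption made for illustration: TITLELANG is not an existing magic word, the `MACROLANGUAGE` table only covers the two codes mentioned above, and "verbatim sorting" is approximated by inserting U+034F inside every "aa".

```python
import re

CGJ = "\u034f"  # U+034F COMBINING GRAPHEME JOINER

# Assumed mapping from language code to macrolanguage; only the codes
# discussed in this task are listed.
MACROLANGUAGE = {"nb": "no", "nn": "no"}


def macro(lang):
    return MACROLANGUAGE.get(lang, lang)


def is_conforming(title_lang, content_lang):
    """A title conforms when it shares a macrolanguage with the content."""
    return macro(title_lang) == macro(content_lang)


def prepare_sort_key(title, title_lang, content_lang="nb"):
    """Leave conforming titles alone; block "aa" in all other titles."""
    if is_conforming(title_lang, content_lang):
        return title
    # Insert CGJ after every "a" that is directly followed by another "a".
    return re.sub(r"a(?=a)", lambda m: m.group(0) + CGJ, title, flags=re.IGNORECASE)


norwegian = prepare_sort_key("Maarud, Amund", "nb")
foreign = prepare_sort_key("Maalouf, Amin", "fr")
```

Here "Maarud, Amund" keeps its contraction and "Maalouf, Amin" gets a blocked key, but as noted above this cannot separate "Alta" from "Aachen" when both titles are tagged as Norwegian.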

jeblad created this task. Sep 13 2017, 9:14 AM

I'm sorry, but I think you're making this too complicated. The simplest solution is to not automatically sort "aa" as "å", and use DEFAULTSORT where "aa" should be sorted as "å". For the Norwegian Bokmål Wikipedia, this is already more or less done – I've done several rounds myself of checking articles with "aa" in their name to fix their sorting where necessary, and it's not too much work.