
Appropriately ignore diacritics for German-language wikis
Open, Normal, Public

Description

When searching for certain words with diacritics, the user gets different results than when searching without them.
I stumbled over this issue when looking for Dhalle and Dhallë, respectively, on WP:DE.¹
When a "normal" user sees a word with unusual diacritics that are not easy to reproduce, the search engine should still deliver helpful results.

¹
Search for Dhalle:


Search for Dhallë:

Also, be sure to update the German WP documentation on character folding.

Event Timeline

Malenki created this task. Jul 5 2015, 9:04 PM
Malenki raised the priority of this task from to Needs Triage.
Malenki updated the task description.
Malenki added a project: CirrusSearch.
Malenki added a subscriber: Malenki.
Restricted Application added a project: Discovery. Jul 5 2015, 9:04 PM
Restricted Application added a subscriber: Aklapper.

This is an extremely broad request: it asks for normalization rules for every language, which makes it sound unfixable as stated.
Specific, well-defined requests (such as T71361) might be doable, though.

Deskana renamed this task from ignore diacritics to ignore diacritics whenever appropriate (for some value of appropriate). Dec 31 2015, 5:04 AM
Deskana triaged this task as Low priority.
Deskana set Security to None.
Deskana moved this task from Needs triage to Search on the Discovery board.

@Aklapper is right that this is a very broad request, and we should really take a language-by-language approach. I suggest re-purposing this ticket to "appropriately ignore diacritics for German-language wikis", or closing it and opening another with that purpose. I also summon @debt to weigh in on that and subsequent prioritization. (Sorry, Deb!)

I'm going to set aside ß for a minute. Weird stuff is happening with ß (see T87136).

A quick test reveals that the built-in Elastic German language analyzer handles e with diacritics inconsistently. Acute accents are removed from a, i, o, and u, but not from e (é). The same goes for grave accents (è), umlauts (ë), and circumflexes (ê): all are stripped from a, i, o, and u, but not from e.

Despite not having much currency in German orthography, none of å, ø, ñ, ã, õ, ÿ, ç, œ, or æ are converted/folded.

The German language analyzer does convert/fold ä, ö, and ü to their plain counterparts as part of its normal processing, so I'll go with that being correct.
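
To make the difference between plain accent stripping and fuller folding concrete, here is a minimal Python sketch. It is not what the Elastic analyzer does internally; it only shows the simplest possible approach, removing combining marks after canonical decomposition. `strip_marks` is a hypothetical helper name used for illustration.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD).

    Illustration only: this is not the Elastic German analyzer's logic.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks such as
    # the acute accent, diaeresis, ring above, tilde, and cedilla.
    kept = [ch for ch in decomposed if not unicodedata.combining(ch)]
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    for ch in "éèëêåøñãõÿçœæß":
        print(ch, "->", strip_marks(ch))
```

With this approach é, è, ë, and ê all reduce to e, so Dhallë becomes Dhalle. The characters ø, œ, æ, and ß come through unchanged, because they have no canonical decomposition into a base letter plus a combining mark; handling them takes an explicit mapping or a fuller folding.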

We could unpack the German analyzer into its constituent parts (as we've done with English and French; see T142620 and my write-up for French), and enable general ICU folding. We can also configure any folding exceptions we need, though I don't think we need any.
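
As a rough sketch of what such an unpacked configuration might look like, the Python snippet below builds index settings as a plain dict. The stop, normalization, and stemmer components follow the custom-analyzer equivalent of the `german` analyzer given in the Elasticsearch documentation (minus the optional keyword marker). Everything else is an assumption for illustration: the names `build_german_settings`, `german_unpacked`, and `german_icu_folding` are hypothetical, placing folding last in the chain is a guess rather than a tested choice, `icu_folding` requires the ICU analysis plugin, and the exception parameter is spelled `unicode_set_filter` in recent plugin versions (older versions used a different spelling).

```python
import json


def build_german_settings(folding_exceptions: str = "") -> dict:
    """Sketch of an unpacked German analyzer with ICU folding added.

    folding_exceptions: characters to exempt from folding, e.g. "äöü".
    Empty by default, matching the view that no exceptions are needed.
    """
    icu_folding = {"type": "icu_folding"}
    if folding_exceptions:
        # A negated UnicodeSet tells the filter to leave these alone.
        icu_folding["unicode_set_filter"] = f"[^{folding_exceptions}]"

    return {
        "analysis": {
            "filter": {
                "german_stop": {"type": "stop", "stopwords": "_german_"},
                "german_stemmer": {
                    "type": "stemmer",
                    "language": "light_german",
                },
                # Hypothetical filter name; needs the analysis-icu plugin.
                "german_icu_folding": icu_folding,
            },
            "analyzer": {
                # Hypothetical analyzer name.
                "german_unpacked": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "german_stop",
                        "german_normalization",
                        "german_stemmer",
                        # Assumed position: fold whatever is left at the end.
                        "german_icu_folding",
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_german_settings(), indent=2, ensure_ascii=False))
```

Where folding sits relative to the stemmer would need testing against real German-language wiki data before settling on an order.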

We could also see if there's anything useful we can do to make ß behave as we'd want, or at least get a better idea of where the unwanted conversion to ss is happening, though we may have to handle that separately in T87136.

Restricted Application added a project: Discovery-Search. Jun 27 2017, 9:38 PM
debt raised the priority of this task from Low to Normal. Jun 28 2017, 6:06 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Thanks, @TJones - I'll repurpose this ticket. :)

debt renamed this task from ignore diacritics whenever appropriate (for some value of appropriate) to Appropriately ignore diacritics for German-language wikis. Jun 28 2017, 6:06 PM
TJones updated the task description. Oct 27 2017, 4:56 PM
TJones updated the task description.