
Appropriately ignore diacritics for German-language wikis
Closed, ResolvedPublic

Description

When searching for certain words with diacritics, the user gets different results than when searching without them.
I stumbled over this issue when looking for Dhalle and Dhallë, respectively, on WP:DE.¹
When a "normal" user sees a word with strange diacritics that are not easy to reproduce, the search engine should still deliver helpful results.

¹
Search for Dhalle:

2015-07-05_230005_scr_wp-de_dhalle-suche.png (1×1 px, 208 KB)

Search for Dhallë:
2015-07-05_230021_scr_wp-de_dhalle-suche.png (1×1 px, 183 KB)

Also, be sure to update the German WP documentation on character folding.

Event Timeline

Malenki raised the priority of this task from to Needs Triage.
Malenki updated the task description.
Malenki added a project: CirrusSearch.
Malenki added a subscriber: Malenki.
Restricted Application added a subscriber: Aklapper.

This is an extremely broad request that asks for normalization rules for every kind of language, so it sounds unfixable.
If there are specific and well-defined requests (such as T71361), those might be doable, though.

Deskana renamed this task from ignore diacritics to ignore diacritics whenever appropriate (for some value of appropriate). Dec 31 2015, 5:04 AM
Deskana triaged this task as Low priority.
Deskana set Security to None.
Deskana moved this task from Needs triage to Search on the Discovery board.

@Aklapper is right that this is a very broad request, and we should really take a language-by-language approach. I suggest re-purposing this ticket to "appropriately ignore diacritics for German-language wikis", or closing it and opening another with that purpose. I also summon @debt to weigh in on that and subsequent prioritization. (Sorry, Deb!)

I'm going to set aside ß for a minute. Weird stuff is happening with ß (see T87136).

A quick test reveals that the built-in Elastic German language analyzer treats e's with diacritics weirdly. Acute accents are removed from a, i, o, and u, but not from e (é is left as-is). The same goes for grave accents (è), umlauts (ë), and circumflexes (ê): all are stripped from a, i, o, and u, but not from e.

Despite not having much currency in German orthography, none of å, ø, ñ, ã, õ, ÿ, ç, œ, or æ are converted/folded.

The German language analyzer does convert/fold ä, ö, and ü to their plain counterparts as part of its normal processing, so I'll go with that being correct.

We could unpack the German analyzer into its constituent parts (as we've done with English and French; see T142620 and my write-up for French) and enable general ICU folding. We can also configure any folding exceptions we need, though I don't think we need any.
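As a rough illustration of what "unpacking" could look like, here is a sketch of Elasticsearch analysis settings written as a Python dict. It follows the rebuilt `german` analyzer described in the Elasticsearch documentation and appends the `icu_folding` filter from the analysis-icu plugin. The analyzer name `german_folded`, the filter name `folding`, and the position of the folding step are assumptions made for illustration; this is not the actual CirrusSearch configuration.

```python
import json

# Sketch only: an "unpacked" German analyzer with ICU folding added.
# "german_folded" and "folding" are hypothetical names chosen for this example.
GERMAN_ANALYSIS = {
    "filter": {
        "german_stop": {"type": "stop", "stopwords": "_german_"},
        "german_stemmer": {"type": "stemmer", "language": "light_german"},
        # Requires the analysis-icu plugin. Folding exceptions, if any were
        # needed, could be expressed with a "unicode_set_filter" entry here.
        "folding": {"type": "icu_folding"},
    },
    "analyzer": {
        "german_folded": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
                "lowercase",
                "german_stop",
                "german_normalization",  # built-in German normalization step
                "german_stemmer",
                "folding",               # assumption: folding runs after stemming
            ],
        }
    },
}

if __name__ == "__main__":
    # Print the settings as JSON, e.g. for use in an index-creation request.
    print(json.dumps({"settings": {"analysis": GERMAN_ANALYSIS}}, indent=2))
```

Where the folding step sits relative to the stemmer is a design choice that would need testing.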

We could also see if there's anything useful we can do to make ß behave as we'd want (or at least get a better idea of where the unwanted conversion to ss is happening), though we may have to handle that separately in T87136.

debt raised the priority of this task from Low to Medium. Jun 28 2017, 6:06 PM
debt moved this task from needs triage to Up Next on the Discovery-Search board.

Thanks, @TJones - I'll repurpose this ticket. :)

debt renamed this task from ignore diacritics whenever appropriate (for some value of appropriate) to Appropriately ignore diacritics for German-language wikis. Jun 28 2017, 6:06 PM
TJones updated the task description.

After T281379 is deployed and T284185 is complete, recheck this ticket. I believe it should be fixed.

This is partially fixed by the recent changes in T281379 & T284185: the general case now works, but the specific example from the description does not work as one might hope.

Searching for Deuch or Dëuch returns the same results (possibly ordered a little differently), and the Będusz/Bedusz example from T226812 also now works as expected.

The Dhallë/Dhalle example is a little different because it interacts with the German stemmer. The German stemmer removes the final -e from Dhalle, but not the -ë from Dhallë; then the ICU normalization converts ë to e. (So the stems are Dhallë → dhalle and Dhalle → dhall.)

We could move ICU folding before the stemmer, but there could be unintended consequences there, too, and that would require more testing.
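The ordering effect can be reproduced with a toy model. In the sketch below, `toy_stem` is a hypothetical stand-in for the German stemmer (it only strips a final plain "e"), and `fold_diacritics` approximates ICU folding by decomposing characters and dropping combining marks. Neither is the real filter; the point is only to show why the order of the two steps matters.

```python
import unicodedata


def fold_diacritics(token: str) -> str:
    """Rough stand-in for ICU folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for the German stemmer: strip one final plain 'e'."""
    return token[:-1] if token.endswith("e") else token


def stem_then_fold(word: str) -> str:
    """Order described in the comment above: stemmer first, folding second."""
    return fold_diacritics(toy_stem(word.lower()))


def fold_then_stem(word: str) -> str:
    """Alternative order: folding first, stemmer second."""
    return toy_stem(fold_diacritics(word.lower()))


if __name__ == "__main__":
    for word in ("Dhalle", "Dhallë"):
        print(word, stem_then_fold(word), fold_then_stem(word))
```

In this toy model, stemming first leaves the two spellings with different stems, while folding first makes them agree. Whether the real stemmer behaves well on pre-folded input for other words is exactly the kind of unintended consequence that would need testing.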

I'm inclined to call this ticket done, since the general issue of "strange diacritics" has been addressed, and there are always corner cases in language where things don't work as intended. But if the consensus is that the specific case of Dhallë/Dhalle needs to be addressed, we can put it back in the backlog.

TJones claimed this task.
