Appropriately ignore diacritics for German-language wikis
Closed, Resolved · Public

Authored By

	Malenki
	Jul 5 2015, 9:04 PM

Description

When searching for certain words with diacritics, the user gets different results than when searching without them.
I stumbled over this issue when looking for Dhalle and Dhallë, respectively, in WP:DE.¹
When a "normal" user sees a word with strange diacritics that are not easy to reproduce, the search engine should still deliver helpful results.

¹
Search for Dhalle:

2015-07-05_230005_scr_wp-de_dhalle-suche.png (1×1 px, 208 KB)

Search for Dhallë:

2015-07-05_230021_scr_wp-de_dhalle-suche.png (1×1 px, 183 KB)

Also, be sure to update the German WP documentation on character folding.

Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T281379 Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
Resolved	TJones	T284185 Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
Resolved	TJones	T104814 Appropriately ignore diacritics for German-language wikis

Event Timeline

Malenki created this task. Jul 5 2015, 9:04 PM

Malenki raised the priority of this task from to Needs Triage.

Malenki updated the task description. (Show Details)

Malenki added a project: CirrusSearch.

Malenki subscribed.

Restricted Application added a project: Discovery-ARCHIVED. Jul 5 2015, 9:04 PM

Restricted Application added a subscriber: Aklapper.

This is an extremely broad request: it asks for normalization rules for every language, so it sounds unfixable as stated.
Specific, well-defined requests (such as T71361) might be doable, though.

• Deskana renamed this task from ignore diacritics to ignore diacritics whenever appropriate (for some value of appropriate). Dec 31 2015, 5:04 AM

• Deskana triaged this task as Low priority.

• Deskana set Security to None.

• Deskana moved this task from Inbox to Multilingual and cross-project on the CirrusSearch board.

• Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board.

ObsequiousNewt mentioned this in T132637: Lack of diacritic folding in e.g. Ancient Greek. Apr 13 2016, 10:01 PM

@Aklapper is right that this is a very broad request, and we should really take a language-by-language approach. I suggest re-purposing this ticket to "appropriately ignore diacritics for German-language wikis", or closing it and opening another with that purpose. I also summon @debt to weigh in on that and subsequent prioritization. (Sorry, Deb!)

I'm going to set aside ß for a minute. Weird stuff is happening with ß (see T87136).

A quick test reveals that the built-in Elastic German language analyzer treats e with diacritics inconsistently. Acute accents are removed from a, i, o, and u, but not from e (é). The same goes for grave accents (è), umlauts (ë), and circumflexes (ê): all are stripped from a, i, o, and u, but not from e.

Despite not having much currency in German orthography, none of å, ø, ñ, ã, õ, ÿ, ç, œ, or æ are converted/folded.

The German language analyzer does convert/fold ä, ö, and ü to their plain counterparts as part of its normal processing, so I'll go with that being correct.
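To make "folding" concrete, here is a minimal Python sketch that strips combining marks after Unicode decomposition. It is an illustration only: it is neither the Elastic German analyzer nor ICU folding, and the function name `strip_diacritics` and its `keep` parameter are made up for this example. It shows that letters like é, ë, å, ñ, and ç reduce to their base letters through decomposition alone, while ø, œ, æ, and ß have no canonical decomposition and would need dedicated folding rules.

```python
import unicodedata


def strip_diacritics(text, keep=""):
    """Remove combining marks from text, leaving characters in `keep` untouched.

    Hypothetical helper for illustration; not the analyzer's actual logic.
    """
    keep = unicodedata.normalize("NFC", keep)
    out = []
    # Work on precomposed characters so the `keep` check sees whole letters.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose into base letter + combining marks, then drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    for word in ["Dhallë", "Będusz", "éèëê", "åñãõÿç", "øœæß"]:
        print(word, "->", strip_diacritics(word))
    # Exceptions can be listed explicitly, e.g. to leave German umlauts alone:
    print(strip_diacritics("Müller é", keep="äöü"))
```

The last four characters come back unchanged, which is one reason a purpose-built folding step is more thorough than plain decomposition.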

We could unpack the German analyzer into its constituent parts (as we've done with English and French—see T142620 and my write up for French), and enable general ICU folding. We can also configure any folding exceptions we need—though I don't think we need any.
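As a rough sketch of what "unpacking" could look like, the Python snippet below builds index settings for a custom analyzer that mirrors the documented components of Elasticsearch's built-in `german` analyzer and appends an `icu_folding` filter (from the ICU analysis plugin). The names `unpacked_german` and `folding` are hypothetical, the optional keyword-marker filter is omitted, and the configuration CirrusSearch actually deployed is not shown in this thread and may differ.

```python
def unpacked_german_settings(fold_exceptions=""):
    """Return Elasticsearch index settings for an unpacked German analyzer.

    fold_exceptions: characters that ICU folding should leave untouched
    (hypothetical parameter; the comment above suggests none are needed).
    """
    icu_folding = {"type": "icu_folding"}
    if fold_exceptions:
        # icu_folding accepts a UnicodeSet restricting which characters it folds.
        icu_folding["unicode_set_filter"] = "[^" + fold_exceptions + "]"

    return {
        "analysis": {
            "filter": {
                "german_stop": {"type": "stop", "stopwords": "_german_"},
                "german_stemmer": {"type": "stemmer", "language": "light_german"},
                "folding": icu_folding,
            },
            "analyzer": {
                "unpacked_german": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "german_stop",
                        "german_normalization",
                        "german_stemmer",
                        "folding",  # assumed position: after the stemmer
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(unpacked_german_settings(), indent=2, ensure_ascii=False))
```

Placing the folding filter last is an assumption here. Where it sits relative to the stemmer matters, because the stemmer sees different input depending on the order.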

We could also see if there's anything useful we can do to make ß behave as we'd want, or at least get a better idea of where the unwanted conversion to ss is happening, though we may have to handle that separately in T87136.

Restricted Application added a project: Discovery-Search. Jun 27 2017, 9:38 PM

Thanks, @TJones - I'll repurpose this ticket. :)

debt renamed this task from ignore diacritics whenever appropriate (for some value of appropriate) to Appropriately ignore diacritics for German-language wikis. Jun 28 2017, 6:06 PM

TJones mentioned this in T179081: Full text search does not find article with accented word in dewiki. Oct 27 2017, 2:53 PM

TJones updated the task description. (Show Details) Oct 27 2017, 4:56 PM

TJones updated the task description. (Show Details)

FriedhelmW subscribed. Oct 27 2017, 5:29 PM

TJones moved this task from Up Next to Language Stuff on the Discovery-Search board. Jan 29 2019, 6:44 PM

TJones mentioned this in T226812: de.wikipedia: search for "Bedusz" does not find "Będusz". Jul 8 2019, 8:47 PM

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. Mar 17 2021, 9:09 PM

After T281379 is deployed and T284185 is complete, recheck this ticket. I believe it should be fixed.

Gehel added a parent task: T284185: Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions. Jun 9 2021, 3:10 PM

This is generally fixed, at least in part, by the recent changes in T281379 and T284185. However, the specific example from the task description does not work as one might hope.

Searching for Deuch or Dëuch returns the same results (possibly ordered a little differently), and the Będusz/Bedusz example from T226812 also now works as expected.
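For anyone who wants to recheck pairs like these, here is a hedged Python sketch that compares the result titles for two queries through the public MediaWiki search API (`action=query&list=search`). The helper names, the User-Agent string, and the `fetch` injection point are assumptions made for this example; live results depend on the current index and may change over time.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://de.wikipedia.org/w/api.php"


def search_url(term, limit=10):
    """Build a full-text search URL for the German Wikipedia API."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def fetch_json(url):
    """Fetch and decode a JSON response (placeholder User-Agent string)."""
    req = Request(url, headers={"User-Agent": "diacritics-recheck-example/0.1"})
    with urlopen(req) as resp:
        return json.load(resp)


def search_titles(term, fetch=fetch_json):
    """Return the page titles of the search hits for `term`."""
    data = fetch(search_url(term))
    return [hit["title"] for hit in data["query"]["search"]]


def same_result_set(a, b, fetch=fetch_json):
    """True if both queries return the same titles, ignoring order."""
    return set(search_titles(a, fetch)) == set(search_titles(b, fetch))


if __name__ == "__main__":
    for pair in [("Bedusz", "Będusz"), ("Dhalle", "Dhallë")]:
        print(pair, same_result_set(*pair))
```

Comparing sets rather than lists ignores ranking differences, which matches the "possibly ordered a little differently" caveat above.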

The Dhallë/Dhalle example is a little different because it interacts with the German stemmer. The German stemmer removes the final -e from Dhalle, but not the -ë from Dhallë—then the ICU normalization converts ë to e. (So the stems are Dhallë → dhalle and Dhalle → dhall.)

We could move ICU folding before the stemmer, but there could be unintended consequences there, too, and that would require more testing.
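The ordering effect can be reproduced with a toy pipeline. In the Python sketch below, `toy_stem` is a deliberately simplistic stand-in that only strips a final plain "e" (it is not the real German stemmer), and `fold` strips combining marks via Unicode decomposition rather than using ICU. Under those assumptions, stemming before folding yields the two different stems described above, while folding first makes them match.

```python
import unicodedata


def fold(token):
    """Strip combining marks (a stand-in for ICU folding)."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def toy_stem(token):
    """Toy stemmer: drop a final plain 'e'. Not the real German stemmer."""
    return token[:-1] if token.endswith("e") else token


def stem_then_fold(word):
    """Order described above: the stemmer runs first, folding afterwards."""
    return fold(toy_stem(word.lower()))


def fold_then_stem(word):
    """Alternative order: fold first, so the stemmer sees a plain 'e'."""
    return toy_stem(fold(word.lower()))


if __name__ == "__main__":
    for word in ["Dhalle", "Dhallë"]:
        print(word, stem_then_fold(word), fold_then_stem(word))
    # Stem-then-fold: Dhalle -> dhall, Dhallë -> dhalle (no match).
    # Fold-then-stem: both -> dhall (match).
```

The toy version says nothing about the unintended consequences mentioned above. A real stemmer has many more rules that could react differently to folded input, which is why the change would need testing.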

I'm inclined to call this ticket done, since the general issue of "strange diacritics" has been addressed, and there are always corner cases in language where things don't work as intended. But if the consensus is that the specific case of Dhallë/Dhalle needs to be addressed, we can put it back in the backlog.

In T104814#7170085, @TJones wrote:

I'm inclined to call this ticket done, since the general issue of "strange diacritics" has been addressed

	F188892: 2015-07-05_230021_scr_wp-de_dhalle-suche.png
	Jul 5 2015, 9:04 PM

	F188891: 2015-07-05_230005_scr_wp-de_dhalle-suche.png
	Jul 5 2015, 9:04 PM
