~"daß" should not match "dass"
Closed, Resolved · Public

Assigned To

Authored By

	FriedhelmW
	Jan 18 2015, 8:42 AM

Description

Quotes turn on exact term matches, but searching for ~"daß" also finds pages containing "dass" on de.wikipedia. Compare this to the behaviour of Google, which finds 28,200 results for "daß" and 1,630,000 for "dass" on site:de.wikipedia.org.

Details

	Subject	Repo	Branch	Lines +/-
	Unpack German, Portuguese, and Dutch Elasticsearch Analyzers	mediawiki/extensions/CirrusSearch	master	+939 -234


Related Objects

Mentioned In:
T284185: Reindex German, Dutch, and Portugese Wikis to Enabled Unpacked Versions
T272606: [EPIC] Unpack all Elasticsearch analyzers
T226812: de.wikipedia: search for "Bedusz" does not find "Będusz"
T182856: Add current issues to "exactly this text" helptext
T182447: Search in "" does not distinguish between "ss" and "ß"
T104814: Appropriately ignore diacritics for German-language wikis
T147636: German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions
T87112: Wrong stemming in German
T90089: ~"Maße" should not match "Masse"
Mentioned Here:
T281379: Unpack German, Portuguese, and Dutch Elasticsearch Analyzers
T90089: ~"Maße" should not match "Masse"

Event Timeline

FriedhelmW created this task. Jan 18 2015, 8:42 AM

FriedhelmW raised the priority of this task from to Needs Triage.

FriedhelmW updated the task description.

FriedhelmW added a project: CirrusSearch.

FriedhelmW subscribed.

Restricted Application added a subscriber: Aklapper. Jan 18 2015, 8:42 AM

Another example: ~"Maße" and ~"Masse" (T90089). They have different meanings in German.

FriedhelmW renamed this task from ~"daß" matches "dass" to ~"daß" should not match "dass". Jan 23 2015, 7:23 AM

FriedhelmW set Security to None.

This was discussed on de.wp in https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia/Archiv/2015/Woche_02#Wie_kann_man_gezielt_nach_.22da.C3.9F.22_suchen.3F.

FriedhelmW added a project: MediaWiki-Search. Feb 19 2015, 12:01 PM

@FriedhelmW: Did you reproduce this problem on a local MediaWiki instance with its default search backend, or why did you add the project "MediaWiki-Search" to this task? Clarifying comment welcome. :)

I have now tested on another wiki, and MediaWiki-Search is not affected. Thank you for clarifying!

FriedhelmW mentioned this in T90089: ~"Maße" should not match "Masse". Feb 20 2015, 6:11 AM

This sounds like a Unicode normalization issue. Some details can be found on the Elasticsearch site. I thought we were using [[https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Maintenance/AnalysisConfigBuilder.php#L229-L234|nfkc]] normalization, which as I understand it should not normalize ß to ss, but that appears to be what is happening here and in T90089.

When using quotes, no normalization or stemming must be done (except for converting to lower case). Of course, there needs to be an unnormalized index, too.

bd808 merged a task: T90089: ~"Maße" should not match "Masse". Feb 25 2015, 7:34 PM

• Manybubbles triaged this task as Medium priority. Feb 25 2015, 7:39 PM

bd808 lowered the priority of this task from Medium to Low. Feb 25 2015, 7:43 PM

• Manybubbles mentioned this in T87112: Wrong stemming in German. Feb 25 2015, 7:52 PM

• Deskana moved this task from Inbox to Advanced functionality and syntax on the CirrusSearch board. Jan 12 2016, 9:13 PM

Restricted Application added a project: Discovery-ARCHIVED. Jan 12 2016, 9:13 PM

• Deskana moved this task from Needs triage to Search on the Discovery-ARCHIVED board. Jan 16 2016, 1:37 AM

• Deskana moved this task from Advanced functionality and syntax to Multilingual and cross-project on the CirrusSearch board. Feb 12 2016, 11:26 PM

MGChecker subscribed. Jul 10 2016, 2:32 PM

Restricted Application added a project: Discovery-Search. Jul 10 2016, 2:32 PM

Restricted Application added a subscriber: Luke081515.

We'd need to update the custom analysis chain to not normalize this.

When using quotes in CirrusSearch, the exact same fields (specific analyzed versions of the text) in Elasticsearch are queried. Without moving the parsing into PHP, we don't really have control over how that happens.

Aklapper mentioned this in T147636: German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions. Oct 7 2016, 1:01 PM

Filed https://github.com/elastic/elasticsearch/pull/20814
I'm not sure it's the right approach for us, but it will give us a chance to blacklist some chars, which is not currently possible.
The reason ß is folded to ss is that nfkc_cf is a case-folding technique designed to map words to a unique form regardless of the input:
DASS => dass
daß => dass

The question is: should we track a per-language list of chars to exclude from nfkc_cf normalization, or should we simply disable nfkc_cf and use a naive lower-casing approach for the plain field (replace nfkc_cf with [ nfkc, lowercase ] for plain)?
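To make the difference between the two options concrete, here is a minimal Python sketch. It is not the ICU filter Elasticsearch uses: `fold_nfkc_cf` only approximates nfkc_cf with NFKC plus Python's full case folding, and `fold_nfkc_lower` stands in for [ nfkc, lowercase ]. Both function names are made up for illustration.

```python
import unicodedata


def fold_nfkc_cf(text: str) -> str:
    """Approximate nfkc_cf: NFKC-normalize, apply full case folding, re-normalize."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def fold_nfkc_lower(text: str) -> str:
    """Stand-in for [ nfkc, lowercase ]: NFKC-normalize, then simple lower-casing."""
    return unicodedata.normalize("NFKC", text).lower()


for word in ("DASS", "daß", "Maße", "Masse"):
    print(f"{word!r}: nfkc_cf-like={fold_nfkc_cf(word)!r}, "
          f"nfkc+lowercase={fold_nfkc_lower(word)!r}")
```

With case folding, daß and dass (and Maße and Masse) collapse to the same token; with plain lower-casing they stay distinct, which is the behaviour requested for quoted searches.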

@dcausse's patch above was accepted about a week ago, so the ability to exclude certain characters from other kinds of normalization will be improved in Elasticsearch 6, but that's a little far off. I don't know if it's possible to unpack the German language analyzer and/or mess with the German config to get the desired result in the meantime.
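For discussion, a rough sketch of what the two options from the previous comment might look like as Elasticsearch analysis settings, written as a Python dict so it can be dumped to JSON. Everything here is an assumption rather than the actual CirrusSearch config: the filter and analyzer names are hypothetical, the `standard` tokenizer is a guess, and the `unicode_set_filter` parameter on `icu_normalizer` is the spelling I believe newer releases use (it may differ by Elasticsearch version).

```python
import json

# Option A: keep nfkc_cf but exempt ß from it (per-language exclusion list).
# Option B: drop case folding; plain NFKC followed by simple lower-casing.
# All filter/analyzer names below are hypothetical.
plain_analysis_sketch = {
    "filter": {
        "nfkc_cf_except_eszett": {
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            # Assumed parameter: a UnicodeSet of characters the filter may touch.
            "unicode_set_filter": "[^ß]",
        },
        "nfkc_only": {
            "type": "icu_normalizer",
            "name": "nfkc",
        },
    },
    "analyzer": {
        "plain_option_a": {
            "type": "custom",
            "tokenizer": "standard",  # assumption, not taken from CirrusSearch
            "filter": ["nfkc_cf_except_eszett"],
        },
        "plain_option_b": {
            "type": "custom",
            "tokenizer": "standard",  # assumption, not taken from CirrusSearch
            "filter": ["nfkc_only", "lowercase"],
        },
    },
}

print(json.dumps({"settings": {"analysis": plain_analysis_sketch}},
                 ensure_ascii=False, indent=2))
```

Option A needs a maintained character list per language; option B is simpler but gives up full case folding for the plain field everywhere.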

It's not clear to me how we would ideally want to treat ß. I know there are potential complexities related to the 1996 spelling reform, Swiss and Austrian variants, and words that have both ß and ss in different forms of the word (like beißen / bissen).

Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?

Also, as an expensive workaround to the problem of searching for daß but not dass you can use insource. The simplest version would be insource:/[dD]aß/ , which would match daß or Daß in any part of a word. You could try to do your own word boundary detection with something like insource:/[ .,;:'"!?][Dd]aß[ .,;:'"!?]/ , but Erik may come and give you a stern talking to for overtaxing the Elastic cluster.
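To illustrate what the two patterns do and do not catch, here is a small sketch using Python's `re` module as a stand-in. insource: uses Lucene regular expressions, whose syntax and matching rules differ from Python's, so treat this only as an approximation; the sample sentences are made up, and "xdaßy" is a nonsense token.

```python
import re

# Simplest version: matches daß / Daß anywhere, including inside longer words.
SIMPLE = re.compile(r"[dD]aß")
# Hand-rolled word boundaries: requires a listed character on both sides.
BOUNDED = re.compile(r"""[ .,;:'"!?][Dd]aß[ .,;:'"!?]""")

samples = [
    "Er sagte, daß es regnet.",   # old spelling, mid-sentence
    "Er sagte, dass es regnet.",  # reformed spelling
    "Daß er kommt, ist sicher.",  # old spelling at the very start of the text
    "Ein Unsinnswort: xdaßy.",    # daß embedded in a longer token
]

for text in samples:
    simple_hit = SIMPLE.search(text) is not None
    bounded_hit = BOUNDED.search(text) is not None
    print(f"{text!r}: simple={simple_hit}, bounded={bounded_hit}")
```

Neither pattern matches dass. The bounded pattern also misses Daß at the very start of the text, because there is no preceding character for the first character class to match.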

> Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?

Yes, because quoted means exact match.

TJones mentioned this in T104814: Appropriately ignore diacritics for German-language wikis. Jun 27 2017, 9:38 PM

TJones moved this task from This Quarter to Tech Debt/Misc on the Discovery-Search board. Oct 24 2017, 5:31 PM

Aklapper merged a task: T182447: Search in "" does not distinguish between "ss" and "ß". Dec 8 2017, 6:04 PM

Aklapper mentioned this in T182447: Search in "" does not distinguish between "ss" and "ß".

Aklapper added a subscriber: JStrodt_WMDE.

Lea_WMDE mentioned this in T182856: Add current issues to "exactly this text" helptext. Feb 15 2018, 4:09 PM

A related request was made on-wiki by a friendly IP editor: https://www.mediawiki.org/wiki/Topic:Updgir90a9m8tij9

debt moved this task from Tech Debt/Misc to Language Stuff on the Discovery-Search board. Jan 29 2019, 6:36 PM

TJones mentioned this in T226812: de.wikipedia: search for "Bedusz" does not find "Będusz". Jul 8 2019, 8:47 PM

CKoerner_WMF unsubscribed. Dec 12 2019, 4:29 PM

TJones raised the priority of this task from Low to Medium. Aug 27 2020, 8:01 PM

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. Mar 17 2021, 9:09 PM