
~"daß" should not match "dass"
Open, Low · Public

Description

Quotes turn on exact term matching, but on de.wikipedia a search for ~"daß" also finds pages containing "dass". Compare this to the behaviour of Google, which finds 28,200 results for "daß" and 1,630,000 for "dass" with site:de.wikipedia.org.

Event Timeline

FriedhelmW raised the priority of this task from to Needs Triage.
FriedhelmW updated the task description.
FriedhelmW added a project: CirrusSearch.
FriedhelmW added a subscriber: FriedhelmW.
Restricted Application added a subscriber: Aklapper. · Jan 18 2015, 8:42 AM
FriedhelmW added a comment. · Edited · Jan 22 2015, 5:55 AM

Another example: ~"Maße" and ~"Masse" (T90089). They have different meanings in German.

FriedhelmW renamed this task from ~"daß" matches "dass" to ~"daß" should not match "dass". · Jan 23 2015, 7:23 AM
FriedhelmW set Security to None.

@FriedhelmW: Did you reproduce this problem on a local MediaWiki instance with its default search backend, or why did you add the project "MediaWiki-Search" to this task? Clarifying comment welcome. :)

I have now tested on another wiki, and MediaWiki-Search is not affected. Thank you for clarifying!

bd808 added a subscriber: bd808. · Feb 22 2015, 11:38 PM

This sounds like a Unicode normalization issue. Some details can be found on the Elasticsearch site. I thought we were using nfkc normalization, which as I understand it should not normalize ß to ss, but that appears to be what is happening here and in T90089.
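To illustrate the distinction, here is a minimal sketch using Python's standard library rather than the ICU filters that Elasticsearch actually uses, so it only approximates the behaviour on the cluster. It shows that NFKC normalization on its own leaves ß alone, while full Unicode case folding expands it to ss.

```python
import unicodedata

word = "daß"

# NFKC is a compatibility normalization; ß has no compatibility
# decomposition, so it passes through unchanged.
nfkc_only = unicodedata.normalize("NFKC", word)

# Full case folding (str.casefold) maps ß to "ss". Combining it with NFKC
# is used here as a stand-in for ICU's nfkc_cf; that is an assumption.
nfkc_casefolded = unicodedata.normalize("NFKC", word.casefold())

# Simple lower-casing does not touch ß.
lowered = word.lower()

print(nfkc_only, nfkc_casefolded, lowered)
```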

FriedhelmW added a comment. · Edited · Feb 23 2015, 6:49 AM

When quotes are used, normalization and stemming must not be applied (except for converting to lower case). Of course, this requires an unnormalized index as well.

Manybubbles triaged this task as Normal priority. · Feb 25 2015, 7:39 PM
bd808 lowered the priority of this task from Normal to Low. · Feb 25 2015, 7:43 PM
Restricted Application added a project: Discovery. · Jan 12 2016, 9:13 PM
Deskana moved this task from Needs triage to Search on the Discovery board. · Jan 16 2016, 1:37 AM
Restricted Application added a project: Discovery-Search. · Jul 10 2016, 2:32 PM
Restricted Application added a subscriber: Luke081515.
debt added a subscriber: debt.

We'd need to update the custom analysis chain to not normalize this.

When using quotes in CirrusSearch, the exact same fields (specific analyzed versions of the text) in Elasticsearch are queried. Without moving the parsing into PHP, we don't really have control over how that happens.

Filed https://github.com/elastic/elasticsearch/pull/20814
Not sure it's the right approach for us, but it will give us a chance to blacklist some chars, which is not currently possible.
The reason ß is folded to ss is that nfkc_cf is a case folding technique designed to group words into a unique form regardless of the input:
DASS => dass
daß => dass

The question is: should we track a per-language list of chars to exclude from nfkc_cf normalization, or should we simply disable nfkc_cf and use a naive lower-casing approach for the plain field (replace nfkc_cf with [nfkc, lowercase] for plain)?
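The two options can be compared with a small sketch. It is written in plain Python, not as an Elasticsearch analysis config, and the function names and the `keep` parameter are hypothetical; `str.casefold` plus NFKC is assumed to be a close enough stand-in for nfkc_cf for this purpose.

```python
import unicodedata


def fold_with_exclusions(token: str, keep: frozenset = frozenset("ß")) -> str:
    """Option 1: case-fold everything except a per-language exclusion list."""
    normalized = unicodedata.normalize("NFKC", token)
    # Characters in `keep` are passed through; all others are case-folded.
    folded = "".join(ch if ch in keep else ch.casefold() for ch in normalized)
    return unicodedata.normalize("NFKC", folded)


def nfkc_lowercase(token: str) -> str:
    """Option 2: drop case folding and use NFKC followed by lower-casing."""
    return unicodedata.normalize("NFKC", token).lower()


for token in ("Daß", "DASS", "Maße", "Masse"):
    print(token, fold_with_exclusions(token), nfkc_lowercase(token))
```

For these four tokens both options give the same result: ß is preserved and everything else is lower-cased. They would only diverge on characters that case folding treats differently from lower-casing and that are not on the exclusion list.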

TJones added a subscriber: TJones. · Jun 23 2017, 10:33 PM

@dcausse's patch above was accepted about a week ago, so the ability to exclude certain characters from other kinds of normalization will be improved in Elasticsearch 6—but that's a little far off. I don't know if it's possible to unpack the German language analyzer and/or mess with the German config to get the desired result in the meantime.

It's not clear to me how we would ideally want to treat ß. I know there are potential complexities related to the 1996 spelling reform, Swiss and Austrian variants, and words that have both ß and ss in different forms of the word (like beißen / bissen).

Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?
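As a rough illustration of that split, the toy index below keeps a folded "text" field and an unfolded "plain" field and routes quoted queries to the latter. Everything in it is hypothetical: the documents, the function names, and the single-token query handling. The real "text" field also applies stemming, which this sketch leaves out.

```python
import unicodedata
from collections import defaultdict


def text_norm(token: str) -> str:
    # Folded form for regular searches: ß becomes ss.
    return unicodedata.normalize("NFKC", token.casefold())


def plain_norm(token: str) -> str:
    # Exact-match form for quoted searches: only lower-cased, ß is kept.
    return unicodedata.normalize("NFKC", token).lower()


def build_index(docs: dict, norm) -> dict:
    index = defaultdict(set)
    for doc_id, body in docs.items():
        for token in body.split():
            index[norm(token)].add(doc_id)
    return index


docs = {1: "Ich weiß daß es regnet", 2: "Ich weiss dass es regnet"}
text_index = build_index(docs, text_norm)
plain_index = build_index(docs, plain_norm)


def search(query: str) -> list:
    # Quoted single-token queries go to the plain field, others to text.
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        return sorted(plain_index.get(plain_norm(query[1:-1]), set()))
    return sorted(text_index.get(text_norm(query), set()))


print(search('daß'), search('"daß"'), search('"dass"'))
```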

Also, as an expensive workaround to the problem of searching for daß but not dass you can use insource. The simplest version would be insource:/[dD]aß/ , which would match daß or Daß in any part of a word. You could try to do your own word boundary detection with something like insource:/[ .,;:'"!?][Dd]aß[ .,;:'"!?]/ , but Erik may come and give you a stern talking to for overtaxing the Elastic cluster.
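The difference between the two patterns can be checked locally with a short sketch. It uses Python's `re` module, whereas insource uses Lucene regular expression syntax and runs against the page's wikitext; the patterns here consist only of literals and character classes, which I would expect to behave the same way in both, but that is an assumption. The sample sentences are made up.

```python
import re

# Same patterns as in the insource examples above.
anywhere = re.compile(r"[dD]aß")
bounded = re.compile(r"""[ .,;:'"!?][Dd]aß[ .,;:'"!?]""")

samples = [
    "Er sagt, daß es regnet.",   # old spelling, surrounded by delimiters
    "Er sagt, dass es regnet.",  # new spelling, should never match
    "Es regnet, sodaß wir gehen.",  # "daß" inside a longer word
    "Daß es regnet, ist klar.",  # old spelling at the very start of the text
]

for text in samples:
    print(bool(anywhere.search(text)), bool(bounded.search(text)), text)
```

The last sample shows a limitation of the hand-rolled boundary detection: because the pattern requires a delimiter character on both sides, a match at the very start or end of the text is missed.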

"Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?"

Yes, because quoted means exact match.

A related request was made on-wiki by a friendly IP editor: https://www.mediawiki.org/wiki/Topic:Updgir90a9m8tij9