
~"daß" should not match "dass"
Closed, Resolved · Public

Description

Quotes turn on exact term matches. But searching for ~"daß" also finds pages containing "dass" on de.wikipedia. Compare this to the behaviour of Google which finds 28.200 "daß" and 1.630.000 "dass" on site:de.wikipedia.org.

Event Timeline

FriedhelmW raised the priority of this task from to Needs Triage.
FriedhelmW updated the task description.
FriedhelmW added a project: CirrusSearch.
FriedhelmW subscribed.

Another example: ~"Maße" and ~"Masse" (T90089). They have different meanings in German.

FriedhelmW renamed this task from ~"daß" matches "dass" to ~"daß" should not match "dass". Jan 23 2015, 7:23 AM
FriedhelmW set Security to None.

@FriedhelmW: Did you reproduce this problem on a local MediaWiki instance with its default search backend, or why did you add the project "MediaWiki-Search" to this task? Clarifying comment welcome. :)

I have now tested on another wiki, and MediaWiki-Search is not affected. Thank you for clarifying!

This sounds like a Unicode normalization issue. Some details can be found on the Elasticsearch site. I thought we were using [[https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Maintenance/AnalysisConfigBuilder.php#L229-L234|nfkc]] normalization, which in my understanding should not normalize ß to ss, but that appears to be what is happening here and in T90089.

When using quotes, no normalization or stemming must be done (except for converting to lower case). Of course, there needs to be an unnormalized index, too.

bd808 lowered the priority of this task from Medium to Low. Feb 25 2015, 7:43 PM
Restricted Application added a subscriber: Luke081515.
debt subscribed.

We'd need to update the custom analysis chain to not normalize this.

When using quotes in CirrusSearch, the exact same fields (specific analyzed versions of the text) in Elasticsearch are queried. Without moving the parsing into PHP, we don't really have control over how that happens.

Filed https://github.com/elastic/elasticsearch/pull/20814
Not sure it's the right approach for us, but it will give us a chance to blacklist some chars, which is not currently possible.
The reason ß is folded to ss is that nfkc_cf is a case-folding technique designed to group words into a unique form regardless of the input:
DASS => dass
daß => dass
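
The folding above can be reproduced outside Elasticsearch. The following is only a rough sketch: it assumes Python's `str.casefold()` combined with NFKC normalization is a close enough stand-in for the ICU `nfkc_cf` normalizer (the real one also handles some characters this does not), and the helper name `nfkc_casefold_approx` is made up for illustration.

```python
import unicodedata


def nfkc_casefold_approx(text: str) -> str:
    """Approximate nfkc_cf: NFKC-normalize, full case-fold, normalize again.

    Assumption: this is a stand-in for the ICU normalizer, not the
    actual Elasticsearch implementation.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


if __name__ == "__main__":
    for word in ("DASS", "daß", "Maße", "Masse"):
        print(word, "=>", nfkc_casefold_approx(word))
```

With this approximation, "daß" and "DASS" both come out as "dass", and "Maße" and "Masse" both come out as "masse", which matches the behaviour reported in this task and in T90089.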

The question is: should we track a per-language list of chars to exclude from nfkc_cf normalization, or should we simply disable nfkc_cf and use a naive lower-casing approach for the plain field (replace nfkc_cf with [ nfkc, lowercase ] for plain)?
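
To make the two options concrete, here is a minimal sketch in Python. It is an illustration only, not CirrusSearch or Elasticsearch code: both function names and the `keep` parameter are hypothetical, and `str.casefold()` is assumed to approximate the case-folding part of nfkc_cf.

```python
import unicodedata


def fold_with_exclusions(text: str, keep: frozenset = frozenset("ß")) -> str:
    """Option 1 (hypothetical): case-fold every character except those in `keep`."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch if ch in keep else ch.casefold() for ch in text)


def nfkc_then_lowercase(text: str) -> str:
    """Option 2 (hypothetical): NFKC normalization followed by plain lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


if __name__ == "__main__":
    for word in ("DASS", "Daß", "daß"):
        print(word, fold_with_exclusions(word), nfkc_then_lowercase(word))
```

For these inputs both options agree: "DASS" becomes "dass", while "Daß" and "daß" become "daß", so the two spellings stay distinct. The options differ on what else they leave unfolded, and option 1 needs the exclusion list to be maintained per language.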

@dcausse's patch above was accepted about a week ago, so the ability to exclude certain characters from other kinds of normalization will be improved in Elasticsearch 6, but that's a little far off. I don't know if it's possible to unpack the German language analyzer and/or mess with the German config to get the desired result in the meantime.

It's not clear to me how we would ideally want to treat ß. I know there are potential complexities related to the 1996 spelling reform, Swiss and Austrian variants, and words that have both ß and ss in different forms of the word (like beißen / bissen).

Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?
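
A toy model of that split, as a sketch only: it assumes single-word queries, whitespace tokenization, and no stemming, and it uses Python's `casefold()` and `lower()` as stand-ins for the analysis of the "text" and "plain" fields. All names (`text_key`, `plain_key`, `build_index`, `search`) are hypothetical.

```python
import unicodedata

PUNCTUATION = '.,;:!?"'


def text_key(token: str) -> str:
    """Stand-in for the "text" field: ß is folded to ss (stemming omitted)."""
    return unicodedata.normalize("NFKC", token).casefold()


def plain_key(token: str) -> str:
    """Stand-in for the "plain" field: lowercased only, ß is kept."""
    return unicodedata.normalize("NFKC", token).lower()


def build_index(docs: dict, key) -> dict:
    """Map each normalized token to the set of document ids containing it."""
    index = {}
    for doc_id, body in docs.items():
        for token in body.split():
            index.setdefault(key(token.strip(PUNCTUATION)), set()).add(doc_id)
    return index


def search(query: str, text_index: dict, plain_index: dict) -> set:
    """Quoted queries go to the plain index, unquoted ones to the text index."""
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        return plain_index.get(plain_key(query[1:-1]), set())
    return text_index.get(text_key(query), set())


DOCS = {1: "Er sagte, daß es regnet.", 2: "Sie sagte, dass es schneit."}
TEXT_INDEX = build_index(DOCS, text_key)
PLAIN_INDEX = build_index(DOCS, plain_key)

if __name__ == "__main__":
    print(search("daß", TEXT_INDEX, PLAIN_INDEX))    # both documents
    print(search('"daß"', TEXT_INDEX, PLAIN_INDEX))  # only document 1
```

In this model an unquoted search for daß still finds both spellings, while the quoted search only finds the document that actually contains "daß".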

Also, as an expensive workaround to the problem of searching for daß but not dass you can use insource. The simplest version would be insource:/[dD]aß/ , which would match daß or Daß in any part of a word. You could try to do your own word boundary detection with something like insource:/[ .,;:'"!?][Dd]aß[ .,;:'"!?]/ , but Erik may come and give you a stern talking to for overtaxing the Elastic cluster.
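
The difference between the two patterns can be checked locally. This sketch uses Python's `re` module as a stand-in for the regex engine behind insource (an assumption; the two engines do not share the same syntax in general, although these two patterns are simple enough to carry over). The names `ANYWHERE`, `BOUNDED`, and `matches` are made up for the example.

```python
import re

# insource:/[dD]aß/ matches inside any word.
ANYWHERE = re.compile(r"[dD]aß")

# insource:/[ .,;:'"!?][Dd]aß[ .,;:'"!?]/ requires a delimiter on both sides.
BOUNDED = re.compile(r"""[ .,;:'"!?][Dd]aß[ .,;:'"!?]""")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for sample in ("Er sagte, daß es regnet.", "Er sagte, dass es regnet.",
                   "xdaßy", "daß es regnet"):
        print(repr(sample), matches(ANYWHERE, sample), matches(BOUNDED, sample))
```

Note that the bounded pattern needs a delimiter character on both sides, so in this sketch it misses a "daß" at the very start or end of the text.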

> Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?

Yes, because a quoted search means an exact match.

TJones raised the priority of this task from Low to Medium. Aug 27 2020, 8:01 PM

Change 692700 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers

https://gerrit.wikimedia.org/r/692700

This is getting fixed as a side effect of unpacking the German analyzer in T281379.

Change 692700 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack German, Portuguese, and Dutch Elasticsearch Analyzers

https://gerrit.wikimedia.org/r/692700