Quotes turn on exact term matches. But searching for ~"daß" also finds pages containing "dass" on de.wikipedia. Compare this to the behaviour of Google which finds 28.200 "daß" and 1.630.000 "dass" on site:de.wikipedia.org.
Mentioned In
- T226812: de.wikipedia: search for "Bedusz" does not find "Będusz"
- T182856: Add current issues to "exactly this text" helptext
- T182447: Search in "" does not distinguish between "ss" and "ß"
- T104814: Appropriately ignore diacritics for German-language wikis
- T147636: German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions
- T87112: Wrong stemming in German
- T90089: ~"Maße" should not match "Masse"
Mentioned Here
- T90089: ~"Maße" should not match "Masse"
This sounds like a Unicode normalization issue. Some details can be found on the Elasticsearch site. I thought we were using [[https://github.com/wikimedia/mediawiki-extensions-CirrusSearch/blob/master/includes/Maintenance/AnalysisConfigBuilder.php#L229-L234|nfkc]] normalization, which as far as I understand should not normalize ß to ss, but that appears to be what is happening here and in T90089.
When quotes are used in CirrusSearch, the exact same fields (specific analyzed versions of the text) are queried in Elasticsearch. Without moving the parsing into PHP, we don't really have control over how that happens.
Not sure it's the right approach for us, but it will give us a chance to blacklist some chars, which is not currently possible.
The reason ß is folded to ss is that nfkc_cf is a case-folding technique designed to map words to a unique form regardless of the input:
DASS => dass
daß => dass
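The folding above can be reproduced outside Elasticsearch. The following is a minimal sketch that uses Python's built-in Unicode support as a stand-in for the ICU normalizer; `fold_like_nfkc_cf` and `nfkc_only` are hypothetical helper names, and the first is only an approximation of nfkc_cf (it does not remove default-ignorable code points, for example).

```python
import unicodedata


def fold_like_nfkc_cf(term: str) -> str:
    """Approximate nfkc_cf: NFKC normalization combined with full case folding.

    This is an illustration only, not the ICU implementation.
    """
    normalized = unicodedata.normalize("NFKC", term)
    # str.casefold() applies full Unicode case folding, which maps ß to "ss".
    return unicodedata.normalize("NFKC", normalized.casefold())


def nfkc_only(term: str) -> str:
    """Plain NFKC: ß has no compatibility decomposition, so it is left alone."""
    return unicodedata.normalize("NFKC", term)


if __name__ == "__main__":
    for term in ("DASS", "daß", "Daß"):
        print(f"{term!r}: folded={fold_like_nfkc_cf(term)!r} nfkc={nfkc_only(term)!r}")
```

Under this sketch it is the case-folding step, not NFKC itself, that merges "daß" and "dass".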
The question is: should we track a per-language list of chars to exclude from nfkc_cf normalization, or should we simply disable nfkc_cf and use a naive lower-casing approach for the plain field (replace nfkc_cf with [ nfkc, lowercase ] for plain)?
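To make the two options concrete, here is a rough sketch in Python rather than in the actual analysis config. `GERMAN_KEEP`, `fold_with_exclusions` and `nfkc_then_lowercase` are hypothetical names, and Python's `casefold()` and `lower()` only approximate what the ICU and lowercase filters would do.

```python
import unicodedata

# Hypothetical per-language exclusion list: characters that must survive folding.
GERMAN_KEEP = frozenset("ß")


def fold_with_exclusions(term: str, keep: frozenset = GERMAN_KEEP) -> str:
    """Option 1: case-fold every character except those on the exclusion list."""
    normalized = unicodedata.normalize("NFKC", term)
    folded = "".join(ch if ch in keep else ch.casefold() for ch in normalized)
    return unicodedata.normalize("NFKC", folded)


def nfkc_then_lowercase(term: str) -> str:
    """Option 2: NFKC followed by naive lowercasing, which leaves ß untouched."""
    return unicodedata.normalize("NFKC", term).lower()


if __name__ == "__main__":
    for term in ("Maße", "Masse", "MASSE", "daß", "DASS"):
        print(term, fold_with_exclusions(term), nfkc_then_lowercase(term))
```

Both variants keep "Maße" and "Masse" apart. The difference is that the first still applies full case folding to every other character and needs a maintained list per language, while the second gives up case folding entirely.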
@dcausse's patch above was accepted about a week ago, so the ability to exclude certain characters from other kinds of normalization will be improved in Elasticsearch 6—but that's a little far off. I don't know if it's possible to unpack the German language analyzer and/or mess with the German config to get the desired result in the meantime.
It's not clear to me how we would ideally want to treat ß. I know there are potential complexities related to the 1996 spelling reform, Swiss and Austrian variants, and words that have both ß and ss in different forms of the word (like beißen / bissen).
Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)?
Also, as an expensive workaround to the problem of searching for daß but not dass you can use insource. The simplest version would be insource:/[dD]aß/ , which would match daß or Daß in any part of a word. You could try to do your own word boundary detection with something like insource:/[ .,;:'"!?][Dd]aß[ .,;:'"!?]/ , but Erik may come and give you a stern talking to for overtaxing the Elastic cluster.
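The effect of the two patterns can be tried locally before running them against the cluster. This sketch uses Python's `re` module purely as a stand-in: insource: uses a different regex dialect, so the results are only indicative, and the sample sentences are made up.

```python
import re

# Character class used as a crude word boundary, copied from the pattern above.
WORD_EDGE = r"""[ .,;:'"!?]"""

unbounded = re.compile(r"[dD]aß")
bounded = re.compile(WORD_EDGE + r"[Dd]aß" + WORD_EDGE)

samples = [
    "Ich weiß, daß es so ist.",  # daß as a separate word in mid-sentence
    "Xdaßy",                     # made-up token containing daß inside a word
    "Daß es so ist",             # Daß at the very start of the text
    "Ich weiß, dass es so ist.", # post-reform spelling, should never match
]

if __name__ == "__main__":
    for text in samples:
        print(repr(text), bool(unbounded.search(text)), bool(bounded.search(text)))
```

Note that the bounded pattern requires a character on both sides, so it misses a match at the very start or end of the text.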
Would it make sense to have ß normalize to ss for "regular" searches (i.e., in the stemmed "text" field), but remain "ß" for quoted searches (i.e., in the exact-match "plain" field)? Yes, because quoting means an exact match.