German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	thiemowmde
	Oct 7 2016, 12:40 PM

Description

I found this on accident while discussing at https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Suche_nach_.C3.9F.

[insource:/außer/](https://de.wikipedia.org/w/index.php?title=Spezial:Suche&search=insource%3A%2Fau%C3%9Fer%2F&fulltext=1&profile=) returns nothing.
[insource:/au[ß]er/](https://de.wikipedia.org/w/index.php?title=Spezial:Suche&search=insource%3A%2Fau%5B%C3%9F%5Der%2F&fulltext=1&profile=) returns nothing.
[insource:/au[ ß]er/](https://de.wikipedia.org/w/index.php?title=Spezial:Suche&search=insource%3A%2Fau%5B+%C3%9F%5Der%2F&fulltext=1&profile=) returns the expected results for "außer".

I have no idea why. What does the space do? I tried other combinations of characters, e.g. [ a], but in most cases this does what I expect. There is something very, very special about the German "ß".

Details

	Subject	Repo	Branch	Lines +/-
	Regex: Add a post analysis after trigram extraction	search/extra	master	+114 -15

Customize query in gerrit

Related Objects

Mentioned In: T160948: Deploy new versions of Wikimedia and other Elasticsearch plugins
Mentioned Here: T87136: ~"daß" should not match "dass"

Event Timeline

thiemowmde created this task.Oct 7 2016, 12:40 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 7 2016, 12:40 PM

dcausse subscribed.Oct 7 2016, 12:42 PM

Aklapper updated the task description. (Show Details)Oct 7 2016, 1:01 PM

Probably because the lowercase filter (icu nfkc_cf) will fold ß into ss. It breaks the generated trigram into (at index time):

auss
usse
sser

and are no longer trigrams.

For insource:/au[ ß]er/ at (query time) it's probable that the generated trigrams will be:
(au_ | auß)
(u_e | u_ß)
(_er | ßer)

Thus the accelerated trigram query will certainly match on au_ | u_e | _er
Then the second pass (a classic regex) will match properly.

For insource:/außer/ I think that we generate
auß
uße
ßer

But I suspect that we do no not run any analysis on top of these and they we will never match.
I'll double check, I think the fix would be to run the same filters that we use at index time.

bd808 unsubscribed.Oct 7 2016, 6:58 PM

Smalyshev added a subscriber: TJones.Oct 7 2016, 8:22 PM

Andim subscribed.Oct 8 2016, 12:56 PM

Boshomi subscribed.Oct 8 2016, 2:17 PM

debt triaged this task as Medium priority.Oct 11 2016, 5:27 PM

debt moved this task from needs triage to This Quarter on the Discovery-Search board.

Confirmed, the problem is due to inconsistencies in the way we apply lowercasing/casefolding algorithm in the regex.
Basically we use too many different methods to lowercase content and search query in the regex engine:

at index time we run ICU case_folding on top of trigrams (ß => ss)
at query time we run String.toLowerCase() with locale set to the wiki language
at recheck time we do some optimizations and apply different techniques according to the language

I think it's too complex and we should keep only a naive lowercase method (code point by code point with Character.toLowerCase())
We then just need to fold few chars that are ambiguous:

U+03BC <= U+00B5 (μ <= µ)
U+0069 <= U+0131 (i <= ı)
U+0073 <= U+017F (s <= ſ)
U+03B9 <= U+0345 (ι <= ͅ)
U+03C3 <= U+03C2 (σ <= ς)
U+03B2 <= U+03D0 (β <= ϐ)
U+03B8 <= U+03D1 (θ <= ϑ)
U+03C6 <= U+03D5 (φ <= ϕ)
U+03C0 <= U+03D6 (π <= ϖ)
U+03BA <= U+03F0 (κ <= ϰ)
U+03C1 <= U+03F1 (ρ <= ϱ)
U+03B5 <= U+03F5 (ε <= ϵ)
U+1E61 <= U+1E9B (ṡ <= ẛ)
U+03B9 <= U+1FBE (ι <= ι)

There are ambiguous char where when we lower case the result of Character.toUpperCase() we obtain a different codePoint:

int codePoint = 0x0131;
assert Character.toLowerCase(Character.toUpperCase(codePoint)) != codePoint;

This is to match the behavior of java Pattern with Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE flags (code)

This is tough but I think we should keep things simple and do not try to do complex (locale aware) case folding, in short I'd suggest to support only Simple case insensitive matching described here.

Thanks a lot for looking into this! Very much appreciated!

Change 342241 had a related patch set uploaded (by DCausse):
[search/extra] Regex: Add a post analysis after trigram extraction

https://gerrit.wikimedia.org/r/342241

gerritbot added a project: Patch-For-Review.Mar 10 2017, 5:08 PM

dcausse claimed this task.Mar 10 2017, 5:23 PM

dcausse edited projects, added Discovery-Search (Current work); removed Discovery-Search.

dcausse moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

This is progressing. A new version of the patch was uploaded and is in review. No blockers.

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board.Mar 20 2017, 8:33 PM

EBernhardson mentioned this in T160948: Deploy new versions of Wikimedia and other Elasticsearch plugins.

Change 342241 merged by jenkins-bot:
[search/extra] Regex: Add a post analysis after trigram extraction

https://gerrit.wikimedia.org/r/342241

• Deskana closed this task as Resolved.Mar 22 2017, 11:23 AM

Thanks for the help.

Currently GrepCode is down for last 4-5 months and the code url you have shared : http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/regex/Pattern.java#3344

does not work anymore

I have the updated link for the source code : Request you to use this one.
https://zgrepcode.com/java/oracle/jdk-8u181/java/util/regex/pattern.java

Regards

German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressionsClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions
Closed, ResolvedPublic
Actions