Page MenuHomePhabricator

German ß (sharp s, eszett) triggers confusing behavior in insource: regular expressions
Closed, ResolvedPublic

Description

I found this on accident while discussing at https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Suche_nach_.C3.9F.

I have no idea why. What does the space do? I tried other combinations of characters, e.g. [ a], but in most cases this does what I expect. There is something very, very special about the German "ß".


See Also: T87136: ~"daß" should not match "dass"

Event Timeline

Probably because the lowercase filter (icu nfkc_cf) will fold ß into ss. It breaks the generated trigram into (at index time):

  • auss
  • usse
  • sser

and are no longer trigrams.

For insource:/au[ ß]er/ at (query time) it's probable that the generated trigrams will be:
(au_ | auß)
(u_e | u_ß)
(_er | ßer)

Thus the accelerated trigram query will certainly match on au_ | u_e | _er
Then the second pass (a classic regex) will match properly.

For insource:/außer/ I think that we generate
auß
uße
ßer

But I suspect that we do no not run any analysis on top of these and they we will never match.
I'll double check, I think the fix would be to run the same filters that we use at index time.

debt triaged this task as Medium priority.Oct 11 2016, 5:27 PM
debt moved this task from needs triage to This Quarter on the Discovery-Search board.

Confirmed, the problem is due to inconsistencies in the way we apply lowercasing/casefolding algorithm in the regex.
Basically we use too many different methods to lowercase content and search query in the regex engine:

  • at index time we run ICU case_folding on top of trigrams (ß => ss)
  • at query time we run String.toLowerCase() with locale set to the wiki language
  • at recheck time we do some optimizations and apply different techniques according to the language

I think it's too complex and we should keep only a naive lowercase method (code point by code point with Character.toLowerCase())
We then just need to fold few chars that are ambiguous:

  • U+03BC <= U+00B5 (μ <= µ)
  • U+0069 <= U+0131 (i <= ı)
  • U+0073 <= U+017F (s <= ſ)
  • U+03B9 <= U+0345 (ι <= ͅ)
  • U+03C3 <= U+03C2 (σ <= ς)
  • U+03B2 <= U+03D0 (β <= ϐ)
  • U+03B8 <= U+03D1 (θ <= ϑ)
  • U+03C6 <= U+03D5 (φ <= ϕ)
  • U+03C0 <= U+03D6 (π <= ϖ)
  • U+03BA <= U+03F0 (κ <= ϰ)
  • U+03C1 <= U+03F1 (ρ <= ϱ)
  • U+03B5 <= U+03F5 (ε <= ϵ)
  • U+1E61 <= U+1E9B (ṡ <= ẛ)
  • U+03B9 <= U+1FBE (ι <= ι)

There are ambiguous char where when we lower case the result of Character.toUpperCase() we obtain a different codePoint:

int codePoint = 0x0131;
assert Character.toLowerCase(Character.toUpperCase(codePoint)) != codePoint;

This is to match the behavior of java Pattern with Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE flags (code)

This is tough but I think we should keep things simple and do not try to do complex (locale aware) case folding, in short I'd suggest to support only Simple case insensitive matching described here.

Thanks a lot for looking into this! Very much appreciated!

Change 342241 had a related patch set uploaded (by DCausse):
[search/extra] Regex: Add a post analysis after trigram extraction

https://gerrit.wikimedia.org/r/342241

This is progressing. A new version of the patch was uploaded and is in review. No blockers.

Change 342241 merged by jenkins-bot:
[search/extra] Regex: Add a post analysis after trigram extraction

https://gerrit.wikimedia.org/r/342241

Thanks for the help.

Currently GrepCode is down for last 4-5 months and the code url you have shared : http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/regex/Pattern.java#3344

does not work anymore

I have the updated link for the source code : Request you to use this one.
https://zgrepcode.com/java/oracle/jdk-8u181/java/util/regex/pattern.java

Regards