Expanding on the regex rewriting we did in T317599 we should also be able to support some simple escape sequences which will allow users to more easily search for multiline content. The inclusion of \u should allow users to search for any character that is inconvenient to directly include in the search itself.
Description
Details
| Title | Reference | Author | Source Branch | Dest Branch | |
|---|---|---|---|---|---|
| Bump plugins to 1.3.20-9 | repos/search-platform/cirrussearch-opensearch-image!18 | ebernhardson | work/ebernhardson/plugins-1.3.20-9 | main | |
| Update plugins for regex syntax | repos/search-platform/opensearch-plugins-deb!7 | ebernhardson | work/ebernhardson/regex-update | master |
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Resolved | EBernhardson | T403212 Support \r, \n, \t, and \uNNNN in insource and intitle queries | |||
| Resolved | RKemper | T403749 Install new wmf-opensearch-search-plugins package/roll-restart CirrusSearch clusters |
Event Timeline
Change #1182923 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[wmf-jvm-utils@master] regex: Support escape code expansion
(i'm guessing this isn't ready to announce yet given that the patch isn't currently merged?)
Change #1182923 merged by jenkins-bot:
[wmf-jvm-utils@master] regex: Support escape code expansion
Change #1184587 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[search/highlighter@master] Bump lucene-regex-rewriter version to 1.0.6
Change #1184588 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[search/extra@master] Bump lucene-regex-rewriter version to 1.0.6
Change #1184587 merged by jenkins-bot:
[search/highlighter@master] Bump lucene-regex-rewriter version to 1.0.6
Change #1184588 merged by jenkins-bot:
[search/extra@master] Bump lucene-regex-rewriter version to 1.0.6
ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/merge_requests/7
Update plugins for regex syntax
ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/opensearch-plugins-deb/-/merge_requests/7
Update plugins for regex syntax
Indeed it'll take another week or so. Once the subtask to roll-restart the search clusters is complete the functionality should be fully available.
ebernhardson opened https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/18
Bump plugins to 1.3.20-9
ebernhardson merged https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image/-/merge_requests/18
Bump plugins to 1.3.20-9
@EBernhardson Thanks for adding these characters to the docs!
Re the \uNNNN matching, do you think it's worth clarifying the term "surrogate pairs" in those docs? I ask as it took me a few minutes to work out that — to search for the equivalent of intitle:/💔/ — I needed to use intitle:/\uD83D\uDC94/ (instead of e.g. just using intitle:/\u1F494/). Or do you think that people using this escape character would generally be expected to know what this would be referring to? (genuine question)
We discussed in the team a bit before releasing this if \uHHHH was the best format, or if we should go with something else. The basic reasoning here was:
- JSON / javascript use the \uHHHH syntax, we took this to mean that it is a widely understood syntax and shouldn't be too hard for users to find related documentation / examples.
- PCRE2 (common regex library) also makes the \uHHHH syntax available, but only when compiled with specific flags. This is documented as being done because users were so familiar with/expected the JSON style notation
- The search engine accepts multi-byte utf8 characters natively. As shown, we can directly intitle:/💔/ without using the \u escaping. The intended use case for \u escaping is more for unprintable characters, but also includes things that are otherwise inconvenient to type directly.
I don't necessarily think people will know about surrogate pairs just because they are familiar with \u syntax. Surrogate pairs are a bit of a niche technical detail that i suspect many engineers have only a passing familiarity with, with less-technical users being even less familiar. I've added a link to the enwiki section of the UTF-16 encoding article about surrogate pairs, but I'm not sure how to otherwise describe them without going into significantly more detail than is appropriate there.
Support \r, \n, \t, and \uNNNN in insource and intitle queries
Open, Needs Triage
Why is this bug still open given the fix was announced 11 days ago?
Also I wonder what is the point with supporting CR AKA \r given that wiki AFAIK uses LF-text only.
Anyway, if this fix really works then I probably will be able to delete this
VGSTE3MP = "[^0-9A-Za-z\}\{ \|\=\[\]\'\#\.\:\*]{1}"
silly code from my bot.
@EBernhardson Fair enough. Thank you for the explanation, and for adding the link!
I'm guessing not for right now, but - for the future - I wonder if it's worth supporting the use of curly-braces with the \u escape code, in a way like \u{HHHHH} (that can be used with characters like \u{1F494})? I haven't looked into it deeply, but at first glance it seems like this is something that's supported by JavaScript (using \u{1F494}) & by PCRE2 (using \x{1F494}).
(I can create a separate task for it if you think it might be worth considering?)
I noticed that surrogate pairs can't be used inside a character class. For example, both insource:/😂/ and insource:/[😂]/ work, but insource:/[\uD83D\uDE02]/ returns nothing. Using it in character classes can be handy when searching for a range of Unicode code points. Shall I file a new ticket for this?
To A_smart_kitten: Surrogate code is an unavoidable peculiarity in systems using UTF-16 Unicode strings. That include Windows NT, Java, .NET and JavaScript (so JSON). Whether it should be exposed in the user interface is debatable. I have seen it in a document elsewhere that teaches users to do the encoding themselves: https://npp-user-manual.org/docs/searching/#match-by-character-code
Thanks for the report, indeed there is a problem with the way we are expanding the multi-byte sequences. I've put up a patch that will fix it, but it will typically be a few weeks between merging and deployment for the search plugins.
Change #1205170 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[wmf-jvm-utils@master] regex: Fix expansion of multibyte characters
@EBernhardson apologies for bumping this, but do you think it might be worth me filing a follow-up feature request for this, given the similar usage/support in other regex flavours/libraries?
Beyond the syntaxes used by other implementations, you can use UTS #18: Unicode Regular Expressions (http://www.unicode.org/reports/tr18/ §1.7 Code Points) to back your feature request. I am not sure how convincing it is.
Change #1206437 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[wmf-jvm-utils@master] regex: Support alternate \u{...} syntax
Change #1205170 merged by jenkins-bot:
[wmf-jvm-utils@master] regex: Fix expansion of multibyte characters
That looks nice, thanks! (Regarding the commit message,) I would not call the parameter to \u{N} as "UTF-32", because people write \u{1F602} instead of \u{0001F602} (suppose UTF-32BE is used). That thing should just be called "code point", irrespective of the encoding being used.
I wish those dreaded surrogate pairs can be hidden from the user interface with this new syntax in place. I admit that's not a valid reason to advocate for the removal of an existing feature.
Change #1206437 merged by jenkins-bot:
[wmf-jvm-utils@master] regex: Support alternate \u{...} syntax
Change #1207264 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[search/extra@master] regex: Bump lucene-regex-rewriter to 1.0.7
Change #1207266 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):
[search/highlighter@master] regex: Bump lucene-regex-rewriter to 1.0.7
Change #1207266 merged by jenkins-bot:
[search/highlighter@master] regex: Bump lucene-regex-rewriter to 1.0.7
Change #1207264 merged by jenkins-bot:
[search/extra@master] regex: Bump lucene-regex-rewriter to 1.0.7
@EBernhardson This seems to throw timeout reports, see this dewiki discussion: https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Zeit%C3%BCberschreitung_bei_Insource-Suche
Is it only that query, or is it also performing generally worse? We have documented that this form of query is expected to time out, particularly on wikis of decent size like dewiki. The suggested variation insource:/Dremel/ insource:dremel returns results in < 1s and should be equivalent. At a general level the changes made in this ticket, which is a pre-processing step that transforms the regex, doesn't look to make any change to this example query.
I can also see in our metrics expensive query usage is up last week, typically its ~1/sec but it's been hitting the limiter at ~10/sec. It looks like whoever was issuing those queries has stopped, but if it's an ongoing issue we can look closer into them and see if they can be moved into the Automated bucket which has separate limits from normal search. The "too many regular expression searches" error occurs when this bucket fills up with concurrent searches.
Is it only that query, or is it also performing generally worse?
Not only, this has a very short timeout too: https://de.wikipedia.org/wiki/Spezial:API-Spielwiese#action=query&format=json&list=search&pageids=1&formatversion=2&srsearch=insource%3A%2FDremel%2F&srnamespace=0
Hmm, that does seem likely. If we add a &cirrusDumpQuery to one of the searches we can see it has timeout: 15s, when indeed regex should get a longer timeout. Not sure yet what changed to cause that.