
Timeouts searching for terms and regular expressions too low
Closed, Resolved · Public · BUG REPORT

Description

Due to the timeout problems remaining after closing the parent task T403212 and task T410007, allow me to open a subtask, forking the following bug by quoting the last conversation there:

Is it only that query, or is it also performing generally worse? We have documented that this form of query is expected to time out, particularly on wikis of decent size like dewiki. The suggested variation insource:/Dremel/ insource:dremel returns results in < 1s and should be equivalent. At a general level, the changes made in this ticket, a pre-processing step that transforms the regex, don't appear to affect this example query.

I can also see in our metrics that expensive query usage is up this last week; typically it's ~1/sec, but it's been hitting the limiter at ~10/sec. It looks like whoever was issuing those queries has stopped, but if it's an ongoing issue we can look closer into them and see if they can be moved into the Automated bucket, which has separate limits from normal search. The "too many regular expression searches" error occurs when this bucket fills up with concurrent searches.

I think this problem is somewhat related to T410007. Maybe @dcausse can help here again?

Hmm, that does seem likely. If we add a &cirrusDumpQuery to one of the searches, we can see it has timeout: 15s, when regex searches should indeed get a longer timeout. Not sure yet what changed to cause that.
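(For illustration, cirrusDumpQuery can be appended to an ordinary search URL to see the backend query, including its timeout, instead of running the search. A minimal sketch of building such a URL; the wiki host and search string here are just placeholders:)

```python
from urllib.parse import urlencode

# Placeholder wiki and query; any CirrusSearch-backed wiki works the same way.
params = {
    "search": "insource:/Dremel/",
    # Empty-valued debug parameter: CirrusSearch dumps the Elasticsearch
    # query it would send (including the shard timeout) instead of running it.
    "cirrusDumpQuery": "",
}
url = "https://de.wikipedia.org/w/index.php?" + urlencode(params)
print(url)
```

Fetching that URL returns the JSON query description, where the timeout value mentioned above can be inspected.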

Let's find a solution to raise the timeout, so that at least Cirrus and API searches can work properly in the way that we expect. I know dewiki is not the smallest wiki in the world, but it must be possible to search for phrases and regular expressions in order to work with it properly.

Thank you very much in advance

Event Timeline

This is not a subtask; please review project tags and subscribers in such cases - thanks a lot!

@Pppery : You removed MediaWiki-Action-API, but I guess API action=query&list=search is a MediaWiki API action. This is mentioned in the task description too.

MediaWiki-Action-API description says:

In general however, tasks about specific APIs should instead be filed under the relevant component (e.g. use MediaWiki-Recent-changes for issues with the recentchanges API).

@taavi: Okay, I guess you mean MediaWiki-Search as API component.

We have documented (https://www.mediawiki.org/wiki/Help:CirrusSearch#Regular_expression_searches) that this form of query is expected to time out, particularly on wikis of decent size like dewiki. The suggested variation insource:/Dremel/ insource:dremel returns results in < 1s and should be equivalent.

Indeed it is documented. But how to search for

||}}{{de|

on a wiki where "de" is a common preposition or pronoun? The stupid indexed search will ignore everything except the "de", and since "de" is on every page, it will not narrow the search "domain" at all.

Maybe wikis would need a simple plain-text search, without all the flawed indexed "magic", and without the resource-hoggy regex magic.

CirrusSearch has to be careful when specifying timeouts of a regex query.
Regex queries are particularly costly and may cause a lot of stress on the servers if not properly protected.
The 15s timeout has been set up for this, to ensure that the search backend returns before any other timeouts are applied; otherwise a costly query might continue to run outside of the concurrency protection (T152895).
Unless you noticed that the regex got a lot slower recently and that more queries are timing out, I think it is safer to keep the 15s internal timeout.

Maybe wikis would need a simple plain-text search, without all the flawed indexed "magic", and without the resource-hoggy regex magic.

A search index has to work over an inverted index, which requires the text to be tokenized. By simple plain-text search I guess that you mean something like Ctrl-F in a text application or grep on the command line. You can do this by downloading the dumps and running grep on your machine, but I suspect you'll notice that it is very slow; way too slow for it to be a feature we offer from the wiki websites.
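To make the cost concrete, here is roughly what such a grep amounts to. This is only an illustrative sketch; the dump filename is hypothetical, and a real pages-articles dump is many gigabytes of compressed XML:

```python
import bz2

def grep_lines(lines, needle, limit=10):
    """Linear scan for a literal substring: every line must be read,
    which is exactly the cost an inverted index exists to avoid."""
    hits = []
    for lineno, line in enumerate(lines, 1):
        if needle in line:
            hits.append((lineno, line.rstrip("\n")))
            if len(hits) >= limit:
                break
    return hits

def grep_dump(path, needle):
    # Hypothetical dump filename; real dumps live at dumps.wikimedia.org.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
        return grep_lines(f, needle)

# grep_dump("dewiki-latest-pages-articles.xml.bz2", "||}}{{de|")
```

Streaming the whole dump like this takes time proportional to the wiki's size on every single query, which is why it cannot be offered as a live site feature.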

Unless you noticed that the regex got a lot slower recently and that more queries are timing out, I think it is safer to keep the 15s internal timeout.

But Wikipedia's content keeps growing, and at some point 15s will not be safe any more. What then? Shall we decrease the timeout more and more? I guess we need a search engine that is able to handle all the big content properly if CirrusSearch does not.

I think there's a misunderstanding about what this 15s timeout is for: we won't decrease it because the content grows, but we also can't increase it (without proper verification) to account for the content growth. To limit the impact of the content growth we generally reshard the index, but this is a different issue.

@doctaxon did you identify queries that were successfully running before but are now failing constantly?

If we want to increase the 15s internal timeout we need to make sure that we won't leak compute threads outside the poolcounter protection. We may run some analysis on the backend query logs to see the variance we have between the backend response time and this 15s shard timeout. If this variance is still high we don't have much room to increase it.

Indeed it is documented. But how to search for

||}}{{de|

on a wiki where "de" is a common preposition or pronoun? The stupid indexed search will ignore everything except the "de", and since "de" is on every page, it will not narrow the search "domain" at all.

Maybe wikis would need a simple plain-text search, without all the flawed indexed "magic", and without the resource-hoggy regex magic.

As an aside, a properly escaped insource regex search will find this quickly if the exact string (including the symbols) is not super common, regardless of how the text analysis would treat the text: insource:/\|\|\}\}\{\{de/. All the extra backslashes escape the characters that are regex syntax. I dropped the final pipe because none of the wikis I tested on got any results with it, but it shouldn't make any difference.

The regex magic includes a trigram index of the article text to accelerate the search. From this query it can extract the trigrams ||}, |}}, }}{, }{{, {{d, and {de and it limits itself to articles with those trigrams before going back to the more expensive regex matching, which is basically grep. It can often extract trigrams from simple regex syntax like /ab?cd/ or /a[bc]d/, but not always from really complex regexes, or regexes that don't match at least three characters in a row.
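For the literal case described above, the trigram extraction is easy to sketch. (The real implementation also expands regex syntax; this toy version only splits a literal string:)

```python
def trigrams(literal):
    """Overlapping 3-grams of a literal string, i.e. the index terms a
    trigram index can use to narrow the candidate pages before grepping."""
    return [literal[i:i + 3] for i in range(len(literal) - 2)]

print(trigrams("||}}{{de"))
# ['||}', '|}}', '}}{', '}{{', '{{d', '{de']
```

A literal shorter than three characters yields no trigrams at all, which is why such regexes fall back to the expensive full scan.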

On enwiki: insource:/\|\|\}\}\{\{de/. It responds super quick on French, Spanish, and German Wikipedias, too. (Though Spanish and German get zero results.)

If you were actually looking for the string that matches (with "dead link" as on enwiki), you could search for this: insource:/\|\|\}\}\{\{dead/ "dead" to possibly limit the regex domain even more.

Crafting good regex queries that don't time out is an art; it isn't always possible, and some complex searches will always time out.

I have the following problematic patterns:

insource:/\<\<[a-z\/]/i which is the search for fixes like here: https://de.wikipedia.org/w/index.php?title=Kronowo_%28Barczewo%29&diff=261268272&oldid=260588753

insource:"ISBN" insource:/isbn[ 0-9]+\–/i. Note that the "–" in the regex is an EN DASH (U+2013); on deWp we have this magic keyword ISBN, which accepts only the ASCII "-" HYPHEN-MINUS (U+002D).

We're off-topic for this ticket, but I'll reply to these here. @dcausse, do you have the link for the on-wiki help with regex searches you mentioned?

insource:/\<\<[a-z\/]/i which is the search for fixes like here: https://de.wikipedia.org/w/index.php?title=Kronowo_%28Barczewo%29&diff=261268272&oldid=260588753

There is code to expand simpler regexes into trigrams to use a trigram index to accelerate searches. I think [a-z\/] is too complex (it has too many options). You could break it into pieces: insource:/\<\<[abc]/ (link) is able to extract the trigrams <<a, <<b, and <<c and returns immediately. On the other hand, insource:/\<\<[abcde]/ times out and may get partial results. It's tedious, but doable. (I also dropped the /i case-insensitive option because it creates more trigrams which you probably mostly don't need.)

If this is an ongoing effort, you can avoid previously reviewed examples using lasteditdate, though it may not decrease the search domain enough to allow the full [a-z\/] character class after more than a few days:
insource:/\<\<[abc]/ lasteditdate:>=2025-12-01
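Generating such a sequence of chunked queries can be scripted. This is only a sketch; the chunk size of three and the query template are assumptions for this example (a "/" in the class would additionally need escaping as \/ in the real query):

```python
import string

def chunked_queries(chars, size=3, template="insource:/\\<\\<[{}]/"):
    """Split a large character class into chunks small enough for the
    trigram expansion to handle, yielding one insource query per chunk."""
    for i in range(0, len(chars), size):
        yield template.format(chars[i:i + size])

# One query per three-letter chunk of the full a-z class.
for q in chunked_queries(string.ascii_lowercase):
    print(q)
```

Each generated query then returns quickly on its own, at the cost of running several searches instead of one.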

insource:"ISBN" insource:/isbn[ 0-9]+\–/i. Note that the "–" in the regex is an EN DASH (U+2013); on deWp we have this magic keyword ISBN, which accepts only the ASCII "-" HYPHEN-MINUS (U+002D).

This is just complex... there's no real getting around it. I'd suggest using ISBN in the regex and dropping the case-insensitive /i unless you think that's necessary. You should also add - to the character class; right now it'll only match en dashes that come after the first batch of numbers.

So you currently match ISBN 978–3-404-18941-0, but not ISBN 3-87294-265-4–2. I also saw ISBN–3–486-56200-2, which is funky, but probably warrants fixing. So, my suggested regex (which still times out) is insource:"ISBN" insource:/ISBN[ 0-9-]*–[0-9]/. The trailing [0-9] also prevents matching a trailing en dash after the ISBN (like ISBN 3-406-52203-3 –).

Also, if possible, I suggest updating the ISBN magic keyword to allow en dashes, just because you only have to fix that once. I have no idea if that's feasible, but I did something similar for my pet peeve, homoglyphs, and it's so much easier in the long run.

Finally, I used this regex (which definitely times out) to find more characters besides en dash that appear in ISBNs: insource:"ISBN" insource:/ISBN[ 0-9-]*[0-9][^ =0-9-][0-9]/. It gets plenty of false hits, but I see . (period), _ (underscore), and ‐ (HYPHEN, U+2010)... and "ISBN 3-9256o8-02-8" with a lowercase "o"!

insource:/\<\<[a-z]/i lasteditdate:>=2025-12-01

works, and a sequence of

insource:/\<\<[AaBb]/

with "AaBb" iterating up to "YyzZ" works too, without any timeout.

I will address the ISBN search later.

TJones claimed this task.

For now the timeouts are set to reasonable limits, in terms of what our infrastructure can support.

There has been a lot of discussion here and some updates to documentation to help people build better queries that are less likely to fail, though it will probably continue to be a problem for those new to regex queries.