
Regular expressions timeout in es2.x
Closed, ResolvedPublic

Description

Scenario: insource:// for other complex regexes finds answers and doesn't spin forever # features/insource_api.feature:104
  When I api search for all:insource:/[ab]*a[cd]{50,80}/                               # features/step_definitions/search_steps.rb:30
    Timeout::Error (Faraday::TimeoutError)

It's possible this is a problem with our test runner (previously we've used a large VM to run these), but I'm dubious. cirrus-browser-bot.eqiad.wmflabs running ES 1.7 returns a result in <1s, while searchdemo.eqiad.wmflabs running ES 2.3 takes 49s. A factor-of-50 increase seems worth looking into.

es 1.7: http://cirrustest-cirrus-browser-bot.wmflabs.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=all%3Ainsource%3A%2F%5Bab%5D%2Aa%5Bcd%5D%7B50%2C80%7D%2F&searchToken=3me4v1tohbkbf2irbhp5yv909
es 2.3: http://cirrustest-searchdemo.wmflabs.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=all%3Ainsource%3A%2F%5Bab%5D%2Aa%5Bcd%5D%7B50%2C80%7D%2F&searchToken=f3eurk9z7itlhzjoz8yobbvvy

Details

Related Gerrit Patches:
mediawiki/extensions/CirrusSearch (es2.x): Add support for max_ngram_clauses

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. · Apr 26 2016, 10:38 PM
Restricted Application added a subscriber: Aklapper.

Yes, probably a bug I introduced in the plugin.

Damn, this one will be complex... this regex generates more than 20000 boolean clauses.
Previously we used BooleanFilters (also 20000 generated)...
I don't know yet where the problem is, probably at multiple levels. I should maybe not rewrite the approximation (it was not rewritten before... but it somewhat breaks the Lucene contract).
Will continue to investigate...
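For intuition on why a short regex can balloon like this, here is a minimal sketch (not the plugin's code; it assumes the Lucene 5.x automaton API shipped with ES 2.x) that builds the character-level automaton for the offending regex and counts its states and transitions. The accelerated trigram query is generated by walking paths through an ngram automaton derived from this, which is where the clause count explodes.

```
import org.apache.lucene.util.automaton.Automaton;
import org.apache.lucene.util.automaton.RegExp;

// Rough illustration only: measure the size of the automaton behind the
// regex that times out. The boolean acceleration query is produced by
// combining trigrams along paths over an automaton like this, so even a
// modest state/transition count can expand into tens of thousands of clauses.
public class RegexAutomatonSize {
    public static void main(String[] args) {
        Automaton a = new RegExp("[ab]*a[cd]{50,80}").toAutomaton();
        int states = a.getNumStates();
        int transitions = 0;
        for (int s = 0; s < states; s++) {
            transitions += a.getNumTransitions(s);
        }
        System.out.println("states=" + states + ", transitions=" + transitions);
    }
}
```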

I optimized the regex and gained a few seconds, but it still takes ~30s.
I added a new option to use the deprecated filters and now it takes 3s (&cirrusRegexUseFilters=yes)...
I'm afraid all the boolean logic to create weights and scorers is not appropriate for this kind of monstrous boolean clause.
I don't know what to do here... I'll continue to have a look.
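To illustrate the scoring overhead being described, here is a sketch under assumptions (not the plugin's actual implementation; the field and helper names are made up): the same giant trigram disjunction can be wrapped so Lucene treats it as a constant-score filter, skipping per-clause ranking work, which is roughly what the deprecated filter path was buying.

```
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ScoringVsFilter {
    // Scoring path: every trigram clause participates in ranking.
    static Query scoringDisjunction(String field, Iterable<String> trigrams) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String gram : trigrams) {
            b.add(new TermQuery(new Term(field, gram)), BooleanClause.Occur.SHOULD);
        }
        return b.build();
    }

    // Filter-like path: same clauses, but wrapped so scores are constant and
    // the per-clause ranking machinery is not needed.
    static Query filterDisjunction(String field, Iterable<String> trigrams) {
        return new ConstantScoreQuery(scoringDisjunction(field, trigrams));
    }
}
```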

I imagine actual usage of something as bad as this has got to be minimal among real users. It would be nice to keep the possibility of doing crazy things, but as a fallback could we protect our cluster by adding new limits on how many clauses can be generated?

I suppose we could also file a bug with Elasticsearch if their query/filter unification has caused a 10x change in the performance of a particular style of query.

It's directly related to the Lucene filter removal refactoring. I suppose all the extra logic to build the boolean tree benefits other use cases; it can now do complex, optimized doc collection depending on how many docs each clause will return.
Also, the number of boolean clauses is limited to 1024 by default; here we build a complex tree that bypasses the limitation.
I suppose I can add a new param "transformer" that can be set to "boolean" or "filters". This will allow us to switch between the two implementations, and we'll still be able to use the deprecated code.
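To make the 1024-clause point concrete, a small sketch with illustrative names (not CirrusSearch code): Lucene's limit is BooleanQuery.getMaxClauseCount(), checked per BooleanQuery, so a flat query with too many clauses throws BooleanQuery.TooManyClauses, while spreading the same clauses across nested sub-queries stays under the per-level check, which is how a complex tree can bypass the limit.

```
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ClauseLimitSketch {
    // Flat: throws BooleanQuery.TooManyClauses once the number of terms
    // exceeds BooleanQuery.getMaxClauseCount() (1024 by default).
    static Query flat(String field, List<String> terms) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (String t : terms) {
            b.add(new TermQuery(new Term(field, t)), Occur.SHOULD);
        }
        return b.build();
    }

    // Nested: the same terms, grouped into chunks of 1000, each chunk its own
    // sub-query. No single BooleanQuery exceeds the limit, so the overall
    // clause count is effectively unbounded.
    static Query nested(String field, List<String> terms) {
        BooleanQuery.Builder outer = new BooleanQuery.Builder();
        for (int i = 0; i < terms.size(); i += 1000) {
            BooleanQuery.Builder inner = new BooleanQuery.Builder();
            for (String t : terms.subList(i, Math.min(i + 1000, terms.size()))) {
                inner.add(new TermQuery(new Term(field, t)), Occur.SHOULD);
            }
            outer.add(inner.build(), Occur.SHOULD);
        }
        return outer.build();
    }
}
```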

Concerning limiting the number of boolean clauses generated: I realize we have maxNgramsExtracted set to 100, so either it's a bug and this param is not working as expected, or maybe it's only supposed to limit the number of unique ngrams extracted.
I'll continue to investigate as it seems Nik thought about that with this param (https://gerrit.wikimedia.org/r/#/c/171459/).
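A toy sketch of the distinction being raised here (hypothetical helper, not the plugin's code): a maxNgramsExtracted-style cap bounds how many unique ngrams are collected, but the number of boolean clauses comes from combining those ngrams along the automaton's transition paths, so capping one does not necessarily cap the other.

```
import java.util.LinkedHashSet;
import java.util.Set;

public class NgramCapSketch {
    // Toy version of a "maxNgramsExtracted"-style cap: stop collecting new
    // unique trigrams once the limit is hit. This bounds the vocabulary of
    // ngrams, but not the number of boolean clauses, because clauses are
    // produced by combining these ngrams along every path through the
    // (possibly loopy) ngram automaton.
    static Set<String> extractTrigrams(String text, int maxUnique) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= text.length() && grams.size() < maxUnique; i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }
}
```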

dcausse added a comment. Edited · Apr 29 2016, 3:31 PM

OK, by degrading such monstrous clauses the performance is better.
The technique I used is to degrade the boolean expression into a flat disjunction. It's not clear if the optimization will still be worthwhile since we completely break the smart expression built from the automaton.
I think this is what Nik tried to do (in a smarter way) when he limited the number of ngram transitions generated (https://gerrit.wikimedia.org/r/#/c/171459/). It works, but unfortunately there are some loops in the transitions that may still explode when generating the boolean expression. I don't know if it's an exponential explosion, but here the ngram automaton has 120 transitions and will generate 38000 boolean clauses.
There is maybe a smarter way to do this and work only at the automaton level, or to degrade/optimize the boolean query, but I think it's starting to be too complex for me :)

The number of clauses has been limited to 1024 (same as the boolean query default); if it's higher we will try the degraded flat OR.
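A minimal sketch of this fallback, with illustrative names rather than the actual patch: count the clauses in the precise boolean tree derived from the automaton and, above the 1024 threshold, replace it with a flat OR over the unique trigrams.

```
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class DegradeToFlatOr {
    static final int MAX_NGRAM_CLAUSES = 1024; // same default as BooleanQuery

    // If the precise boolean tree would be too large, fall back to a flat
    // disjunction of the unique trigrams: cheaper to build and execute,
    // but a looser approximation.
    static Query accelerationQuery(String field, Query preciseTree, int clauseCount,
                                   List<String> uniqueTrigrams) {
        if (clauseCount <= MAX_NGRAM_CLAUSES) {
            return preciseTree;
        }
        Set<String> grams = new LinkedHashSet<>(uniqueTrigrams);
        BooleanQuery.Builder flatOr = new BooleanQuery.Builder();
        for (String gram : grams) {
            flatOr.add(new TermQuery(new Term(field, gram)), Occur.SHOULD);
        }
        return flatOr.build();
    }
}
```

The trade-off is selectivity: the flat OR matches more documents than the precise tree, so the exact regex check afterwards has to reject more false positives.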

https://gerrit.wikimedia.org/r/#/c/285955/ is deployed on searchdemo; the regex now takes less than 1s.
I need to create the pending patch in cirrus to include this new attribute.
I will fetch some insource queries from hive logs and check how many would be degraded and adjust the defaults if needed.

I will probably remove the old Filters code I re-added.

Change 286391 had a related patch set uploaded (by DCausse):
Add support for max_ngram_clauses

https://gerrit.wikimedia.org/r/286391

Out of over 1000 insource queries extracted from Hive, only 13 generate more than 1024 clauses: P2982

Change 286391 merged by jenkins-bot:
Add support for max_ngram_clauses

https://gerrit.wikimedia.org/r/286391

Deskana closed this task as Resolved.May 11 2016, 10:36 PM
Deskana triaged this task as Normal priority.