Page MenuHomePhabricator

regex highlighting doesn't always timeout as expected
Closed, ResolvedPublic

Description

I was testing something on wikitech and issuing the query insource:/@covers/ times out as expected and returns partial results, but searching insource:/@covers/ covers doesn't time out appropriately and instead returns a query failure. Monitoring the task from elasticsearch tasks api shows it takes ~20 minutes to finish.

The actual regex is expensive because @ apparently matches "any string" in the lucene regex syntax and should have been escaped in my query. Still though the inability of the regex to abort seems concerning.

I started to investigate but yes the 20s shard timeout is not reached before the curl request timeout.
Looking at the stacktrace I failed to see one using the source_regex staying around for long...
Asking elastic to return 0 results (to disable the fetch phase) the time spent is reasonable (20-30sec) and the timeout is properly triggered:

dcausse@elastic1041:~$ time curl -XPOST 'localhost:9200/labswiki_content/page/_search?timeout=20s' -d @q.json 
 {"took":33967,"timed_out":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":11,"max_score":0.0,"hits":[]},"suggest":{"suggest":[{"text":"covers","offset":0,"length":6,"options":[]}]}}
 real	0m33.989s
 user	0m0.012s
 sys	0m0.004s

Removing the highlight query and re-enabling the fetch_phase by asking to return 20 docs I get reasonable times as well:

dcausse@elastic1041:~$ time curl -XPOST 'localhost:9200/labswiki_content/page/_search?timeout=20s' -d @q_no_highlight.json 
{"took":35727,"timed_out":true,"_shards":{"total":1,"successful":1,"failed":0},"hits":{"total":11,"max_score":18.531656,"hits":[{"_index":"labswiki_content_1521968540","_type":"page","_id":"7559","_score":18.531656,"_source":{"redirect":[{"namespace":0,"title":"Packaging"}],"namespace_text":"","wiki":"labswiki"," [...] estamp":"2016-11-04T01:08:22Z"},"fields":{"text.word_count":[97744]}}]},"suggest":{"suggest":[{"text":"covers","offset":0,"length":6,"options":[]}]}}
real	0m35.750s
user	0m0.008s
sys	0m0.008s

So the problem is definitely in the highlighter.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
EBernhardson updated the task description. (Show Details)May 24 2018, 2:56 PM

Also worth noting that labswiki_content has 9k pages and contains only 163mb of primary data. Even running the regexp against everything should be reasonably quick, so something else is going on here. I'm suspicious, with no proof, that this is going to be related specifically to the @ in the query somehow causing an explosion in highlighting complexity.

Change 435282 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/highlighter@master] Improve regex highlighting performance

https://gerrit.wikimedia.org/r/435282

Change 435310 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/extra@master] Improve regex performance with leading wildcards

https://gerrit.wikimedia.org/r/435310

TJones added a subscriber: TJones.May 30 2018, 8:47 PM

Since Erik gathered all the regex search data and shared it with me, I couldn't help myself and I did a quick survey of it and put it in the usual place on MediaWiki. Three patterns of searches account for almost 94% of regex searches—interwiki links, URLs, and Library of Congress collection IDs.

Read more on MediaWiki.

Change 435310 merged by jenkins-bot:
[search/extra@master] Improve regex performance with leading wildcards

https://gerrit.wikimedia.org/r/435310

Change 435282 merged by jenkins-bot:
[search/highlighter@master] Improve regex highlighting performance

https://gerrit.wikimedia.org/r/435282

EBernhardson added a comment.EditedJun 1 2018, 4:49 PM

The root cause of the query that triggered this issue turned out to be O(length^2) worst case behaviour of the regex handling. This was significantly worse in the highlighter because the highlighter pulls all matches from the source, where the query portion only had to find one match to accept a document. The typical case for these algorithms is O(length), which is why we havn't seen a problem before, but regex's prefixed with unrestricted wildcards (.+, .*, etc) could trigger the worst case behavior.

The highlighter was fixed by switching to a two-pass algorithm when a leading unrestricted wildcard is detected. The two pass algorithm first runs a reverse regex over the source to find all valid start points of the regex. The standard forward regex can then lookup the next valid start point in a bitset instead of applying the DFA to all positions of the source. Overall this uses a bit more memory, but guarantees we find all non-overlapped matches in the source in O(length) time.

The query was fixed by prepending .* to all regexes, which means we can scan the string a single time from beginning to match (or end), rather than trying all the possible start points. This also reduces the worst case complexity from O(length^2) to O(length).

Both of these changes require adjusting the regex the user gives us, which makes queries more likely to hit the max determinized states limit of the DFA, and makes determinizing them take more cpu/memory. To check impact i pulled all regexp's run in production for the last 90 days (6.5M queries). These regular expressions were all run through the old and the new code. Only 6 queries out of the 6.5M are rejected by this new code but not the old code. The old code took on average 1.0ms to parse a regex, the new one increased to 1.1ms which is fairly inconsequential. Both seem a reasonable tradeoff.

Smalyshev triaged this task as Normal priority.Jun 1 2018, 6:54 PM
debt closed this task as Resolved.Jun 14 2018, 8:17 PM
debt added a subscriber: debt.

whew, lots of work!

Vvjjkkii renamed this task from regex highlighting doesn't always timeout as expected to 0ccaaaaaaa.Jul 1 2018, 1:08 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii raised the priority of this task from Normal to High.
Vvjjkkii removed EBernhardson as the assignee of this task.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot assigned this task to EBernhardson.
CommunityTechBot lowered the priority of this task from High to Normal.
CommunityTechBot closed this task as Resolved.
CommunityTechBot renamed this task from 0ccaaaaaaa to regex highlighting doesn't always timeout as expected.
CommunityTechBot added subscribers: gerritbot, Aklapper.