Page MenuHomePhabricator

Unexplainable .* behavior in intitle regex search
Open, MediumPublic8 Estimated Story PointsBUG REPORT

Description

There's a discussion at English Wikipedia about why

intitle:"Clover" intitle:/Clover.*West Virginia/

produces no results, while, for example

intitle:"Clover" intitle:/Clover.*West /
and
intitle:"Clover" intitle:/Clover.*\sWest Virginia/

currently find three pages.

This looks like a bug, but it's hard to tell where it's coming from.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Update search-extra plugins to wmf2repos/search-platform/opensearch-plugins-deb!22ebernhardsonwork/ebernhardson/bump-extra-wmf2master
Customize query in GitLab

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Also compare
intitle:"Clover" intitle:/Clover.*West Virginia/ (0 results)
intitle:"Clover" intitle:/Clover.*West\sVirginia/ (3 results)

I started looking over this, unfortunately it's a bit complex. Something appears to be going awry during regexp query acceleration. We essentially rewrite something like Clover into a boolean trigram expression like (clo AND lov AND ove AND ver), but also handling regexp syntax. This tends to generate much longer expressions than necessary, so there is also a step that simplifies the boolean expression. Plugging this query into our test suite it appears we are generating some non-sensical trigrams from the .* portion, and then the simplify stage happens to be taking those. I haven't pinned it down exactly yet, but i suspect the query variants that work are due to the simplification stage choosing different trigrams.

There is nothing obvious that has changed here recently. We did add some functionality to regexp recently, but it shouldn't have affected this part of the query. It will take a little more time to figure out what's wrong and come up with a fix.

I looked into this locally and it seems like null-ngram transitions from wildcard ranges are consuming the maxTransitions budget (max_ngrams_extracted=100), which may cause trigrams like wev from DFA fallback paths instead of valid ones. I tested a small change in NGramAutomaton.traceRemainingStates() that only counts trigram-producing transitions toward the limit, and it appears to fix the issue while passing all existing tests.

EBernhardson triaged this task as Medium priority.
EBernhardson set the point value for this task to 5.

>>! In T424820#11888501, @HakanIST wrote:

I looked into this locally and it seems like null-ngram transitions from wildcard ranges are consuming the maxTransitions budget (max_ngrams_extracted=100), which may cause trigrams like wev from DFA fallback paths instead of valid ones. I tested a small change in NGramAutomaton.traceRemainingStates() that only counts trigram-producing transitions toward the limit, and it appears to fix the issue while passing all existing tests.

Thanks for looking! I've indeed come to the same conclusion that the root cause is probably related to the way the algorithm performs when it hits the transition cap. I played around with changing the way we count states, also played around with merging transitions that go to the same destination. But I have an inkling that these fixes are both bandaids, allowing it to walk more transitions (by not counting some of them) and get into a place where we have enough content in the acceptStates. But those aren't guarantees, i suspect they just happen to allow walking far enough in the regex automaton to get an acceptable result.

Currently i'm re-reviewing the initial writeup on this technique and comparing it to the linked implementations to see if perhaps there are some subtle ways that we are handling the acceptStates different than intended. If i can't find the deeper root cause will ship the bandaid, but I suspect that this error is not an isolated occurance and has existed for a decade+ in our implementation. There are likely to be complex regex queries that appear to work but don't actually return all the expected matches. It happened to trigger now because we started rewriting . into a character class that rejects some utf-8 non-characters, turning a 2-state transition into more states (due to how the * causes the regex to continually loop back on itself).

Great analysis! I tested the bandaid fix locally and it breaks existing tests (bigButNotTooBig, maxNgrams, etc.) via expression explosion, so you are right that the budget is accidentally guarding against that. The key issue seems to be that when budget exhaustion pushes a state into acceptStates, it loses all downstream trigrams, effectively severing the AND relationship between regex segments like "Clover" and "West Virginia".

Went down a few wrong paths, but should have a fix ready for this today or tomorrow. A quick summary:

  • Investigated adjusting the way the current algorithm counts transitions, but it seems to push the errors further down into the algo and not fully resolve the issue
  • Investigated using an algorithm to find the dominant path(s) through the DFA to guide the walking. This does resolve the initial problem with finding islands of characters, and avoids hitting the transition max by not getting stuck in loops, but the walk is still full of heuristics. For example loop unrolling like ab{2,}a should resolve to aba AND bab but is tedious to get out of that system. In general any looping in the DFA is still tedious.
  • While working on the dominance path version i realized lucene exposes it's AST, which is exactly what the original google codesearch implementation worked off of. After working through it, the AST approach seems strictly better than operating on the DFA. The problem is the DFA has spread various regex constructs out into the graph edges and nodes, and we have to basically de-compile the DFA to understand what needs to move into the trigram expression. The AST version is not heuristic based, it has rules for various regex constructs and simply follows those rules.

The AST version is mostly ported over, i'm working on the test suite now and will pull some queries from our logs to run through and make sure nothing blows up.

Change #1285888 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[search/extra@master] Replace regex acceleration algorithm

https://gerrit.wikimedia.org/r/1285888

Tested the Gerrit patch locally, all tests pass. The Clover.*West.*Virginia case now correctly extracts all 12 trigrams. Great fix!

EBernhardson changed the point value for this task from 5 to 8.Tue, May 19, 5:22 PM

Change #1285888 merged by jenkins-bot:

[search/extra@master] Replace regex acceleration algorithm

https://gerrit.wikimedia.org/r/1285888

Next steps:

  • Release a new version of the search-extra plugin for 2.x
  • Update the plugins .deb
  • Update docker-registry.wikimedia.org/repos/search-platform/cirrussearch-opensearch-image

All of this should be done before the rollout of 2.x to prod clusters

Do we need to update ttmserver as well?

Do we need to update ttmserver as well?

ttmserver wont need the specific changes in this update, there is no functionality change to the parts the ttmsever uses. It might still be worth deploying it to all clusters that use the .deb to keep things consistent, but it can probably wait to pick up the change whenever it happens to be next roll-restarted.