Page MenuHomePhabricator

Search for nonsense "xxxxaaaa" "yyyyybbbbb" finds matches
Closed, ResolvedPublic

Description

https://de.wikipedia.org/w/index.php?search=%22xxxxaaaa%22+%22yyyyybbbbb%22
always results in 3 non-obvious matches on WP:de

https://en.wikipedia.org/w/index.php?search=%22xxxxaaaa%22+%22yyyyybbbbb%22
always results in 4 non-obvious matches on WP:en

(with overlapping topics, all dealing with special characters)

You can change the search strings to something else senseless, but this will not have an influence on the results.

By skipping quotation marks, there will be no more matches.

Event Timeline

Restricted Application added projects: Discovery, Discovery-Search. · View Herald TranscriptJul 18 2017, 11:25 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 366292 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Don't issue near match query with empty query

https://gerrit.wikimedia.org/r/366292

Change 366292 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Don't issue near match query with empty query

https://gerrit.wikimedia.org/r/366292

EBernhardson added a comment.EditedJul 27 2017, 6:59 PM

This patch had an unexpected side effect found by our browser tests, but not noticed before merging because they were not running.

Basically: http://cirrustest-cirrus-browser-bot.wmflabs.org/w/api.php?action=query&enablerewrites=0&format=json&formatversion=2&list=search&srnamespace=0&sroffset&srprop=snippet%7Ctitlesnippet%7Credirectsnippet%7Csectionsnippet%7Ccategorysnippet%7Cisfilematch&srsearch=catapult+-%22asdf%22+%22two+words%22
This is a search for catapult -"asdf" "two+words"

This should not return the document 'Catapult', because the content contains the word asdf in the content of the article. Prior to this patch the query looked like:

"bool": {
    "minimum_should_match": 1,
    "should": [
        {
            "query_string": {
                "query": "catapult  NOT (all.plain:\"asdf\"~0^1)   (all.plain:\"two words\"~0^1)",
                "fields": [
                    "all.plain^1",
                    "all^0.5"
                ],
                "auto_generate_phrase_queries": true,
                "phrase_slop": 0,
                "default_operator": "AND",
                "allow_leading_wildcard": false,
                "fuzzy_prefix_length": 2,
                "rewrite": "top_terms_blended_freqs_1024",
                "max_determined_states": 500
            }
        },
        {
            "multi_match": {
                "fields": [
                    "all_near_match^2"
                ],
                "query": "catapult   "
            }
        }
    ]
}

The only change from this patch is in the near match, which now looks like:

{
    "multi_match": {
        "fields": [
            "all_near_match^2"
        ],
        "query": "catapult"
    }
}

This looks like a pre-existing problem that was hidden by the extra spaces which allowed this test to pass, specifically the not filters in query string can be overridden by a catapult match. Not sure what the best solution to this is, it seems the not filter needs to make its way into an actual filter and not bool'd with the near match query. I worry though that simply extracting the negate and putting it into a filter will be a performance problem (maybe?) as we would essentially be issuing a query string query for "NOT asdf"

@dcausse Any thoughts on the change in behaviour shown above?

perhaps a solution would be changing should:[ query_string, near_match] into must: [query_string], should: [near_match] but i don't have a strong enough understanding of when the near match query is supposed to match, or really what the purpose of it is...

Same problem for another browser test: http://cirrustest-cirrus-browser-bot.wmflabs.org/wiki/?search=intitle%3ARevDelTest+"delete+this+revision"

this should only match pages with "delete this revision" but due to the bool should matching with near match, a page without "delete this revision" matches.

Change 368316 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Fix regression from trimming near match queries

https://gerrit.wikimedia.org/r/368316

This is the problem that just keeps on giving.

I thought perhaps i could fix the problem by changing the near match into a bool query, with basically bool: { should: [ near_match ], filter: [ query_string] }, filling the query string query with things that didn't make it into the near match.

This ends up breaking another query in our test suite, catapult NOT "two words". This is because the near match query is being performed against catapult NOT. Obviously this is invalid, but its what the code has "always" done. The problem is then my query string filter becomes a positive filter for "two words", and not the expected negative filter. Not sure what to do with this still....

Since this has turned into a bit of a minefield, for what was supposed to be a simple fix taking an hour or so, i'm tempted to roll back the initial fix and put this into our queue of things to fix when we revisit query parsing. Thoughts?

Change 368320 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Revert "Don't issue near match query with empty query"

https://gerrit.wikimedia.org/r/368320

Change 368316 abandoned by EBernhardson:
Fix regression from trimming near match queries

Reason:
just turns up more issues from the browser tests...

https://gerrit.wikimedia.org/r/368316

dcausse added a comment.EditedJul 28 2017, 8:19 AM

@EBernhardson there're definitely tons of small unexpected and uncontrolled behaviors in the current parsing logic and your initial fix is not to blame, it was just right imo:

  • we trim a query made of spaces, there's no reason not to do the same for the nearmatch remainder.

Maybe instead of trimming we could clean the nearmatch if it's only made of spaces?

Change 368320 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Revert "Don't issue near match query with empty query"

https://gerrit.wikimedia.org/r/368320

Change 368364 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Don't issue near match query with empty query

https://gerrit.wikimedia.org/r/368364

looks like it passes browser tests and reading back your commit message it looks like it was your initial intention Don't issue near match query with empty query.
Nevertheless trimming sounded right to me and it failed for the reasons you mentioned earlier that I agree to fix when we revisit the parser. It'll be much easier to accept and justify some changes in behavior when refactoring the parser.

Change 368436 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/CirrusSearch@master] Don't issue near match with empty query

https://gerrit.wikimedia.org/r/368436

Change 368436 abandoned by EBernhardson:
Don't issue near match with empty query

Reason:
turns out dcausse submitted a patch for same thing already in I9677b2be00692fa9b7712efb64b68e93346289bf

https://gerrit.wikimedia.org/r/368436

Change 368364 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Don't issue near match query with empty query

https://gerrit.wikimedia.org/r/368364

debt closed this task as Resolved.Jul 28 2017, 7:20 PM
debt triaged this task as Normal priority.