Page MenuHomePhabricator

[Bug] Unpredictable behavior with the order of Special:Search parameters
Closed, ResolvedPublic

Description

This works just fine:
https://de.wikipedia.org/wiki/Special:Search/intitle:Deutschland_prefix:Wiki

The exact same query parameters in an other order appear to do nothing:
https://de.wikipedia.org/wiki/Special:Search/prefix:Wiki_intitle:Deutschland

Deep in the back of my head I remember somebody talking about details like this being there for compatibility reasons with the predecessor of CirrusSearch. But this specific example just does not make any sense. The parser that parses the users input should extract keywords completely independent from each other, with a regular expression similar to [a-z]+:("[^"]*"|[^ ]*), which means it either extracts prefix:Wiki and intitle:Deutschland as two whitespace separated tags, or the user has to use quotes or \ escapes if this is not what he wanted.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 1 2017, 10:11 AM
thiemowmde triaged this task as High priority.Mar 1 2017, 10:11 AM

Sadly cirrus search has few "greedy" keyword such as prefix, morelike and maybe we can consider namespace prefixes similarly even if they are not parsed by cirrus itself.
These keywords do not stop parsing when they encounter a space so basically they are always put at the end.
We really want to get rid of them and force users to put quotes but that has serious compatibility issues with existing queries, there are a non negligible number of queries out there that use prefix: (frequently used to filter pages based on a naming convention like "/Archives").
We have plans to rewrite the parser and I hope we'll address this problem at this time.

Where is this code? Where is this documented? Why not just fix it? I really can't think of a user actually using this unpredictable behavior as an intended feature.

An intermediate fix would be to stop parsing on the next colon. This would still allow to search for prefix:Wikimedia Deutschland without quotes.

The only edge case I can think of is something like prefix:User talk:Thiemo, where the colon marks the end of a namespace. This should be quite easy to support because the parser already knows all namespaces and should be able to understand the difference between prefix:User talk:Thiemo referring to a namespace and prefix:User intitle:Thiemo referring to an other CirrusSearch keyword.

The affected keywords are all the direct concrete implementation of KeywordFeature:

ack-grep "implements KeywordFeature" includes/Query/
includes/Query/PreferRecentFeature.php
18:class PreferRecentFeature implements KeywordFeature {

includes/Query/SimpleKeywordFeature.php
13:abstract class SimpleKeywordFeature implements KeywordFeature {

includes/Query/MoreLikeFeature.php
10:class MoreLikeFeature implements KeywordFeature {

includes/Query/RegexInSourceFeature.php
20:class RegexInSourceFeature implements KeywordFeature {

includes/Query/LocalFeature.php
12:class LocalFeature implements KeywordFeature {

includes/Query/PrefixFeature.php
23:class PrefixFeature implements KeywordFeature {

We can't really fix it because the syntax is ambiguous in its current state.
I'd prefer not to hack these keywords with some fragile heuristics because it will then be hard to explain to some users why it behaves like this and will be harder for us to implement a new parser.
IMHO we should start to warn users that we plan to change these keywords behavior. Cirrus has all the necessary code to display warnings so I'd suggest to simply emit a warning when prefix/morelike/local includes an unescaped space without wrapping quotes as a first step.

[…] it will then be hard to explain to some users why it behaves like this […]

It already is. Can barely be much worse than in the real world example I run into.

I understand your point and totally agree with you but it's the existing behavior and it's currently used like that, expert users always use the prefix keyword at the end of the query. I can probably inspect our search logs to determine how frequent it's being used to accelerate the fix, but I'd prefer a plan like:

  • deploy a change to warn users that the syntax will change
  • wait X weeks
  • reimplement such keywords as SimpleKeywordFeature subclasses

To add more context: see for example https://en.wikipedia.org/wiki/Wikipedia:Help_desk/Header which generate prefix queries, this particular one does not seem to generate problematic queries but it's easy to imagine other forms in other languages that could rely on the current behavior.

To give some more context: I'm currently working on https://de.wikipedia.org/wiki/Benutzer:TMg/advancedSearch.js as part of a German-Community-Wishlist project. Because of what I learned here it looks like I have to sit down and actually encode the knowledge about greedy keywords into my tool. The most minimal thing I need to do is to change the parameter order when the user submits the query, which will add quite some confusion, because it will not reflect any more what the user intended to do. This is also bad from a developers perspective, because I'm going to hard-code this knowledge in just an other place.

It's also impossible to move greedy keywords to the end if there is more than one. Even after looking into the CirrusSearch source code I can't tell if prefix: is the only greedy one.

I think we all agree that greedy keywords are cumbersome and that we need to get rid of them.
I'd suggest to ask help from @debt or @Deskana to help prioritize this work.
The main problem here is compatibility issues with existing queries, AFAIK fixing this issue will break:

  • prefix query users using it without quotes (mostly expert users and maybe some forms to generate search queries)
  • morelike: used by mobile web/apps

Unfortunately I don't see a fix where we can get rid of this greedy keywords without breaking these clients, it's why I suggested a deprecation phase with warnings.
I'd suggest (if possible) to not implement the prefix keyword in your tool for now.
Perhaps an alternative solution would be to add a new non-greedy keyword like titleprefix that you could use in the meantime?

FYI, for now I implemented the knowledge about greedy search options in https://de.wikipedia.org/wiki/Benutzer:TMg/advancedSearch.js. I hope prefix: is the only one, because – as said – I will hit a wall when the second one appears.

We do not plan to add morelike: for now. Based on what I see in the MoreLikeFeature class it looks like it can not be combined with anything else anyway, so it doesn't make much sense to add it to an advanced search form that lets users combine keywords.

I don't think we ever will add new greedy keywords (by now we know it's not the best idea :). But we'd probably have to keep old one(s) working for a while.

Moving to later - there are a few things that need to be done before we can even start tackling this - if we decide to actually do it. :)

Change 371942 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Add subpageof keyword

https://gerrit.wikimedia.org/r/371942

Change 371942 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add subpageof keyword

https://gerrit.wikimedia.org/r/371942

Change 533528 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@master] Add morelikethis a non-greedy version of the morelike keyword

https://gerrit.wikimedia.org/r/533528

kostajh added a subscriber: kostajh.Sep 2 2019, 8:06 AM
dcausse claimed this task.Sep 2 2019, 8:08 AM
dcausse moved this task from in progress to Needs review on the Discovery-Search (Current work) board.

All greedy keywords will have a non-greedy equivalent once https://gerrit.wikimedia.org/r/533528 is merged.

Change 533528 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add morelikethis a non-greedy version of the morelike keyword

https://gerrit.wikimedia.org/r/533528

Change 534296 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.34.0-wmf.21] Add morelikethis a non-greedy version of the morelike keyword

https://gerrit.wikimedia.org/r/534296

Change 534296 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.34.0-wmf.21] Add morelikethis a non-greedy version of the morelike keyword

https://gerrit.wikimedia.org/r/534296

Mentioned in SAL (#wikimedia-operations) [2019-09-04T11:47:05Z] <dcausse@deploy1001> Synchronized php-1.34.0-wmf.21/extensions/CirrusSearch/: T159321: Add morelikethis a non-greedy version of the morelike keyword (duration: 00m 57s)

Change 534580 had a related patch set uploaded (by Kosta Harlan; owner: DCausse):
[mediawiki/extensions/CirrusSearch@wmf/1.34.0-wmf.20] Add morelikethis a non-greedy version of the morelike keyword

https://gerrit.wikimedia.org/r/534580

Change 534580 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@wmf/1.34.0-wmf.20] Add morelikethis a non-greedy version of the morelike keyword

https://gerrit.wikimedia.org/r/534580

Mentioned in SAL (#wikimedia-operations) [2019-09-05T11:48:31Z] <dcausse@deploy1001> Synchronized php-1.34.0-wmf.20/extensions/CirrusSearch/: T159321: Add morelikethis a non-greedy version of the morelike keyword (duration: 00m 59s)

debt closed this task as Resolved.Sep 5 2019, 6:21 PM