Page MenuHomePhabricator

"prefer-recent" keyword with out-of-bounds boost value breaks search
Closed, DeclinedPublicBUG REPORT

Description

Update: The original description was a contrived test case that's turned out to be more confusing than illuminating. A more apt description of the issue is that when CirrusSearch is given a prefer-recent parameter with a boost value that is out of bounds it returns zero search results. And the most direct way to reproduce is with these links: workingbroken.

Original task description follows:

Steps to replicate the issue (include links if applicable):

  1. Log out or open a private browsing window
  2. Open https://en.wikisource.org/wiki/Special:Search
  3. Enter tech news prefix:Wikisource:Scriptorium in the search box
  4. Observe obvious search results
  5. Add prefer-recent:2,30 keyword to the end of the search box (keeping the original search terms)
  6. Observe zero results returned

What happens?:
When the prefer-recent: keyword is present in the search phrase no results are returned.

What should have happened instead?:
The same results as when the prefer-recent: keyword is not present should be returned, only in a different sort order.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
dcausse closed this task as Invalid.EditedOct 11 2023, 3:29 PM
dcausse subscribed.

There are two problems with your query:

  • prefer-recent: expect a boost between 0 and 1, using 2 is not allowed and will fallback to default prefer-recent settings
  • the prefix: keyword is greedy so adding anything after it will be treated as the page prefix to search for.

To fix your query you have mutiple options:

  • reorder the keywords to make sure that the prefix is the last one: tech news prefer-recent:0.6,30 prefix:Wikisource:Scriptorium
  • use the subpageof keyword that is not greedy: wikisource:tech news subpageof:Scriptorium prefer-recent:0.6,30 (prefixing the query with wikisource: will force the query to run only on the Wikisource namespace, similar behavior can be achieved by selecting the Wikisource namespace from the UI). Note that subpageof has a slightly different semantic as it will only search for subpages (Scriptorium/*) such that pages like Scriptorium itself or a page like Scriptorium One are not part of the results

Please feel free to re-open this ticket if I misunderstood your issue.

Thanks for the explanation. Unfortunately, the actual case where this was reported was in a search link generated by Extension:InputBox, so I don't control the order of arguments.

The place it's used is at the top of s:Wikisource:Scriptorium (the local village pump) to search both the current page and the archives. The search box is transcluded from /Navigation, which calls s:Template:Engine, which in turn uses #tag:inputbox to call the extension. It has prefix and searchfilter parameters from which it generates the Special:Search.

This also apparently used to work at least as of 4 July 2021, so either CirrusSearch or Extension:Inputbox must have changed at some point after (modulo the invalid boost value).

@Xover I don't recall the prefer-recent keyword to ever have worked with a syntax like prefer-recent:2,30, searching via global search it appears that this page is almost the only one using this syntax across all WMF wikis.

Is it possible that the search box in /Navigation never actually worked since this edit?

Looking at the CirrusSearch codebase I don't any changes in this part of the code that could explain a possible regression since 2021, last changes in the source file appear to be in 2018.

I would suggest to fix /Navigation as I don't see anything to fix in the software.

@Xover I don't recall the prefer-recent keyword to ever have worked with a syntax like prefer-recent:2,30, searching via global search it appears that this page is almost the only one using this syntax across all WMF wikis.

Oh, this is interesting. I changed the boost value to 1 and now all of a sudden it is returning search results. But that makes no sense since 1) as you mentioned above an out-of-range boost value should be treated as the default boost value (.5?), and 2) provided your above analysis that the problem was caused by prefix: being greedy and prefer-recent being last, then changing the boost value should have no effect on this.

Extension:Inputbox would then seemingly be passing CirrusSearch different input based on whether the boost value is 1 or 2 (well, out-of-bounds vs. within-bounds, presumably).

But InputBox's code for this is just:

if ( $this->mPrefix !== '' ) {
  $htmlOut .= Html::hidden( 'prefix', $this->mPrefix );
}

if ( $this->mSearchFilter !== '' ) {
  $htmlOut .= Html::hidden( 'searchfilter', $this->mSearchFilter );
}

prefix was added in 2008 (by @Rainman I think), and searchfilter was added in 2017 (by @Harej in T147951). As best I can tell they've always been in this order, and there's no logic I can see that might affect it.

All it does is add hidden form fields, and while field order in HTML form submissions is not guaranteed, the respective URLs generated (namespace options removed manually for ease of reading) are:

https://en.wikisource.org/wiki/Special:Search?fulltext=Search+the+Scriptorium+%28includes+archives%29&fulltext=Search&prefix=Wikisource%3AScriptorium&search=tech+news&searchfilter=prefer-recent%3A1%2C30

https://en.wikisource.org/wiki/Special:Search?fulltext=Search+the+Scriptorium+%28includes+archives%29&fulltext=Search&prefix=Wikisource%3AScriptorium&search=tech+news&searchfilter=prefer-recent%3A2%2C30

workingbroken

Based on this it looks to me like this is CirrusSearch behaving differently in this case based on whether the boost value is out of bounds or not. Is it conceivable that when boost is out of bounds the whole prefer-recent term is treated like normal text instead of a keyword? As just text it would then be eaten by a greedy prefix:. If so that's a really non-intuitive failure mode, especially as it is supposed to just fall back to the default boost value. But even emitting an error message about the out-of-bounds boost value (or an invalid prefer-recent keyword) would be better than just not returning any results at all for this case.

Or could it be that a weight > 1, after it gets passed through exp() etc., blows up down in Elastica somewhere? The docs suggest it takes any float value, but once exp() functions are in the picture I guess overflowing a float type is a reasonable possibility to check. If so then the problem wouldn't seem to be that prefix: is eating it, but rather that the search itself blows up and returns nothing. In which case armor-plating and input validation would seem to be in order.

Xover renamed this task from "prefer-recent" keyword breaks search to "prefer-recent" keyword with out-of-bounds boost value breaks search.Oct 12 2023, 7:23 AM
Xover updated the task description. (Show Details)

Based on this it looks to me like this is CirrusSearch behaving differently in this case based on whether the boost value is out of bounds or not. Is it conceivable that when boost is out of bounds the whole prefer-recent term is treated like normal text instead of a keyword? As just text it would then be eaten by a greedy prefix:. If so that's a really non-intuitive failure mode, especially as it is supposed to just fall back to the default boost value. But even emitting an error message about the out-of-bounds boost value (or an invalid prefer-recent keyword) would be better than just not returning any results at all for this case.

Sorry about that I should have been clearer, the behavior of prefer-recent: is indeed to not consume its value if what's behind is not understood, reason is that we have queries like prefer-recent:tech new where the user wants to search for tech news using default prefer-recent settings (0.6,160). So a query like prefer-recent:tech new is roughly equivalent to tech news prefer-recent:. Another silly example might be prefer-recent:3,14 for searching for pages recently edited that contain an approximation of pi in a language that uses a comma for the decimal point. In your case prefer-recent:2,30 is indeed searching for the text "2,30" which is why it's not finding any results in the /Navigation use-case.
I believe that this behavior was inherited from the search system that predates CirrusSearch.
Changing this behavior now does not seem wise to me as it might break existing queries.

The reason we do not accept boost values greater than 1 is because the function is designed to ultimately compute a factor that is between 0 and 1, allowing more than 1 would result in a factor greater than 1 for recent articles, there's nothing wrong with this and elasticsearch would be fine with it but it's simply not how the function was originally designed.

Regarding InputBox, I believe there never was any problem there, the prefer-recent: portion has always been managed differently by passing the &prefix= URI parameter to Special:Search.
In the past Special:Search used to rewrite the query by appending it at the end of the user query but in T198318 we changed this to keep this prefix context internal such that the query you see in the search box does not contain the prefer-recent: keyword.
So your initial problem never was related to the greediness of the prefer-recent: keyword, I think I got confused by your test case.

All that to say that I'm not sure there are any regression here and I don't think we should accept boost values outside the [0, 1] range because it would be a breaking change in term of query parsing logic.

I understand that it might confusing when you write something that resembles a value that prefer-recent: could accept such as 2,30 and perhaps showing a warning might help, the hard part is detecting what looks like a plausible value for prefer-recent:.
Would this make sense to you? If yes could you re-phrase the ticket accordingly because with its current wording I'm not sure we could do much and we might decline it again. Thanks!

Man, this is getting complicated…

Is there a valid use case for a bare prefer-recent: keyword? When would one want to explicitly ask for the default values in the query? And relatedly, when would one use that without a space between the prefer-recent: keyword and the following search term?

I think there is a reasonable expectation that when giving CirrusSearch something that is a syntactically valid keyword (i.e. prefer-recent:n,n) that it will behave like function arguments or command line arguments or template arguments. And when such an argument is syntactically valid only with values outside the valid range (i.e. prefer-recent:2,30) that the behaviour doesn't change but you either get default values or an error message. This expectation is reinforced when using it by way of Inputbox because there you're really giving arguments to a parser tag rather than interactively typing text into a semi-DWIMy text field.

Or put another way, if you recognize that part of the input as a prefer-recent: keyword my expectation would be that it be consumed; and if its argument value is invalid the handling of that is a separate issue (using defaults, throw error, etc.).

But I'll have to try to wrap my head around the larger context here to see if there is a reasonable proposed change in there somewhere.

Gehel subscribed.

I'm closing this as "work as expected". I fully understand that the expectation is complicated and confusing and that this might need to be addressed. But probably as part of a larger redesign of how the query parser works as a whole. Feel free to re-open if you disagree.