Page MenuHomePhabricator

Prefix search API doesn't return "index" field when "redirects" is enabled.
Closed, ResolvedPublic2 Story Points

Description

For example, consider the following query:

http://en.wikipedia.org/w/api.php?action=query&prop=pageterms&generator=prefixsearch&redirects=&gpssearch=obama&gpslimit=20

All of the search results that are not redirects correctly have an "index" field that specifies the result's position in the list that is shown to the user. However, the results that are redirects do not have an "index," so a consumer of the API no longer knows in which position to display these results.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Dbrant raised the priority of this task from to Needs Triage.Mar 16 2015, 2:27 AM
Dbrant added a subscriber: Dbrant.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 16 2015, 2:27 AM
Anomie added a subscriber: Anomie.Mar 16 2015, 1:10 PM

This is intentional, because doing anything else raises a lot of questions:

  • What index should be returned for the redirect target?
    • Is that necessarily valid when the redirect target wasn't actually returned as the result of the prefixsearch?
  • What if both the redirect and the target page are returned in the same result set?
  • What if multiple redirects were returned pointing to the same result page? Which one wins?
  • Would all end-users agree with you on the answers to all of these questions, or is it limited to your particular use case? What about for uses of this extra-data feature besides prefixsearch, would the answers still make sense?
  • What index should be returned for the redirect target?

It would be the index that the un-followed redirect result would have had.

Intuitively I would expect to see something like the following:

  • If the "redirects" parameter is enabled, the returned prefixsearch results would remain the same (with the same results and indexes), except some of them may have a "redirects-to" property that points to the target page. That way, we could show the user search results that look like "Result X (redirected from Result Y)".

If there is any duplication in the results due to the redirect target, we already combine duplicates into a single result (from doing prefixsearch and full-text search simultaneously), so it won't be an issue.

It would be the index that the un-followed redirect result would have had.

And what about all the rest of the questions?

If the "redirects" parameter is enabled, the returned prefixsearch results would remain the same

Sorry, that's not at all how the 'redirects' parameter works.

If generator=prefixsearch returns 5 pages that all redirect to the same target, the API result is going to have one entry for the target page, not 5. If you want each individual redirect returned, then don't specify 'redirects', but then it's not going to give you what you probably want for prop=pageterms.

Aklapper triaged this task as Low priority.Mar 17 2015, 12:10 PM

What currently influences the presence of redirects and the order in which they appear? The results seem pretty consistent but not fully deterministic. For example, the following query sometimes returns "Obama" at the top of redirects list, but other times returns "Obama crush" as the first redirect:

https://en.wikipedia.org/w/api.php?action=query&prop=pageterms&format=json&wbptterms=description&generator=prefixsearch&redirects=&gpssearch=obama&gpsnamespace=0&gpslimit=17

And when "Obama crush" isn't the first redirect on the list, it typically isn't on the list at all...

Restricted Application added a project: Discovery. · View Herald TranscriptJul 27 2015, 8:41 PM

What currently influences the presence of redirects and the order in which they appear?

The underlying search engine determines which results are returned from prefixsearch. It does not necessarily give the same set of items every time for the same query, and in particular items might move across the 'limit' boundary.

Items in the 'redirects' array in the result are not intentionally ordered. They're in whatever order the database chooses for the unordered bulk query of the redirect targets for all pending redirects in the result set, which is not required to be deterministic or consistent. For MySQL/MariaDB it's likely to use the primary key on the redirect table and thus will happen to return the redirect rows in page_id order.

So just to clarify, is there no way to preserve the index parameter from prefixsearch when resolving redirects? (i.e. include the original index with the redirected result) (In the case of multiple prefixsearch results pointing to the same redirect, the index can be that of the first prefixsearch result in the set)

Or, a broader question: How can we ascertain the intended order of our search results when redirects=true?

Ironholds moved this task from Needs triage to Search on the Discovery board.Aug 4 2015, 8:16 AM

Change 230521 had a related patch set uploaded (by EBernhardson):
Keep 'index' in the result set when resolving redirects

https://gerrit.wikimedia.org/r/230521

EBernhardson added a subscriber: EBernhardson.EditedAug 10 2015, 4:03 PM

What if we add redirect handling 'modes'? With no value or an unknown value in the redirects parameter you get the default handling (perhaps with a warning for unknown values), specifying redirects=preserve-index (or some such name) would do as is requested here? Or if overloading redirect is undesired a new query param can be defined.

What if we add redirect handling 'modes'? With no value or an unknown value in the redirects parameter you get the default handling (perhaps with a warning for unknown values), specifying redirects=preserve-index (or some such name) would do as is requested here? Or if overloading redirect is undesired a new query param can be defined.

I think that the behaviour requested here should be the default behaviour; there is an expectation if you request a list of search results for a given query that the results be ordered. I'm fine with adding an API parameter for the old behaviour.

@Deskana the problem with making anything the default parameter is that this redirects parameter is not specific to the search query, it is an API wide parameter that is already widely used.

@Anomie another option, what if we merged the generator data into the entry in 'redirects', then the client still gets the index and there is no 'random' decision happening in the api by chance of ordering.

Deskana added a comment.EditedAug 10 2015, 5:27 PM

@Deskana the problem with making anything the default parameter is that this redirects parameter is not specific to the search query, it is an API wide parameter that is already widely used.

I don't want us to build up to the state where when we're making recommendations to third parties about how to use our search API that we have to say "...and stick all these parameters on the end to make it behave like an actual sensible search API".

That said, Discovery's priority is higher for users of Wikimedia sites higher than for third party API users, so if we must add an API parameter for this to get a fix working fast, then we should do that.

Change 230585 had a related patch set uploaded (by EBernhardson):
Include generated metadata for redirects

https://gerrit.wikimedia.org/r/230585

Deskana raised the priority of this task from Low to High.Aug 10 2015, 7:44 PM

Changing priority to high to reflect the fact that Discovery is at work on this.

looks like this might get pushed back till next week, the primary api maintainer is on vacation.

Change 230521 abandoned by DCausse:
Keep 'index' in the result set when resolving redirects

Reason:
This solution won't work, sorry for the noise.

https://gerrit.wikimedia.org/r/230521

Deskana added a comment.EditedAug 13 2015, 8:35 PM

David's patch was abandoned, but there is a patch from @EBernhardson that needs review once @Anomie is back from vacation: https://gerrit.wikimedia.org/r/230585

Change 230585 merged by jenkins-bot:
Include generated metadata for redirects

https://gerrit.wikimedia.org/r/230585

This is now completed in terms of the API returning the appropriate information (in next weeks train deploy). The relevant consumers need to now be updated to use this information.

This needs documentation for how API consumers can update their clients. This is tracked in T110231.

Deskana reopened this task as Open.Aug 27 2015, 4:37 PM

Given the discussion taking place on T110231 where the end users of this change are unsatisfied, I am reopening this task for further review.

Given the discussion taking place on T110231 where the end users of this change are unsatisfied, I am reopening this task for further review.

The stuff they're wanting there isn't going to happen, so I'd say just close this as either Resolved or Declined. I don't care which.

The stuff they're wanting there isn't going to happen

I'm sorry. What? Yes it will. If we need to build a new API cluster to fix architectural mistakes, so be it, but "I don't like it" doesn't trump reality.

"I want it" doesn't trump the fact that it's not logically possible without special-case hacks all over the place either.

"I want it" doesn't trump the fact that it's not logically possible without special-case hacks all over the place either.

Then we need to build a new API? OK.

bd808 added a subscriber: bd808.Aug 27 2015, 11:18 PM

For the sake of the people in this discussion that aren't completely familiar with how the Action API works for a query like the Obama example, I'm going to try and give a basic outline. I just traced through all of this today for the first time, so if I get a few bits wrong here and there hopefully @Anomie and @EBernhardson can jump in to correct them.

This is the api query I'm going to walk through:
https://en.wikipedia.org/w/api.php?action=query&prop=pageterms&generator=prefixsearch&gpssearch=obama&gpslimit=20&redirects&continue

The query string here says:

  • action=query&prop=pageterms
    • I want a list of wiki pages including all terms associated with those pages.
  • generator=prefixsearch&gpssearch=obama&gpslimit=20
    • The pages I'm interested in will be found by using a prefix search with the search string 'obama' and I only care about the first 20 results.
  • redirects
    • When you make this list resolve all of the redirects you find so I only get a list of the destination pages.
  • continue
    • Tell me how to get to the second page of data in new school way (which I will ignore anyway because I get all 20 results in one go but it shuts up a warning).

The use of generator=prefixsearch throws in an implicit requirement as well:

  • When I do a prefix search, I want the final results to tell me what order the search returned things in rather than just having the results in the natural order of the database query for page information.

When api.php reads this set of instructions it does roughly this behind the scenes:

  1. Create an ApiQuery object to handle the request
  2. ApiQuery uses an ApiQueryPrefixSearch to run the equivalent of https://en.wikipedia.org/w/api.php?action=query&list=prefixsearch&pssearch=obama&pslimit=20&continue
  3. Loop over the results from the search and:
    1. Get the list of Title objects matching the data returned from ElasticSearch to feed into the next step
    2. Hang on to a nested array of [namespace][dbkey] => [ 'index' => order in prefix search results ] for each Title found to use later in adding index data to the final results
  4. Create an ApiPageSet to fetch data from the page table using the list of Titles from ApiQueryPrefixSearch
  5. Scan the SQL query results for redirects and while there are still redirects:
    1. Lookup the redirect target from the redirect table
    2. Replace the page info in the current results with the redirect target
    3. Keep track of all the from => to transitions that we resolve
    4. Check for redirects again in the updated list (resolves the redirects recursively until none are left)
  6. Iterate over the saved order information from ApiQueryPrefixSearch and merge it into the pages that are in our new resolved results (adds the index value for pages that matched directly)
  7. Iterate over the from => to transitions and merge in the saved index for any page where we have a match in saved order information from ApiQueryPrefixSearch. (This functionality was added in response to this bug by @EBernhardson)
  8. Output the results

So at the point that the results are output, we have a collection of pages which will have an index if they were mentioned by Title in the ApiQueryPrefixSearch results and (possibly) a collection of from => to redirect pairs that were resolved by ApiQuery. The redirect pairs will also have index data if the from Title appeared in the ApiQueryPrefixSearch results.

From the point of view of the generic ApiQuery module which can be driven by a number of different generator data sets, this is all a generic process. It just asks the generator for a list of titles, page ids or revision ids and uses that to find pages.

The hook points that are used to decorate the results with index data from the ApiQueryPrefixSearch is generic as well. Any generator can store any information based on a Title that it wants to have merged into the final page information and it will be merged in the same way. The lack of special handling for particular combinations of generators and properties makes the backing system relatively easy to maintain and extend.

The discussion after T110231#1575192 is asking for a perfectly reasonable sounding fix for the complexity of merging data from the redirects collection into the pages collection when generator=prefixsearch and redirects are used together from the perspective of an API consumer who is not using a higher level abstraction to manage the complexity of talking to the Action API. These users want a simple way to accomplish their end goal of a list of pages ordered by search relevance. This is the sort of nicety that an Action API wrapper (eg Pywikibot) generally provides.

As the current keeper of sanity and fixer of bugs in the whole Action API system, @Anomie is coming to this discussion with a perspective of wanting implemented solutions to be generic and to work properly (or at least consistently) no matter what combination of generators and properties are requested. From his point of view this isn't a problem getting ordered search results in a consuming application; it is instead a problem of deciding if there is a safe and sane way to fulfill the feature request at the general purpose API level that will be beneficial to multiple use cases or at the very least not detrimental to multiple use cases and painful to maintain.

If we can redirect the discussion back to attempting to decide if there are indeed universal answers to the questions @Anomie posed in T92796#1121265 or some other generic change to the Action API that would allow the desired result we have some chance of a productive outcome. There will be no productive result if the discussion degenerates into attacks on the basic infrastructure or @Anomie's duty to protect our ability to maintain and extend the system.

so if I get a few bits wrong here and there hopefully @Anomie and @EBernhardson can jump in to correct them.

As a general description it's accurate. There are some minor detail errors, e.g.

and (possibly) a collection of from => to redirect pairs that were resolved by ApiQuery.

it's actually ApiPageSet that resolves the redirects. I could go through and flag all of these, but I'm guessing that would be excessive detail that isn't really relevant to the discussion at hand.

As the current keeper of sanity and fixer of bugs in the whole Action API system, @Anomie is coming to this discussion with a perspective of wanting implemented solutions to be generic and to work properly (or at least consistently) no matter what combination of generators and properties are requested. From his point of view this isn't a problem getting ordered search results in a consuming application; it is instead a problem of deciding if there is a safe and sane way to fulfill the feature request at the general purpose API level that will be beneficial to multiple use cases or at the very least not detrimental to multiple use cases and painful to maintain.

I endorse this summary of my viewpoint.

So how can we move this forward? One idea could be setting some sort of merge policy on the ApiPageSet? ApiQueryPrefixSearch receives the ApiPageSet as an argument. it could do something like:

$pageSet->setRedirectMergePolicy( function ( $current, $new ) {
    if ( !isset( $current['index'] ) || $new['index'] < $current['index'] ) {
        return $new;
    }
    return $current;
} );

The default merge policy would always return $current which retains current behavior. Generators that want to have data merged can set their own policy as above. I havn't written any actual code so there might be some problems with this idea...

I spoke with @Anomie, and while he would prefer a clientside solution, and has concerns about api complexity, he does not object to @EBernhardson's merge policy proposal.

Deskana added a subscriber: TJones.Sep 3 2015, 3:21 AM

I spoke with @Anomie, and while he would prefer a clientside solution, and has concerns about api complexity, he does not object to @EBernhardson's merge policy proposal.

Nice. @EBernhardson, can you or one of the other devs take a stab at implementing this? @TJones was looking for work yesterday, perhaps he could pick it up.

EBernhardson removed EBernhardson as the assignee of this task.Sep 10 2015, 4:47 PM
TJones removed a subscriber: TJones.Sep 15 2015, 2:45 PM
EBernhardson edited a custom field.

Change 243908 had a related patch set uploaded (by DCausse):
Allow generators to keep track of resolved redirects

https://gerrit.wikimedia.org/r/243908

Change 243908 abandoned by DCausse:
Allow generators to keep track of resolved redirects

Reason:
Won't work (thanks for your time).

https://gerrit.wikimedia.org/r/243908

dcausse removed dcausse as the assignee of this task.Oct 6 2015, 4:33 PM
dcausse moved this task from Needs review to in progress on the Discovery-Search (Current work) board.

Change 244348 had a related patch set uploaded (by EBernhardson):
Implement ApiPageSet::setMergePolicy()

https://gerrit.wikimedia.org/r/244348

Change 244348 merged by jenkins-bot:
Implement ApiPageSet::setRedirectMergePolicy()

https://gerrit.wikimedia.org/r/244348

Anomie closed this task as Resolved.Oct 8 2015, 6:50 PM