Page MenuHomePhabricator

Inconsistent search results when combining list=search and generator=search
Closed, ResolvedPublic

Description

Background:
In our Mobile apps, we're implementing full-text searching when the user types a search term. The way we query the api is as follows:

  • We use prop=pageprops with generator=search to get the list of results, which also gives us the image thumbnail and Wikidata ID for each result, but this returns the results in the wrong order, so:
  • We combine the query with list=search, which should give the same set of results (albeit without thumbnails or Wikidata ID), but in the correct order.
  • We then correlate the data from the two lists (expecting both lists to contain the exact same items) to arrive at the full results in the correct order.

For example, our query is something like this:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&list=search&prop=pageprops&format=json&srsearch=Barack&srnamespace=0&srwhat=text&srinfo=suggestion&srlimit=12&ppprop=wikibase_item&generator=search&gsrsearch=Barack&gsrnamespace=0&gsrwhat=text&gsrprop=redirecttitle&gsrlimit=12

However:
The two sets of results often don't match up with each other -- there are sometimes one or more pages in one list that are not present in the other. What's even more strange, this doesn't happen every time: the lists might match up in one instance, and then be different in another (for the exact same query).

For example, try executing the query linked above, and observe the last item in the returned "pages" array. 4 out of 5 times, the item will be "There's No One as Irish as Barack O'Bama", but sometimes it becomes "Statewide opinion polling for the United States presidential election, 2012", which is no longer consistent with the other list returned by list=search.


Version: unspecified
Severity: normal

Details

Reference
bz73623
Related Gerrit Patches:

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:54 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz73623.
bzimport added a subscriber: Unknown Object (MLST).
Dbrant created this task.Nov 19 2014, 9:40 PM

Just looking at the "last" item in the pages object isn't going to tell you much, because JSON doesn't guarantee ordering in hashes.

After repeated trials, I cannot reproduce a single trial where 'pages' contains a title not in 'search' or vice versa. If this does somehow occur, it's most likely due to the underlying search engine rather than anything in the API.

Just to clarify, the "search" results list is not a hash - it's an array. It was needed because the "pages" results are a hash, so the order was meaningless.

(In reply to Monte Hurd from comment #2)

Just to clarify, the "search" results list is not a hash - it's an array. It
was needed because the "pages" results are a hash, so the order was
meaningless.

Why not change the search results lists to allow it to include the properties that you need? That seems about 1000000% more efficient than making two http calls.

Search results should be mostly consisten across requests but there is nothing that keeps them entirely consistent. We're changing the underlying index constantly and we make no effort to cache because the tail is so long. You search may not even be sent to the same set of machines when you repeat it. This could cause slight shifting around the edges.

It'd be a bug if something was in the middle one time and not there the next.

Seriously, make just one query. Save some load on the search cluster and save the user some time.

Results set for "fish" before update

Attached:

Result set for "fish" after update

Attached:

Those results are super duper different.

Could you turn those screenshots into urls?

Also Cirrus intentionally returns redirects in the list so that funny search for fishspinner is actually a result for tropical cyclone. One thing we could do is penalize results that we find via a redirect as compared to a page title match. We don't do that right now. It'd probably make the results make more sense.

Still I think the underlying issue is that the list=search api doesn't return lwhat you want it to.

Mhurd added a comment.Nov 20 2014, 1:37 AM

Note that all of my usual daily search testing terms - "art", "cat", "bird", "toast", have underwent a similar degradation as "fish".

The first result is still always a match, but, for example, the second result when searching for "bird" is now "Bird-Magic rivalry", which may be a great article, but it's probably not the second-most relevant bird article on enwiki :)

I'll work on the relevance issue. Is this bug about that or the different between list=search and generator=search? I don't know much about the api issue but I can totally work on relevance.

Mhurd added a comment.Nov 20 2014, 2:05 AM

Good point - we may be conflating two issues. Thanks for tackling the relevance part!

I think this bug could then just track the "different between list=search and generator=search" part if that sounds ok?

OK. I'm not working on that now because I don't think it's as bad. Also I don't understand our API too well so I'd have to do some research. I don't even know enough to know if the bug is in CirrusSearch.

I'm going to reset the assignee to default for now.

Finally: you really should be able to get the results in one query. That you can't is also a bug.

I'm tracking the relevance issue as bug 73636

(In reply to Nik Everett from comment #12)

OK. I'm not working on that now because I don't think it's as bad. Also I
don't understand our API too well so I'd have to do some research. I don't
even know enough to know if the bug is in CirrusSearch.

The general idea is that lists return results directly, and generators use the output of a list to feed into another module (like a UNIX pipe). So generators essentially are the intended way to combine modules in one query like this. The problem is for some reason list= is also used/necessary in this case. See https://www.mediawiki.org/wiki/API:Query#Generators

(In reply to Monte Hurd from comment #8)

Here is the exact query the iOS app used for those two screenshots:
https://en.m.wikipedia.org/w/api.
php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace=
0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=14
4&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish

I notice this does not use indexpageids. So you're only getting it in the correct order if whatever iOS JSON parser you're using preserves order in hashes/associative arrays (possible, but not certain).

See https://en.m.wikipedia.org/w/api.php?action=query&format=jsonfm&generator=prefixsearch&gpslimit=24&gpsnamespace=0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=144&ppprop=wikibase_item&prop=pageprops|pageimages&pslimit=24&pssearch=fish&indexpageids for a version with indexpageids with a clear array of results.

Also, it's using generator=prefixsearch not generator=search (which comment #0 mentions).

Mhurd added a comment.Nov 20 2014, 4:57 AM

Matt - it seems the indexpageids are not in the right order.

And the native json parser on iOS doesn't preserve order within resulting associative arrays - just arrays. I believe this is true w/android as well.

We could write custom parsing logic to determine order if these hash entries were made to be in the correct order, but that's a hack every non-php consumer of this data would have to implement.

Would be perfect if the entries could be both ordered correctly and returned in order-preserving json array format. Can there be a flag that just caused results to come back array format?

Mhurd added a comment.Nov 20 2014, 5:02 AM

*to clarify, the array format flag would return array of page hashes, rather than hash of page hashes*

(In reply to Matthew Flaschen from comment #13)

(In reply to Nik Everett from comment #12)

OK. I'm not working on that now because I don't think it's as bad. Also I
don't understand our API too well so I'd have to do some research. I don't
even know enough to know if the bug is in CirrusSearch.

The general idea is that lists return results directly, and generators use
the output of a list to feed into another module (like a UNIX pipe). So
generators essentially are the intended way to combine modules in one query
like this. The problem is for some reason list= is also used/necessary in
this case. See https://www.mediawiki.org/wiki/API:Query#Generators

This. The general problem people seem to be running into with search as a generator is that they want to keep some indication of the ranking despite generators currently being defined as producing an unordered list of pageids (or revids).

I'm not opposed to allowing generators to provide additional properties for the generated pages (reopening bug 14859, although I'd probably do it in a different manner than requested there) but it'll take some thought as to how to best do that. I'll put something on [[mw:API/Architecture work/Planning]] so I don't forget to look into it.

As for how the API code works in this situation, it instantiates two instances of ApiQueryPrefixSearch, and each instance runs the same code to get a list of titles:

$searcher = new TitlePrefixSearch;
$titles = $searcher->searchWithVariants( $search, $limit, $namespaces );

The processing of that list of titles is different, but it should theoretically be the same titles in the list. What seems to be going on is that this code is not entirely deterministic.

(In reply to Monte Hurd from comment #8)

Here is the exact query the iOS app used for those two screenshots:
https://en.m.wikipedia.org/w/api.
php?action=query&format=json&generator=prefixsearch&gpslimit=24&gpsnamespace=
0&gpssearch=fish&list=prefixsearch&pilimit=24&piprop=thumbnail&pithumbsize=14
4&ppprop=wikibase_item&prop=pageprops%7Cpageimages&pslimit=24&pssearch=fish

I note you're using gpsnamespace=0, but not psnamespace=0. In this case that doesn't matter, though, since 0 is the default.

I have been able to reproduce the issue with this query, BTW.

(In reply to Matthew Flaschen from comment #13)

I notice this does not use indexpageids. So you're only getting it in the
correct order if whatever iOS JSON parser you're using preserves order in
hashes/associative arrays (possible, but not certain).

They're presumably using query.prefixsearch (which *is* a JSON array) for the ordering, using the pageids from there to look up the additional info in the query.pages hash.

(In reply to Brad Jorsch from comment #16)

As for how the API code works in this situation, it instantiates two
instances of ApiQueryPrefixSearch, and each instance runs the same code to
get a list of titles:
$searcher = new TitlePrefixSearch;
$titles = $searcher->searchWithVariants( $search, $limit, $namespaces );
The processing of that list of titles is different, but it should
theoretically be the same titles in the list. What seems to be going on is
that this code is not entirely deterministic.

That is why I originally WONTFIXED it. Searches aren't deterministic because the data changes under them. lsearchd was _more_ deterministic than cirrus because it changed slower and had fewer shards and replicas. The list shouldn't drastically change but it will change some. How drastic are the changes?

Also, are there things from the search results that you need that you can't get by using it as a generator? As in, if we allowed generators to return things in order somehow would that be enough for you to only make a single call?

(In reply to Nik Everett from comment #17)

Also, are there things from the search results that you need that you can't
get by using it as a generator? As in, if we allowed generators to return
things in order somehow would that be enough for you to only make a single
call?

That's correct -- if the generator returned the results in order, then we wouldn't need to do the other call.

(In reply to Nik Everett from comment #17)

The list shouldn't drastically change but it will change some. How drastic are
the changes?

The changes don't seem to be very drastic; in some quick checking, it seems to mainly be in the ordering of the results. For example, in one query using only list=prefixsearch "Fisherman's Atlantic City Wind Farm" was at #14 while it dropped to #16 when the query was repeated, with no change in the relative order of any of the other 24 results.

Also, are there things from the search results that you need that you can't
get by using it as a generator? As in, if we allowed generators to return
things in order somehow would that be enough for you to only make a single
call?

Allowing generators to return things in order isn't likely to happen. More likely is that generator=prefixsearch would be able to add an 'index' field into the hashes inside the 'pages' hash, so the client could sort based on that field's value.

More thoughts on this are at https://www.mediawiki.org/wiki/API/Architecture_work/Planning#Allow_generators_to_provide_data

(In reply to Brad Jorsch from comment #19)

(In reply to Nik Everett from comment #17)

The list shouldn't drastically change but it will change some. How drastic are
the changes?

The changes don't seem to be very drastic; in some quick checking, it seems
to mainly be in the ordering of the results. For example, in one query using
only list=prefixsearch "Fisherman's Atlantic City Wind Farm" was at #14
while it dropped to #16 when the query was repeated, with no change in the
relative order of any of the other 24 results.

Cool. That's pretty much what I expect.

Also, are there things from the search results that you need that you can't
get by using it as a generator? As in, if we allowed generators to return
things in order somehow would that be enough for you to only make a single
call?

Allowing generators to return things in order isn't likely to happen. More
likely is that generator=prefixsearch would be able to add an 'index' field
into the hashes inside the 'pages' hash, so the client could sort based on
that field's value.
More thoughts on this are at
https://www.mediawiki.org/wiki/API/Architecture_work/
Planning#Allow_generators_to_provide_data

Cool.

Chad just SWATed our fix to prefix search relevance. You still get redirects if they are very good and there aren't any good regular pages. Try searching for "ozymandius". Its a common misspelling of Ozymandias.

Change 175759 had a related patch set uploaded (by Anomie):
API: Allow generaotrs to return data

https://gerrit.wikimedia.org/r/175759

Patch-For-Review

Matt - it seems the indexpageids are not in the right order.
And the native json parser on iOS doesn't preserve order within resulting associative arrays - just arrays. I believe this is true w/android as well.

And JavaScript. indexpageids is meant to solve this issue (it does for most API modules but apparently doesn't in this case for other reasons).

Would be perfect if the entries could be both ordered correctly and returned in order-preserving json array format. Can there be a flag that just caused results to come back array format?

I don't think we need a general API flag for this, per above (rather than an array of all the data, indexpageids is the intended solution).

In T75623#787290, @Mattflaschen wrote:

And JavaScript. indexpageids is meant to solve this issue (it does for most API modules but apparently doesn't in this case for other reasons).

Rather than preserving order, indexpageids is meant to make it easier to iterate over the pages hash for languages without native hash iteration (or with iteration that's not convenient to actually use).

I was hoping to see more detail on the reasoning behind this on T11121, but there isn't any there.

Change 175759 merged by jenkins-bot:
API: Allow generators to return data

https://gerrit.wikimedia.org/r/175759

In T75623#787290, @Mattflaschen wrote:

And JavaScript. indexpageids is meant to solve this issue (it does for most API modules but apparently doesn't in this case for other reasons).

Rather than preserving order, indexpageids is meant to make it easier to iterate over the pages hash for languages without native hash iteration (or with iteration that's not convenient to actually use).

My understanding is that it solves both (making sure you can iterate the hash in the right order by providing the IDs in a correctly ordered array).

From what I can tell, indexpageids is always in the same order as the hash keys.

Anomie added a comment.Dec 2 2014, 4:20 PM
In T75623#799592, @Mattflaschen wrote:

From what I can tell, indexpageids is always in the same order as the hash keys.

That part is true. But the order of the hash keys has never necessarily corresponded to the "input" order (whether from titles, pageids, revids, or generator). It corresponds to whatever order ApiPageSet processed things in, which in turn is generally based on the result order of a database query without explicit ORDER BY.

Deskana closed this task as Resolved.Dec 4 2014, 12:19 AM
Deskana claimed this task.

This seems to be resolved now. Thank you!