
Support standard MediaWiki API continuation in wbsearchentities module / wbsearch list/generator
Open, Needs Triage (Public)

Description

As an API user, I want to use the action=wbsearchentities, action=query&list=wbsearch or action=query&generator=wbsearch APIs like any other MediaWiki API, using the standard continuation framework introduced around MediaWiki 1.25/1.26, in order to avoid having to write custom code.
As a Wikidata Query Service user, I want MWAPI to automatically follow wbsearchentities continuation, in order to work with more search results.

Problem:
SearchEntities’ current continuation support (introduced in I03991a2921, reinstated in I28a3d7aca4) predates the current MediaWiki API continuation framework (ApiBase::getContinuationManager() and various ApiContinuationManager methods). Instead, it returns a search-continue property and expects API users to turn that into the continue parameter for the next request. As a result, API clients that support automatic continuation (e.g. API Sandbox, python-mwapi, or MWAPI) can’t apply it to action=wbsearchentities, and continuation is not possible at all when using wbsearchentities with action=query (as list or generator), since search-continue is not exposed in that case.
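To illustrate the custom code that API users currently have to write, here is a minimal Python sketch of following search-continue by hand. The `fetch` callable (which takes a parameter dict and returns the parsed JSON response) and the `max_requests` safety cap are assumptions made for this example; they are not part of Wikibase or of any client library.

```python
from typing import Any, Callable, Dict, Iterator

Params = Dict[str, Any]


def iter_entity_search(
    fetch: Callable[[Params], Dict[str, Any]],
    search: str,
    language: str = "en",
    max_requests: int = 5,
) -> Iterator[Dict[str, Any]]:
    """Yield wbsearchentities results, following search-continue manually.

    `fetch` is a hypothetical helper: it sends the parameters to the API
    and returns the decoded JSON response as a dict.
    """
    params: Params = {
        "action": "wbsearchentities",
        "format": "json",
        "search": search,
        "language": language,
    }
    for _ in range(max_requests):
        response = fetch(params)
        yield from response.get("search", [])
        if "search-continue" not in response:
            break  # the API offered no further results
        # Custom step: copy search-continue into the continue parameter.
        params = {**params, "continue": response["search-continue"]}
```

A generic client that only knows the standard continuation framework has no reason to look for search-continue, which is why automatic continuation does not kick in for this module.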

Example:
Open [action=query&list=wbsearch&wbssearch=and in the API sandbox](https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&list=wbsearch&wbssearch=and) and make the request. Compare with [action=query&list=search&srsearch=and](https://www.wikidata.org/wiki/Special:ApiSandbox#action=query&format=json&list=search&srsearch=and): it has a “continue” button at the bottom to automatically apply continuation and make the next request. With entity search, this currently doesn’t work.

This query, based on one provided by User:Thomas.lumen on the Contact the development team page, fails to find works such as Q208002 or Q127367, because they’re not in the first set of search results and there’s no continuation for later search results.

Acceptance criteria:

  • action=wbsearchentities supports continuation.
  • action=query&list=wbsearch supports continuation.
  • action=query&generator=wbsearch supports continuation.
NOTE: This is an incompatible change to a stable interface and should be announced in accordance with the Stable Interface Policy. (To ease the transition, we could support both continuation methods for a short period.)

Event Timeline

I looked a bit more into this, and it turns out that SearchEntities doesn’t support continuation all that well – basically, it asks the underlying search backend for offset + limit + 1 results, then returns the [offset, offset+limit) slice of that. Clearly, this isn’t very efficient for larger and larger offsets, which is why the API won’t return offsets higher than the standard API limit (50) for continuation (source). However, it won’t stop you from specifying larger limits yourself, potentially asking the search backend for arbitrarily large numbers of results.
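The over-fetching described above can be modelled roughly as follows. This is a Python sketch of the behaviour, not the actual PHP code; `backend_search` is a hypothetical callable that returns at most the requested number of results.

```python
from typing import Callable, List, Tuple


def search_page(
    backend_search: Callable[[str, int], List[str]],
    text: str,
    offset: int,
    limit: int,
) -> Tuple[List[str], bool]:
    """Return the [offset, offset + limit) slice and a has-more flag.

    `backend_search(text, n)` is an assumed interface for this example.
    """
    # Ask for everything up to the end of the page, plus one extra result
    # whose only purpose is to signal that more results exist.
    fetched = backend_search(text, offset + limit + 1)
    page = fetched[offset:offset + limit]
    has_more = len(fetched) > offset + limit
    return page, has_more
```

With offset=9999 and limit=7, for example, the backend is asked for 10007 results just to return 7 of them.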

Fortunately, this isn’t actually a denial-of-service vulnerability, because both search backends cap the limit: ElasticSearch to 10,000 (reference), wb_terms to 2500 (source). (Depending on backend, you might get some more results than that if you search in a non-English language and don’t specify the strictlanguage parameter, in which case the search may be retried in fallback languages.) Also, in the wb_terms-backed search, that limit doesn’t directly correspond to the SQL LIMIT anyways, because getTopMatchingTerms() always searches for 2500 terms in the database and then sorts and limits them in PHP. ElasticSearch, meanwhile, supports a separate offset parameter, and we really should be using it instead of just adding our offset to the limit we pass into the search backend.

We don’t have to fix this as part of this task, but if we don’t, then continuation isn’t really appealing: we can duplicate what EntitySearch does right now, and artificially abort the continuation at the standard API limit (50), in which case it’s no better than just specifying limit=max (in fact, if you’re a bot, that would get you more results, as your max is higher); or we can always offer continuation, and requests will start taking longer and longer (in production, action=wbsearchentities&search=e&language=en&continue=9999 takes between 10 and 15 seconds to return the final result).

I think we should make use of ElasticSearch and DB capabilities for handling offsets... Though I am not sure it would not do the same thing underneath, at least we won't be loading extra results we don't need.

Well, for the DB it wouldn’t make a difference because it always gets 2500 results anyways (and then sorts and limits them in PHP – I guess applying an offset here would save a bit of memory, discarding the unneeded terms earlier), but we definitely should pass the offset to ElasticSearch, yeah.
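For comparison, passing the offset through to the backend might look like the sketch below. The three-argument `backend_search(text, offset, limit)` interface is an assumption for illustration; in Elasticsearch's search API the corresponding request parameters are `from` and `size`.

```python
from typing import Callable, List, Tuple


def search_page_with_offset(
    backend_search: Callable[[str, int, int], List[str]],
    text: str,
    offset: int,
    limit: int,
) -> Tuple[List[str], bool]:
    """Fetch one page by letting the backend skip the first `offset` results.

    `backend_search(text, offset, limit)` is a hypothetical interface.
    """
    # Only limit + 1 results are transferred, regardless of the offset;
    # the extra one tells us whether to offer continuation.
    fetched = backend_search(text, offset, limit + 1)
    return fetched[:limit], len(fetched) > limit
```

The backend may still do comparable work internally to skip the earlier results, as noted above, but the caller no longer loads results it is going to discard.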

Am I right that there is currently no way to get continuation via MWAPI from SPARQL at all? (cf. MWAPI docs, Wikibase/API docs, wbsearchentities docs)

For example, here is a query to get items with labels containing the names of historic English counties: https://w.wiki/5BpN (with the intention of then restricting it to items for food or drink).

There seems to be no way to get more than the first 50 returns per county?

> Am I right that there is currently no way to get continuation via MWAPI from SPARQL at all?

No, that’s not correct. The query service’s MWAPI supports standard MediaWiki API continuation just fine. (See also the Pagination section of the MWAPI docs you linked.) It’s the entity search API that doesn’t support it, and therefore this is a task about wbsearchentities, not about MWAPI.
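For reference, standard MediaWiki API continuation works by merging the response's continue object into the next request, which is what lets generic clients follow it without module-specific code. A minimal Python sketch, assuming a hypothetical `fetch` callable that sends the parameters and returns the parsed JSON response:

```python
from typing import Any, Callable, Dict, Iterator

Params = Dict[str, Any]


def iter_continued(
    fetch: Callable[[Params], Dict[str, Any]],
    base_params: Params,
    max_requests: int = 5,
) -> Iterator[Dict[str, Any]]:
    """Yield successive responses, following the standard continue object.

    `fetch` and the `max_requests` cap are assumptions for this example.
    """
    continuation: Params = {}
    for _ in range(max_requests):
        response = fetch({**base_params, **continuation})
        yield response
        if "continue" not in response:
            break  # no more results
        # Generic step: every key in "continue" goes into the next request,
        # whatever the module happens to call its offset parameter.
        continuation = response["continue"]
```

With list=search, for instance, the continue object carries sroffset; the loop never needs to know that name.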

Thanks for that clarification, Lucas, that's useful. So yes, I can use the "search" API instead: https://w.wiki/5BsM and successfully retrieve far more entries (albeit very slowly: the query took almost 100 seconds, just for 500 returns); but then I cannot restrict the search to just the labels of items. (I can restrict the search to titles of pages, but for Wikidata the titles appear to be Q-numbers, so that doesn't help.) So I still can't get the items I need.

(Note: I now see that the specific issue of not being able to get continuation via MWAPI actually has its own bug, T229291; but as Lucas notes there, this bug is the underlying issue.)

(2. I'm also wondering whether the general issue of not being able to get EntitySearch to do everything that standard search can (of which this bug is one aspect) is also the real issue underlying T235496, i.e. the inability to do a "nearmatch" search in any kind of search for a Wikidata label.)