
API: Purpose of "batchcomplete" property and implementation
Closed, Resolved · Public

Description

Implemented in https://gerrit.wikimedia.org/r/#/c/152359/

The intended purpose or use is not clear.

Provided release note:

> Simplified continuation will return a "batchcomplete" property in the result
> when a batch of pages is complete.

Also, it seems to be added to responses for action=watch:

{
  "batchcomplete":"",
  "watch":[
    {"title":"Main Page","unwatched":"","message":"<p>The page \"<strong class=\"selflink\">Main Page</strong>\" has been removed from <a href=\"/wiki/Special:Watchlist\" title=Special:Watchlist>your watchlist</a>.\n</p>"}
  ]
}

Are clients expected to do something with this property?

If a client is using continuation, should they do anything other than pass back the given continue value until it is exhausted?

I assume it doesn't let them influence the continuation, but one use case I can imagine is a user who isn't interested in part of the result: they might stop following the continuation chain after batchcomplete and thus not bother starting the next batch from the generator. Is that the (only) intended use?

Event Timeline

Krinkle assigned this task to Anomie.
Krinkle raised the priority of this task from to Needs Triage.
Krinkle updated the task description.
Krinkle added a project: MediaWiki-Action-API.
Krinkle changed Security from none to None.
Krinkle subscribed.

The purpose of the property is to let clients know when a batch of pages is complete. It's mainly useful along with generators, but for consistency it works the same way any time continuation is a possibility. Note that action=watch can use generators to supply the list of pages to be watched, and while it isn't likely to ever run into an incomplete batch, the consistency is probably a good idea.

For example, a query with generator=allpages and prop=links|linkshere might return 10 pages from the generator on the first query, but only return the links for the first three and half the links for the fourth, and return backlinks for the first five (but "first" sorted by namespace+title rather than by page_id). The next continuation could return more links for the fourth without finishing (if that page has lots of links) and backlinks for two more, and so on. Eventually all the links and all the backlinks for all of the 10 pages would be output, at which point "batchcomplete" is set to indicate that it's safe for the client to process these 10 pages without worrying that a future continuation will return additional data for any of these pages. The continuation from the response with "batchcomplete" will begin returning data for the next set of 10 pages from the generator.

The client doesn't have to pay any attention to "batchcomplete" if it wants to slurp in the entire result set, but if the result set is going to be very large, this may use a prohibitive amount of memory or other resources.

With the "raw" continuation where the client has to know when to ignore the continuation data returned for the generator module, this sort of batching was also expected to be handled on the client side. The new simplified continuation takes away the need for clients to know about the generator complexity, but that also means that clients have to be told about the batching.

It's not clear what the desired outcome is from this task. Is this just an instance of T2001?

Aklapper triaged this task as Lowest priority. Dec 29 2014, 10:43 PM

I'm going to tag this as Documentation and good first task and put it up for grabs. All it needs is someone expanding the text on mw:API:Query#Continuing queries to explain this.

> I'm going to tag this as Documentation and good first task and put it up for grabs. All it needs is someone expanding the text on mw:API:Query#Continuing queries to explain this.

I added a sentence there mentioning it, but it could probably use improvement by someone who has a better idea how much detail to copy from T84977#936343.

@Anomie I have added an example here to expand on the existing explanation.

I think the intended use of batchcomplete makes sense now. But it still seems confusing to me on write-requests, where the client cannot "continue" even if they wanted to.

As such, it seems the only purpose of batchcomplete on action=watch is to tell you whether or not you got all the information you asked for. However, when the answer is no, you don't have a continue-based method of getting the remaining information. Is that right?

> I think the intended use of batchcomplete makes sense now. But it still seems confusing to me on write-requests, where the client cannot "continue" even if they wanted to.
>
> As such, it seems the only purpose of batchcomplete on action=watch is to tell you whether or not you got all the information you asked for. However, when the answer is no, you don't have a continue-based method of getting the remaining information. Is that right?

I couldn't find a way to get an incomplete batch using action=watch. So I'm also not sure if there's a continue-based method of getting the remaining information on write-requests. Maybe @Anomie can clarify?

https://www.mediawiki.org/wiki/API:Query#Example_5:_Batchcomplete mentions https://en.wikipedia.org/w/api.php?action=query&list=allimages&ailimit=3&format=jsonfm&formatversion=2 as an example of batchcomplete.

It's confusing for me. What will the response look like if the batch is not complete? And how is the client supposed to combine the results from such an incomplete batch?

I can see how it works in

https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page&prop=langlinks&lllimit=40&utf8

and then

https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page&prop=langlinks&lllimit=40&utf8&utf8&llcontinue=15580374|th&continue=||

i.e., for every page ID common to the first and second response, the client has to combine their langlinks.

Q1: Could this happen for more than one item? Or is it always the last item in the incomplete response and the first item in the next response?

Q2: In the langlinks case, we have key-value pairs that are rather simple to combine; however, for the allimages query, the result is a list of mappings. There is no obvious key to compare and match incomplete pieces. Does that mean that every query will need a unique design for handling batch results?

> But it still seems confusing to me on write-requests, where the client cannot "continue" even if they wanted to.

Sure they can. The generator itself can and does issue continuation, as can be seen, for example, at https://www.mediawiki.org/wiki/Special:ApiSandbox#action=purge&format=json&generator=allpages&gaplimit=2.

It's possible that a write action might itself issue continuation, for example if it enforces a maximum time limit per query. I don't know of any that currently do this, but it could be done easily enough. In that case, batchcomplete would be false, and the module would presumably provide a mechanism for continuing.
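
For illustration, a response from that purge query would have roughly the following shape (formatversion=2; values elided, illustrative rather than captured output). Each response carries batchcomplete, and the generator's continuation rides along in continue:

{
    "batchcomplete": true,
    "continue": {
        "gapcontinue": "...",
        "continue": "gapcontinue||"
    },
    "purge": [
        { "ns": 0, "title": "...", "purged": true },
        { "ns": 0, "title": "...", "purged": true }
    ]
}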

> https://www.mediawiki.org/wiki/API:Query#Example_5:_Batchcomplete mentions https://en.wikipedia.org/w/api.php?action=query&list=allimages&ailimit=3&format=jsonfm&formatversion=2 as an example of batchcomplete.
>
> It's confusing for me. What will the response look like if the batch is not complete?

That's probably not the greatest example, since a single list module is never going to be incomplete.

Try something like https://en.wikipedia.org/w/api.php?action=query&generator=allimages&gailimit=3&prop=fileusage&fulimit=2&format=jsonfm&formatversion=2 instead.
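
With that query, an incomplete batch looks roughly like this (values elided, illustrative rather than captured output): batchcomplete is absent, and continue carries a fucontinue value so the next request fills in the remaining fileusage entries for the same three files:

{
    "continue": {
        "fucontinue": "...",
        "continue": "gaicontinue||"
    },
    "query": {
        "pages": [
            { "pageid": "...", "ns": 6, "title": "File:...", "fileusage": [ "..." ] },
            { "pageid": "...", "ns": 6, "title": "File:..." }
        ]
    }
}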

> And how is the client supposed to combine the results from such an incomplete batch?

By merging the results. For query prop modules, that generally means for each page object in the result you merge the arrays of revisions/links/etc.

> I can see how it works in
>
> https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page&prop=langlinks&lllimit=40&utf8
>
> and then
>
> https://en.wikipedia.org/w/api.php?action=query&titles=Main%20Page&prop=langlinks&lllimit=40&utf8&utf8&llcontinue=15580374|th&continue=||
>
> i.e., for every page ID common to the first and second response, the client has to combine their langlinks.

Yes, like that.

> Q1: Could this happen for more than one item? Or is it always the last item in the incomplete batch and the first item in the complete batch?

Note there can be multiple incomplete batches. Try your langlinks example with a lower lllimit.

Each prop module produces its results in an order. That order may not match the order of other prop modules, or of the generator. It probably will result in "filling" the page objects in some order, completing one page entirely before going on to the next, but even that isn't guaranteed.

> Q2: In the langlinks case, we have key-value pairs that are rather simple to combine; however, for the allimages query, the result is a list of mappings. There is no obvious key to compare and match incomplete pieces. Does that mean that every query will need a unique design for handling batch results?

As I said, allimages is a bad example since it's a list module and so can't have incomplete batches. You can merge multiple complete batches by just appending the top-level array or map/dictionary.

Prop modules are the ones that normally need merging. Here the top level under query is an array or map/dictionary of "page objects" (i.e. a map/dictionary representing one page, with the page ID, namespace, and title). The prop modules generally produce their results as an array or map/dictionary that is a property of the page object, and those get merged by appending too.
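
Concretely (illustrative values): if one response in a batch contains the page object

{ "pageid": 1, "ns": 0, "title": "A", "fileusage": [ "first entry", "second entry" ] }

and a later continuation in the same batch contains

{ "pageid": 1, "ns": 0, "title": "A", "fileusage": [ "third entry" ] }

then the client appends the fileusage arrays, ending up with a single page object holding all three entries by the time batchcomplete is reached.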

More questions ... :)

In the prop=langlinks example, the arrays of query.pages should be merged.

But in the allimages example, the mappings inside query.pages should be merged.

  1. Why does allimages need to return incomplete values? Wouldn't the process be simpler without them? I mean, without those values one could just use a simple array_merge operation, but as it is now, the items of previous incomplete responses have to be matched against the latest response (perhaps using the page IDs), and then the matched mappings have to be merged.
  2. Are other kinds of merging to be expected? For example, in the allimages query or something similar, could the split batch results require merging over both query.pages and query.pages.fileusage?
Krinkle closed this task as Resolved. Jun 28 2020, 5:29 PM

I'm closing this since the bug or confusion I perceived has been clarified, and the API now makes more sense to me. For support with the API, consider asking the support desk or the mediawiki-l list (see https://www.mediawiki.org/wiki/Communication).