
page generators can truncate responses when there is excessive metadata (e.g. DjVu/PDF files)
Open, Needs Triage, Public

Description

I am using the CategorizedPageGenerator method on a Wikimedia Commons category to get a list of all its members (and, recursively, subcategory members). I am using it on 'Category:Media_contributed_by_the_Digital_Public_Library_of_America' with the following code:

import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site('commons', 'commons')
cat = pywikibot.Category(site, 'Category:Media_contributed_by_the_Digital_Public_Library_of_America')
for file in pagegenerators.CategorizedPageGenerator(cat, recurse=True, namespaces='6'):
    ...  # [does stuff]

When I run this, I get the following warning repeatedly:
WARNING: API warning (result): This result was truncated because it would otherwise be larger than the limit of 12,582,912 bytes.

I am afraid this means I cannot ever access the full result set, and, presumably, anyone trying to use page generators for page sets that include large or numerous PDF or DjVu files will also never be able to access all the pages. I see something similar has been reported at T195992, but that task is a bit confusing, because the reporter appears to have been trying to exclude files from the query anyway, and just wanted category names. I actually do want all files.

The discussion at T101400 is clarifying, since it seems the cause of this warning is likely that there can be a large amount of data returned when iiprop=metadata is requested for, for example, a PDF with a text layer—or, in the case of a Commons category, potentially 500 of them at once (I'm assuming it requests the max, by default?).

The problem is that, for my use case, I really just want page titles, but I guess since Pywikibot wants to generate all the page objects using all the metadata, there is no way around this error currently. Since T89971 has been around for years and appears stalled, I wonder if there is a way to solve this in Pywikibot. For example, if it receives this warning, could Pywikibot back up and retry with successively smaller gcmlimit values (or whatever parameter its method uses) until it gets under the 12 MB response limitation? Or, could there be a filter option to turn off image metadata, if it is not actually necessary for the user's needs?
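The back-off idea could be sketched roughly as follows. Note that `fetch` and the warning check here are hypothetical stand-ins for whatever request machinery Pywikibot uses internally, not real Pywikibot APIs:

```python
def fetch_with_backoff(fetch, limit=500, min_limit=1):
    """Retry a category-members query, halving the page limit until the
    response is no longer truncated.

    `fetch(limit)` is a hypothetical callable that performs one API request
    with gcmlimit=limit and returns the decoded JSON response.
    """
    while limit >= min_limit:
        response = fetch(limit)
        # MediaWiki reports truncation as a 'result' warning in the response.
        result_warning = response.get("warnings", {}).get("result", {})
        if "truncated" not in result_warning.get("*", ""):
            return response
        limit //= 2  # back up and retry with a smaller increment
    raise RuntimeError("response truncated even at the minimum limit")
```

A possible refinement would be to restore the original limit after the oversized batch has been worked through, so that only the problematic stretch of the category pays the cost of small increments.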

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. · May 25 2020, 9:14 PM
Xqt added a subscriber: Xqt. · May 26 2020, 7:50 AM

The problem is that, for my use case, I really just want page titles, but I guess since Pywikibot wants to generate all the page objects using all the metadata, there is no way around this error currently.

CategorizedPageGenerator does not load the content by default. To get it you have to use content=True as a parameter or use an explicit page.get() later.

there can be a large amount of data returned ... in the case of a Commons category, potentially 500 of them at once

By default the maximum query increment of data is retrieved, which depends on the user's group membership. You may decrease the maximum query increment with the step parameter in your user-config.py; this parameter is also available as the global -step option if your script calls pywikibot.handle_args() for all command line options (refer to the basic.py sample script for that).
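For example, lowering the increment via the configuration file could look like this (a sketch; the value 50 is an arbitrary example):

```python
# user-config.py: cap each API query increment at 50 items per request
step = 50
```

Alternatively, for scripts that call pywikibot.handle_args(), the same effect should be achievable with -step:50 on the command line.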

@Dominicbm: could you verify whether decreasing the query increment has any effect. It could help to implement a generic solution for this issue.

This comment was removed by Dominicbm.

The problem is that, for my use case, I really just want page titles, but I guess since Pywikibot wants to generate all the page objects using all the metadata, there is no way around this error currently.

CategorizedPageGenerator does not load the content by default. To get it you have to use content=True as a parameter or use an explicit page.get() later.

It's not the page content, but the image metadata. If I look at the request in debug mode, metadata is always included in iiprop even without content=True, as you can see below, so you can get an over-12 MB response for certain media files.

API request to commons:commons (uses get: False):
Headers: {'Content-Type': 'application/x-www-form-urlencoded'}
URI: '/w/api.php'
Body: 'gcmtitle=Category%3AMedia+contributed+by+the+Digital+Public+Library+of+America&gcmprop=ids%7Ctitle%7Csortkey&gcmtype=page%7Cfile&prop=info%7Cimageinfo%7Ccategoryinfo&inprop=protection&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue=&gcmnamespace=6&gcmlimit=1&meta=userinfo&uiprop=blockinfo%7Chasmsg&maxlag=5&format=json'
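Decoding that URL-encoded body with the standard library makes it easy to confirm that metadata is requested unconditionally:

```python
from urllib.parse import parse_qs

# The request body from the debug output above, verbatim.
body = (
    "gcmtitle=Category%3AMedia+contributed+by+the+Digital+Public+Library+of+America"
    "&gcmprop=ids%7Ctitle%7Csortkey&gcmtype=page%7Cfile"
    "&prop=info%7Cimageinfo%7Ccategoryinfo&inprop=protection"
    "&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata"
    "&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue="
    "&gcmnamespace=6&gcmlimit=1&meta=userinfo&uiprop=blockinfo%7Chasmsg"
    "&maxlag=5&format=json"
)
params = parse_qs(body, keep_blank_values=True)
iiprop = params["iiprop"][0].split("|")
# 'metadata' appears in iiprop even though content was never requested.
print(iiprop)
```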

I lowered step to 50, as you suggested, and was still getting the warning. So now I am trying with 1, but presumably there will always be a possibility of an API response that is too large as long as the metadata is always retrieved, since there is nothing on the Wikimedia end that prevents a single media file's metadata from exceeding 12 MB. Consider the case where you have a 500-page PDF with an OCR text layer, for example. Here is a single file in which the API response amounts to almost 3 MB because of all the image metadata. A single file 4 times that size, or just 4 such files in a single set of results, and you're getting truncated responses.
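The arithmetic bears this out: against the 12,582,912-byte response limit, files carrying ~3 MB of metadata each (the example figure from the comment above) leave room for only a handful per request:

```python
LIMIT = 12_582_912          # API response size limit in bytes (12 MiB)
PER_FILE = 3 * 1024 * 1024  # ~3 MiB of metadata for one PDF with a text layer

files_per_request = LIMIT // PER_FILE
print(files_per_request)    # → 4: a fifth such file would trigger truncation
```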

I am also concerned what the performance implication will be for being forced to use such a low query increment, since I am applying it across the whole results set and not just the requests that were truncated. The category I was attempting to run this on is expected to have at least 700,000 members, for example. Changing step for the whole thing drastically increases the number of requests required to complete it.

I guess what I was asking for was an argument similar to the content= one you mentioned, but where I can do img_metadata=False. If I am understanding the cause here, this would solve the problem for me and anyone else with the issue (unless you really want the image metadata, but then you're out of luck in a way Pywikibot can't solve, without T86611 getting fixed).
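At the API-parameter level, such an option would amount to something like the following. The function name and the img_metadata option itself are hypothetical, not existing Pywikibot features; this only illustrates the transformation being requested:

```python
def strip_image_metadata(params):
    """Hypothetical filter: drop 'metadata' from the iiprop list of a
    MediaWiki API parameter dict, roughly what an img_metadata=False
    option could do before the request is sent."""
    if "iiprop" in params:
        props = [p for p in params["iiprop"].split("|") if p != "metadata"]
        params = {**params, "iiprop": "|".join(props)}
    return params
```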

Update: using the default command, the generator takes 2-3 hours to complete for this large category. With step at 1, it was still unfinished 2 days later when I killed it, so that wasn't really feasible. I can experiment with other increments between 1 and 50, but I already know enough to say that any value low enough to prevent truncation and lost pages will make the operation take too long to complete, since 50 was already very slow and still getting truncated.

Mpaa added a subscriber: Mpaa. · Edited · Thu, Oct 1, 8:11 PM

I think we have the following options (which are TBC):

  1. do not include metadata by default, but only on request
  2. when the truncation warning is received, repeat the request, halving step until it succeeds
  3. when the truncation warning is received, repeat the request without metadata