
page generators can truncate responses when there is excessive metadata (e.g. DjVu/PDF files)
Closed, ResolvedPublic

Description

I am using the CategorizedPageGenerator method on a Wikimedia Commons category to get a list of all its members (and, recursively, its subcategory members). I am using it on 'Category:Media_contributed_by_the_Digital_Public_Library_of_America' with the following code:

import pywikibot
from pywikibot import pagegenerators
site = pywikibot.Site('commons', 'commons')
cat = pywikibot.Category(site, 'Category:Media_contributed_by_the_Digital_Public_Library_of_America')
for file in pagegenerators.CategorizedPageGenerator(cat, recurse=True, namespaces='6'):
    ...  # [does stuff]

When I run this, I get the following warning repeatedly:
WARNING: API warning (result): This result was truncated because it would otherwise be larger than the limit of 12,582,912 bytes.

I am afraid this means I can never access the full result set, and, presumably, anyone using page generators on page sets that include many or large PDF/DjVu files will also never be able to access all the pages. Something similar has been reported at T195992, but that task is a bit confusing, because the reporter appears to have been trying to exclude files from the query anyway and only wanted category names. I actually do want all the files.

The discussion at T101400 is clarifying: the cause of this warning is likely that a large amount of data can be returned when iiprop=metadata is requested for, say, a PDF with a text layer, or, in the case of a Commons category, for potentially 500 such files at once (I assume it requests the maximum by default?).
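
To illustrate, the warning can be reproduced outside Pywikibot with a raw API query that combines generator=categorymembers with prop=imageinfo and metadata in iiprop, which is essentially what Pywikibot sends. This is a minimal sketch using the requests library against the public Commons API endpoint (the truncation shows up as a warnings key in the response):

import requests

# Sketch: request imageinfo metadata for a full batch of category members,
# the same prop/iiprop combination the page generator uses internally.
API = 'https://commons.wikimedia.org/w/api.php'
params = {
    'action': 'query',
    'format': 'json',
    'generator': 'categorymembers',
    'gcmtitle': 'Category:Media contributed by the Digital Public Library of America',
    'gcmnamespace': 6,
    'gcmlimit': 'max',
    'prop': 'imageinfo',
    'iiprop': 'timestamp|user|url|size|sha1|metadata',
}
data = requests.get(API, params=params).json()
# A truncated result carries a top-level 'warnings' object.
print(data.get('warnings', 'no warnings'))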

The problem is that, for my use case, I really just want page titles, but I guess since Pywikibot wants to generate all the page objects using all the metadata, there is no way around this error currently. Since T89971 has been open for years and appears stalled, I wonder if there is a way to solve this within Pywikibot. For example, when it receives this warning, could Pywikibot back off and retry with successively smaller gcmlimit values (or whatever parameter it uses) until the response fits under the 12 MB limit? Or could there be a filter option to turn off image metadata when it is not actually needed?

Event Timeline

The problem is that, for my use case, I really just want page titles, but I guess since Pywikibot wants to generate all the page objects using all the metadata, there is no way around this error currently.

CategorizedPageGenerator does not load the page content by default. To get it, pass content=True as a parameter or call page.get() explicitly later.
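
A short sketch of the difference, reusing the cat object from the description:

# Listing only; the wikitext of each member page is not fetched.
gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True, namespaces=6)

# Preload the wikitext together with the listing.
gen_preloaded = pagegenerators.CategorizedPageGenerator(
    cat, recurse=True, namespaces=6, content=True)

# Or fetch a single page's text on demand later: text = page.get()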

there can be a large amount of data returned ... in the case of a Commons category, potentially 500 of them at once

By default the maximum query increment is used, which depends on your user group membership. You can decrease the maximum query increment with the step parameter in your user-config.py; it is also available as the global -step option if your script calls pywikibot.handle_args() to process command line options (refer to the basic.py sample script for an example).
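
For example (the value 50 is just an illustration):

# In user-config.py:
step = 50

# Or on the command line, if the script forwards its arguments to
# pywikibot.handle_args() as basic.py does:
#   python pwb.py basic.py -step:50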

@Dominicbm: could you verify whether decreasing the query increment has any effect? It could help to implement a generic solution for this issue.

This comment was removed by Dominicbm.

The problem is that, for my use case, I really just want page titles, but I guess since Pywikibot wants to generate all the page objects using all the metadata, there is no way around this error currently.

CategorizedPageGenerator does not load the page content by default. To get it, pass content=True as a parameter or call page.get() explicitly later.

It's not the page content, but the image metadata. If I look at the request in debug mode, even without content=True, metadata is always included in iiprop, as you can see below, so you can get an over-12 MB response for certain media files.

API request to commons:commons (uses get: False):
Headers: {'Content-Type': 'application/x-www-form-urlencoded'}
URI: '/w/api.php'
Body: 'gcmtitle=Category%3AMedia+contributed+by+the+Digital+Public+Library+of+America&gcmprop=ids%7Ctitle%7Csortkey&gcmtype=page%7Cfile&prop=info%7Cimageinfo%7Ccategoryinfo&inprop=protection&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata&iilimit=max&generator=categorymembers&action=query&indexpageids=&continue=&gcmnamespace=6&gcmlimit=1&meta=userinfo&uiprop=blockinfo%7Chasmsg&maxlag=5&format=json'

I put step lower, at 50, as you suggested, and was still getting the warning. So now I am trying with 1, but presumably there will always be a possibility of an API response that is too large as long as the metadata is always retrieved, since nothing on the Wikimedia end prevents a single media file's metadata from exceeding 12 MB. Consider the case of a 500-page PDF with an OCR text layer, for example. Here is a single file for which the API response amounts to almost 3 MB because of all the image metadata. A single file four times that size, or just four such files in a single set of results, and you're getting truncated responses.
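
As a quick way to see how much of a single file's response is metadata, one can compare the imageinfo response size with and without the metadata component. A sketch with the requests library; 'File:Example.pdf' is only a placeholder for a large DjVu/PDF with a text layer:

import requests

API = 'https://commons.wikimedia.org/w/api.php'

def response_size(iiprop):
    # Return the size in bytes of the imageinfo response for one file.
    params = {
        'action': 'query',
        'format': 'json',
        'titles': 'File:Example.pdf',  # placeholder title
        'prop': 'imageinfo',
        'iiprop': iiprop,
    }
    return len(requests.get(API, params=params).content)

print('without metadata:', response_size('timestamp|user|url|size|sha1'))
print('with metadata:   ', response_size('timestamp|user|url|size|sha1|metadata'))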

I am also concerned about the performance implications of being forced to use such a low query increment, since I am applying it across the whole result set and not just to the requests that were truncated. The category I was attempting to run this on is expected to have at least 700,000 members, for example, and changing step for the whole thing drastically increases the number of requests required to complete it.

I guess what I was asking for was an argument similar to the content= one you mentioned, but where I can pass something like img_metadata=False. If I am understanding the cause correctly, this would solve the problem for me and anyone else with this issue (unless you really do want the image metadata, in which case you're out of luck in a way Pywikibot can't solve without T86611 getting fixed).
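
In the meantime, a workaround for the titles-only use case is to skip the page generator's imageinfo request entirely and list the category members with a plain list=categorymembers query, building page objects lazily from the titles. A sketch (non-recursive; subcategories would need their own calls, and the continuation loop is the standard MediaWiki pattern):

import requests
import pywikibot

API = 'https://commons.wikimedia.org/w/api.php'
site = pywikibot.Site('commons', 'commons')
params = {
    'action': 'query',
    'format': 'json',
    'list': 'categorymembers',
    'cmtitle': 'Category:Media contributed by the Digital Public Library of America',
    'cmnamespace': 6,
    'cmtype': 'file',
    'cmlimit': 'max',
}
while True:
    data = requests.get(API, params=params).json()
    for member in data['query']['categorymembers']:
        page = pywikibot.FilePage(site, member['title'])
        ...  # [does stuff]
    if 'continue' not in data:
        break
    params.update(data['continue'])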

Update: using the default settings, the generator takes 2-3 hours to complete for this large category. With step at 1, it was still unfinished 2 days later when I killed it, so that isn't really feasible. I could experiment with other increments between 1 and 50, but I already know that the only values low enough to prevent truncation and lost pages will make the operation take too long to complete, since 50 was already very slow and still getting truncated.

I think we have the following options (which are TBC):

  1. do not fetch metadata by default, only on request
  2. repeat the request that triggered the warning, halving step until it succeeds (see the sketch after this list)
  3. repeat the request that triggered the warning with metadata removed
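
A rough sketch of what option 2 could look like; the names here (fetch_batch, TruncationWarningError) are hypothetical stand-ins, not existing Pywikibot internals:

class TruncationWarningError(Exception):
    """Placeholder for 'the API reported a truncated result'."""

def fetch_with_backoff(fetch_batch, limit):
    # Retry the same request, halving the query increment each time the
    # truncation warning is reported.
    while limit >= 1:
        try:
            return fetch_batch(limit)
        except TruncationWarningError:
            limit //= 2
    # Even a single-item request was truncated (e.g. one huge metadata blob).
    raise RuntimeError('result truncated even with limit=1')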

I'd prefer to be able to turn the metadata off if I don't need it, because it is an enormous amount of data for no purpose if you're not interested in it, and backing off:

  • will slow things down due to retries
  • might not be needed (e.g. if one file is really big and the rest are tiny, you back off down to 1 even though you didn't have to for the rest of the files)
  • still doesn't save you if that one file's metadata is still too big

Given that for many of these files the metadata WOULD include a substantial text layer that is only relevant to specific tools (like Fae's copyright message hunter on Commons, or Proofread Page at Wikisource), would it be technically feasible to make the text layer something that is NOT supplied unless explicitly requested? Having to request it explicitly would mean changes to a small number of tools, as opposed to rewriting many tools if the default were to exclude the metadata entirely.

Alternatively, as has been suggested earlier, an option for content/title pairs only would be useful...

An alternative, off-the-wall thought...

How feasible would it be to have a Pywikibot page generator module that utilises a pre-existing PetScan (or PagePile) query?

A relevant PetScan query (example: https://petscan.wmflabs.org/?psid=18645382) should only contain the titles needed, bypassing the need to access the image metadata at all...
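
A sketch of what feeding a saved PetScan result into Pywikibot could look like. The psid is the one from the example above; the JSON layout used to pull out the page list is an assumption and may need adjusting to what the endpoint actually returns (titles may also need the namespace prefix re-added):

import requests
import pywikibot
from pywikibot import pagegenerators

resp = requests.get('https://petscan.wmflabs.org/',
                    params={'psid': 18645382, 'format': 'json', 'doit': 1})
pages = resp.json()['*'][0]['a']['*']  # assumed location of the page list
titles = [p['title'].replace('_', ' ') for p in pages]

site = pywikibot.Site('commons', 'commons')
for page in pagegenerators.PagesFromTitlesGenerator(titles, site=site):
    ...  # [does stuff]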

Change 817318 had a related patch set uploaded (by Mpaa; author: Mpaa):

[pywikibot/core@master] [IMPR]: do not load metadata by default for imageinfo

https://gerrit.wikimedia.org/r/817318

Change 826242 had a related patch set uploaded (by Xqt; author: Mpaa):

[pywikibot/core@master] [IMPR]: lazy load imageinfo metadata

https://gerrit.wikimedia.org/r/826242

Change 980007 had a related patch set uploaded (by Mpaa; author: Mpaa):

[pywikibot/core@master] [IMPR]: lazy load imageinfo metadata

https://gerrit.wikimedia.org/r/980007

Change 980007 merged by Xqt:

[pywikibot/core@master] [IMPR]: lazy load imageinfo metadata

https://gerrit.wikimedia.org/r/980007

Xqt claimed this task.
Xqt reassigned this task from Xqt to Mpaa.

Change 817318 abandoned by Mpaa:

[pywikibot/core@master] [IMPR]: do not load metadata by default for imageinfo

Reason:

Done here https://gerrit.wikimedia.org/r/c/pywikibot/core/+/980007

https://gerrit.wikimedia.org/r/817318

Change 826242 abandoned by Xqt:

[pywikibot/core@master] [IMPR]: lazy load imageinfo metadata

Reason:

https://gerrit.wikimedia.org/r/826242

I have the exact same problem. I just needed an easy way to loop through the PDF files in a category, and it cannot be done because of this bug... Why wasn't the patch implemented?

The patch was done with rPWBCaa683df and shipped with Pywikibot 8.6.
@Mpaa: Isn't this solved?

@Ninovolador: What is your Pywikibot release? Could you please describe the issue a bit more, because there was a change in Pywikibot 8.6 which should have solved it.