API does not fail gracefully when data is too large
Open, Normal, Public

Description

Take the following query to Commons:

action => query
continue => ||revisions|categories
iicontinue => Свод_законов_Российской_империи_том_11_часть_2_(1912).djvu|20150110005436
redirects => 
iilimit => 10
iiprop => sha1|mime|user|comment|url|size|timestamp|bitdepth|metadata
rvprop => content
cllimit => 10
clshow => hidden
prop => imageinfo|revisions|metadata|categories
titles => File:Ford Falcon XR GT (15590284364).jpg|File:Свод законов Российской империи том 11 часть 2 (1912).djvu|File:Nowp vedlikeholdsutvikling - opprydning.svg|File:Ford Falcon XR GT (16026861517).jpg|File:Nowp vedlikeholdsutvikling - oppdatering.svg|File:750, chemin Sainte-Foy.jpg|File:Nowp vedlikeholdsutvikling - interwiki.svg|File:Nowp vedlikeholdsutvikling - flytting.svg|File:Ford Falcon XY Panel Van (15592830163).jpg|File:Nowp vedlikeholdsutvikling - fletting.svg

The full URL for this is here.

We get this as a response:

<continue iicontinue="Свод_законов_Российской_империи_том_11_часть_2_(1912).djvu|20150110005436" continue="||revisions|categories"/>
<warnings>
    <result xml:space="preserve">
        This result was truncated because it would otherwise be larger than the limit of 12582912 bytes
    </result>
    <query xml:space="preserve">Unrecognized value for parameter 'prop': metadata</query>
</warnings>

Notice that it returns exactly the same query continuation as it did before. This will make bots like mine, which are incapable of distinguishing one warning from another, go into an infinite loop.
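
For illustration, here is a minimal sketch of how a client could guard against this. It is not Magog_the_Ogre's actual bot code; it assumes Python 3, the third-party requests library, and the JSON output format. It watches for a continuation value that has stopped advancing together with the truncation warning, instead of re-issuing the same request forever:

import requests

API = "https://commons.wikimedia.org/w/api.php"

def run_query(params):
    # Follow API continuation, bailing out if the result stops advancing.
    params = dict(params, format="json")
    last_continue = {"continue": ""}
    while True:
        resp = requests.get(API, params={**params, **last_continue}).json()
        warning = resp.get("warnings", {}).get("result", {}).get("*", "")
        if "query" in resp:
            yield resp["query"]
        if "continue" not in resp:
            break
        if resp["continue"] == last_continue and "truncated" in warning:
            # Same continuation as before plus the truncation warning:
            # repeating the request would just loop forever, so give up.
            raise RuntimeError("result truncated; continuation is stuck")
        last_continue = resp["continue"]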

Related Objects

Magog_the_Ogre updated the task description. (Show Details)
Magog_the_Ogre raised the priority of this task from to Needs Triage.
Magog_the_Ogre added a project: MediaWiki-API.
Magog_the_Ogre added a subscriber: Magog_the_Ogre.
Anomie added a project: Multimedia. Edited Jan 13 2015, 5:37 PM
Anomie set Security to None.
Anomie added a subscriber: Anomie.

A more useful test URL is this one; the problem is that whatever metadata that DjVu file is trying to return is too large to ever be returned.
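
To make that concrete, here is a sketch that isolates the metadata request for that single file. This is an illustration only, not the test URL mentioned above, and it assumes Python 3 and the requests library. If the embedded metadata really exceeds the 12582912-byte result limit, even this one-file query comes back truncated:

import requests

resp = requests.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": "File:Свод законов Российской империи том 11 часть 2 (1912).djvu",
    },
).json()

# Expect the truncation warning and no usable metadata for this file.
print(resp.get("warnings"))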

Gilles triaged this task as Normal priority. Jan 14 2015, 5:53 PM
Gilles moved this task from Untriaged to Next up on the Multimedia board.
Gilles added a subscriber: Gilles.

Addendum: actually, I think it's impossible to continue the query at all, even if the bot understands the error. The only way to get around it is to remove the offending item from the query and start over.
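
A sketch of that workaround, under the same assumptions (Python 3 plus requests), and additionally assuming that the file named in iicontinue is the one with the oversized metadata — which matched the reports here but is not guaranteed:

import requests

API = "https://commons.wikimedia.org/w/api.php"

def fetch_dropping_stuck_file(titles, base_params):
    # titles is a list such as ["File:Foo.djvu", "File:Bar.jpg", ...]
    while titles:
        params = {**base_params, "titles": "|".join(titles),
                  "format": "json", "continue": ""}
        resp = requests.get(API, params=params).json()
        warning = resp.get("warnings", {}).get("result", {}).get("*", "")
        cont = resp.get("continue", {})
        if "truncated" in warning and "iicontinue" in cont:
            # iicontinue looks like "Some_file.djvu|20150110005436"; drop
            # that file and start the whole query over without it.
            stuck = cont["iicontinue"].split("|")[0].replace("_", " ")
            remaining = [t for t in titles if stuck not in t]
            if remaining == titles:
                return resp  # could not identify the stuck file; give up
            titles = remaining
            continue
        return resp
    return None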

Tgr added a subscriber: Tgr. Jan 15 2015, 4:15 AM

Is this specific to the imageinfo API or have you seen this behavior at other places?

Tpt added a subscriber: Tpt. Feb 13 2015, 1:49 PM
Tpt added a comment. Feb 13 2015, 2:00 PM

It looks like it is not specific to the imageinfo API. Some files on Commons seem to be affected. Example: https://commons.wikimedia.org/wiki/File:Dictionary_of_Greek_and_Roman_Geography_Volume_II.djvu

Have any PHP memory limits been changed on the Wikimedia cluster in the past few days? Has the version of DjVuLibre been updated (maybe as part of the transition to Ubuntu 14.04)?

A similar issue affected the Wikimedia cluster two years ago, when DjVuLibre was updated as part of the migration to Ubuntu 12.04 (see this change: https://gerrit.wikimedia.org/r/#/c/36632 ).

In T86611#1036611, @Tpt wrote:

It looks like it is not specific to the Imageinfo API.

The problem raised here (metadata too large to fit in the API result) may be a different issue from whatever is going on there, even though both have the same trigger: a file carrying excessively large data.

Tgr added a comment. Feb 16 2015, 10:54 PM

@Magog_the_Ogre: can you give more information on the impact of this bug? How often does it cause problems for botmasters? How hard is it to work around?

Anomie moved this task from Unsorted to Needs Code on the MediaWiki-API board. Feb 19 2015, 7:12 PM

@Tgr,

I will try here.

For example, yesterday's run of my bot encountered a lot of DjVu files with large amounts of metadata attached. I had set my bot to ailimit=500 (in order to work around T92653). I still continually got this error message:

This result was truncated because it would otherwise be larger than the limit of 12582912 bytes

Unless there is a bug in my bot (admittedly possible), it seems that the API was truncating not just the metadata but the entire query result. So I wasn't getting back other information, such as page categories, which led my bot into some strange behavior.

What behavior did you observe, specifically?

  • Another infinite loop. That's this bug, no need to repeat.
  • The query did not return the full limit's worth of results, but continuing correctly did return (or would have returned) the rest of the results without issue. That's expected behavior.
  • The query did not return the full limit's worth of results, and continuing correctly did not return (or would not have returned) the rest of the results, e.g. some were (or would have been) skipped. That seems likely to be a different bug, and links to the specific queries would be very helpful here.
  • Something else?
Gilles moved this task from Next up to Untriaged on the Multimedia board. Mar 23 2015, 8:55 AM

@Magog_the_Ogre: Could you reply to anomie's questions, please?

Gilles removed a subscriber: Gilles. Apr 23 2015, 6:42 AM

Anomie et al.: I'm very sorry. I've been sloppy and turned this bug report into a mess.

Any chance we can go back to looking at just the infinite loop?

@Anomie: It would be useful to see some progress on this bug. You mention above to @Tpt that this may or may not be the cause of his other concern.

At the Wikisources we are seeing a specific problem which can be seen at

https://en.wikisource.org/wiki/Index:EB1911_-_Volume_26.djvu

in that we pull the number of pages in a DjVu file to help us create the page index (<pagelist />). Something is failing in the output from the API: it presents something other than numbers, which results in "Error: Numeric value expected".

A properly working example is

https://en.wikisource.org/wiki/Index:EB1911_-_Volume_25.djvu

This severely affects our transcription work wherever the error occurs, to the point that such a work pretty well ceases to progress. I would hope to see some movement here; this is a clear bug needing attention.

Thanks for whatever can be done.
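
For what it's worth, the page count on its own can be requested without touching the embedded metadata blob at all. A sketch, assuming Python 3, the requests library, and that iiprop=size reports a pagecount for multi-page files; this is not necessarily how the Wikisource index pages obtain the number internally:

import requests

resp = requests.get(
    "https://en.wikisource.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "size",  # width/height/size and, for multi-page files, pagecount
        "titles": "File:EB1911 - Volume 26.djvu",
    },
).json()

page = next(iter(resp["query"]["pages"].values()))
info = page.get("imageinfo", [{}])[0]
print(info.get("pagecount"))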

Steinsplitter moved this task from Incoming to Backlog on the Commons board. Jul 30 2015, 12:53 PM
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board. Sep 4 2015, 6:25 PM
Bawolff added a subscriber: Bawolff. Nov 4 2015, 1:12 AM
GOIII added a subscriber: GOIII. Nov 4 2015, 11:59 PM
Magog_the_Ogre added a comment. Edited Feb 12 2016, 4:10 AM

This is still occurring, it seems. Sorry for being confusing in my earlier messages.

https://commons.wikimedia.org/w/api.php?list=allimages&ailimit=max&action=query&continue=-||&aistart=20160211000000&aiend=20160211235959&aisort=timestamp&aidir=newer&aiprop=user|sha1|size|dimensions|mime|timestamp|comment|metadata&aicontinue=20160211235610|Boiste_-_Dictionnaire_universel,_1851.djvu

[list] => allimages
[ailimit] => max
[action] => query
[continue] => -||
[aistart] => 20160211000000
[aiend] => 20160211235959
[aisort] => timestamp
[aidir] => newer
[aiprop] => user|sha1|size|dimensions|mime|timestamp|comment|metadata
[aicontinue] => 20160211235610|Boiste_-_Dictionnaire_universel,_1851.djvu

{
    "batchcomplete": "",
    "continue": {
        "aicontinue": "20160211235610|Boiste_-_Dictionnaire_universel,_1851.djvu",
        "continue": "-||"
    },
    "warnings": {
        "result": {
            "*": "This result was truncated because it would otherwise be larger than the limit of 12,582,912 bytes"
        }
    },
    "limits": {
        "allimages": 5000
    },
    "query": {
        "allimages": []
    }
}

This is causing two of my bot's tasks to fail, as they can't pull in the recent upload data.
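
For illustration, a defensive sketch for this case (Python 3 plus requests; the parameters mirror the request above): if a batch comes back empty together with the truncation warning, retry it once without the metadata property so the rest of the upload data still gets through. The oversized metadata itself is simply skipped.

import requests

API = "https://commons.wikimedia.org/w/api.php"

def fetch_allimages_batch(params):
    resp = requests.get(API, params={**params, "format": "json"}).json()
    warning = resp.get("warnings", {}).get("result", {}).get("*", "")
    if "truncated" in warning and not resp.get("query", {}).get("allimages"):
        # The oversized metadata crowded out the whole batch; drop it and retry.
        slim = dict(params)
        slim["aiprop"] = "|".join(
            p for p in params["aiprop"].split("|") if p != "metadata")
        resp = requests.get(API, params={**slim, "format": "json"}).json()
    return resp

batch = fetch_allimages_batch({
    "action": "query",
    "list": "allimages",
    "ailimit": "max",
    "aisort": "timestamp",
    "aidir": "newer",
    "aistart": "20160211000000",
    "aiend": "20160211235959",
    "aiprop": "user|sha1|size|dimensions|mime|timestamp|comment|metadata",
    "continue": "-||",
    "aicontinue": "20160211235610|Boiste_-_Dictionnaire_universel,_1851.djvu",
})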

Nemo_bis renamed this task from "API does not fail gracefully when data is too large to display" to "API does not fail gracefully when data is too large". Feb 12 2016, 8:35 PM

Is this 12MB+ of metadata in any way useful? The file page on Commons doesn't show any trace of it: https://commons.wikimedia.org/wiki/File:Boiste_-_Dictionnaire_universel,_1851.djvu Maybe it's corruption, either in the original file or introduced when it was imported into Commons?

Running exiftool on the file gives:

ExifTool Version Number         : 10.00
File Name                       : Boiste_-_Dictionnaire_universel,_1851.djvu
Directory                       : .
File Size                       : 96 MB
File Modification Date/Time     : 2016:04:20 19:09:17+10:00
File Access Date/Time           : 2016:04:20 19:07:02+10:00
File Inode Change Date/Time     : 2016:04:20 19:09:17+10:00
File Permissions                : rw-rw-r--
File Type                       : DJVU (multi-page)
File Type Extension             : djvu
MIME Type                       : image/vnd.djvu
Subfile Type                    : Single-page image
Image Width                     : 2133
Image Height                    : 2845
DjVu Version                    : 0.24
Spatial Resolution              : 150
Gamma                           : 2.2
Orientation                     : Horizontal (normal)
Image Size                      : 2133x2845
Megapixels                      : 6.1

Is this 12MB+ of metadata in any way useful? The file page in Commons doesn't show any trace of it

I advise reading T101400 instead, which is much clearer.

It seems that the entire text of a document may be embedded in DjVu or PDF metadata. Apart from the problem of requests failing due to the limit, it's wasteful to have to request so much data if you are only interested in a few specific fields.

Here's a request for a PDF example that fits within the limit but is still megabytes of data (it doesn't crash my browser, but no guarantees):

https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Congressional%20Record%20Volume%2081%20Part%201.pdf

Requesting metadata simultaneously for the four similar files in the category https://commons.wikimedia.org/wiki/Category:United_States_Congressional_Record_Volume_81 does exceed the limit, which is a problem for bots that process multiple files at once for efficiency.

Maybe there are tweaks that could be made to the API to work around part of the problem in the short term, e.g. accepting a list of field names to return, or allowing a maximum field data size to be specified. That wouldn't help if you actually wanted all the metadata for a file that has more than 12MB of it.

I think my problems may go away if I request commonmetadata instead of metadata.
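
A sketch of that lighter request (Python 3 plus requests). commonmetadata is the standardized, much smaller subset of the file metadata; whether it carries everything a particular bot needs is an assumption:

import requests

resp = requests.get(
    "https://commons.wikimedia.org/w/api.php",
    params={
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "iiprop": "commonmetadata|mime|size",
        "titles": "File:Congressional Record Volume 81 Part 1.pdf",
    },
).json()

for page in resp["query"]["pages"].values():
    info = page.get("imageinfo", [{}])[0]
    print(info.get("commonmetadata"))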

V4switch set Security to Software security bug. Aug 29 2016, 9:14 PM
V4switch added a project: Security.
V4switch changed the visibility from "Public (No Login Required)" to "Custom Policy".
V4switch added a subscriber: V4switch.

greg changed the visibility from "Custom Policy" to "All Users". Aug 29 2016, 9:31 PM
greg changed Security from Software security bug to None.
greg removed a project: Security.
greg changed the visibility from "All Users" to "Public (No Login Required)".
greg added a subscriber: greg.

@V4switch this is not a security issue.