
iiurlwidth seems to remove imageinfo
Closed, Invalid (Public)

Description

When performing a query for the images on a wiki page, it seems that adding the iiurlwidth parameter causes a number of otherwise fine images to lose their imageinfo property. For example, compare the results from this query:

https://en.wikipedia.org/w/api.php?redirects=&titles=Montreal&generator=images&format=json&action=query&gimlimit=max&iiprop=url|dimensions|extmetadata|user&prop=imageinfo

To this one, which is the same except it contains iiurlwidth=800:

https://en.wikipedia.org/w/api.php?redirects=&titles=Montreal&generator=images&iiurlwidth=800&format=json&action=query&gimlimit=max&iiprop=url|dimensions|extmetadata|user&prop=imageinfo

Note that every image after the -38 image (File:Flag of the Philippines.svg) is now missing its imageinfo property, while the first link contains imageinfo for the same images.
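To see the difference programmatically, one could scan the response's "pages" object for entries lacking the imageinfo property. This is a minimal sketch; the field names follow the API's JSON format, but the sample data below is invented for illustration.

```python
# List the titles of pages in a query response that have no imageinfo
# property. The "pages" object maps page IDs (as strings) to page data.

def titles_missing_imageinfo(pages):
    """Return sorted titles of pages lacking the 'imageinfo' key."""
    return sorted(p["title"] for p in pages.values() if "imageinfo" not in p)

# Invented sample mimicking the response shape described in this task.
sample_pages = {
    "23936": {"pageid": 23936, "title": "File:Local example.jpg",
              "imageinfo": [{"url": "https://example.org/a.jpg"}]},
    "-38": {"ns": 6, "title": "File:Flag of the Philippines.svg",
            "missing": "", "imagerepository": "shared"},
}
print(titles_missing_imageinfo(sample_pages))
```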

Event Timeline

wgreenberg raised the priority of this task from to Needs Triage.
wgreenberg updated the task description. (Show Details)
wgreenberg subscribed.
Umherirrender set Security to None.

But more than 50 images are being returned; they're just devoid of metadata. Shouldn't the response JSON contain only 50 images instead of including empty results beyond that number?

Also, in the included example, only 38 images are being returned with imageinfo, not 50.

The result contains images up to the list limit. The limit for the list of images is 500 (or 5000 for sysops), so the result contains all images for the request. The extra thumburl information is only present for 50 images. The order in which the imageinfo property is set is not the same as the result order (it depends on database order). A browser search for "thumburl" gives 50 hits; just scroll to the first non-negative page ID. The 38 covers only Commons images; there are also 12 local images which have imageinfo in the first request. Lower your list limit (gimlimit) to 50 to get imageinfo for all images in the result (and then continue).
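The advice above, keep gimlimit at 50 so every image in each batch carries imageinfo and follow the API's "continue" object for the next batch, can be sketched roughly as follows. `fetch()` stands in for an HTTP GET; here it is faked with canned responses (invented page IDs and titles) so the sketch runs without network access.

```python
# Continuation loop: merge the "pages" objects from every batch, passing
# the server's "continue" object back with the next request.

BASE_PARAMS = {
    "action": "query", "format": "json", "prop": "imageinfo",
    "generator": "images", "gimlimit": "50",
    "iiprop": "url|dimensions|user", "titles": "Montreal",
    "continue": "",  # opt in to the modern continuation format
}

def all_pages(fetch):
    """Fetch batches until the server stops sending a 'continue' object."""
    params, pages = dict(BASE_PARAMS), {}
    while True:
        data = fetch(params)
        pages.update(data.get("query", {}).get("pages", {}))
        if "continue" not in data:
            return pages
        params = {**BASE_PARAMS, **data["continue"]}

# Fake two-batch exchange for demonstration (invented values).
batches = [
    {"continue": {"gimcontinue": "123|Foo.jpg", "continue": "gimcontinue||"},
     "query": {"pages": {"1": {"title": "File:A.jpg", "imageinfo": [{}]}}}},
    {"query": {"pages": {"2": {"title": "File:B.jpg", "imageinfo": [{}]}}}},
]
it = iter(batches)
pages = all_pages(lambda params: next(it))
print(sorted(p["title"] for p in pages.values()))
```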

Out of curiosity, why are some images with completely valid information given negative numbers and tagged with "missing": ""? That doesn't seem to be documented anywhere...

> Out of curiosity, why are some images with completely valid information given negative numbers and tagged with "missing": ""? That doesn't seem to be documented anywhere...

The file description page is missing locally (and missing pages get a negative number so that they have a key in the result). They have "imagerepository": "shared" because the file exists on Commons.
That is normal API handling.
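Based on that explanation, a client can tell the two cases apart from the response itself: a negative page ID with a "missing" key and "imagerepository": "shared" means the file lives on Commons but has no local description page. A minimal sketch with invented sample data:

```python
# Distinguish files that are missing locally but hosted on a shared repo
# (e.g. Commons) from ordinary local files in a query response.

def is_shared_only(page):
    """True when the description page is missing locally but the file
    itself exists on a shared repository."""
    return "missing" in page and page.get("imagerepository") == "shared"

sample = {
    "-1": {"title": "File:Commons pic.jpg", "missing": "",
           "imagerepository": "shared"},
    "42": {"title": "File:Local pic.jpg", "imagerepository": "local"},
}
print([p["title"] for p in sample.values() if is_shared_only(p)])
```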

Is this documented anywhere? The API seems difficult and unintuitive to work with, and even the #mediawiki IRC channel thought these were bugs.

One of the examples contains a sentence about missing image pages:
https://www.mediawiki.org/wiki/API:Imageinfo#Examples

> Note that the image page might be missing when the image exists on commons.

That a generator does not return the properties in "generator order" does not seem to be noted anywhere:
https://www.mediawiki.org/wiki/API:Query#Generators_and_continuation

> Is this documented anywhere? The API seems difficult and unintuitive to work with, and even the #mediawiki IRC channel thought these were bugs.

@JasperStPierre: Could you file dedicated Phabricator tasks about non-intuitive MediaWiki-Action-API Documentation? Would be very welcome. Thanks in advance!

It's so broken (continuation not described anywhere as far as I can tell, limits poorly described in weird ways, important key concepts put into Examples sections), I wouldn't even know where to begin.

Another issue here -- we're not sure how to interpret the iicontinue parameter when it contains Unicode. Should we encode to UTF-8 and then percent-encode?

Simple example: https://en.wikipedia.org/w/api.php?redirects=&titles=Pel%C3%A9&generator=images&iiurlwidth=800&action=query&gimlimit=max&iiprop=url|dimensions|extmetadata|user&prop=imageinfo

> Another issue here -- we're not sure how to interpret the iicontinue parameter when it contains Unicode. Should we encode to UTF-8 and then percent-encode?

In short: Yes.

You do the exact same thing you did for the 'titles' parameter in the example query, and what you would do for the 'text' parameter in action=edit and so on.
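In Python 3 this "encode to UTF-8, then percent-encode" step is exactly what `urllib.parse.quote` and `urllib.parse.urlencode` do by default. A minimal sketch (the iicontinue value below is invented for illustration):

```python
# Percent-encoding a Unicode continue value for a query string.
from urllib.parse import quote, urlencode

iicontinue = "Pelé_1960.jpg|20120101000000"  # invented continue value
print(quote(iicontinue, safe=""))
print(urlencode({"iicontinue": iicontinue, "titles": "Pelé"}))
```

`quote` encodes the string as UTF-8 bytes first, so "é" becomes "%C3%A9" and "|" becomes "%7C"; `urlencode` applies the same treatment to every parameter and joins them with "&".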

We simply copy/pasted links from Wikipedia and used the URLs directly for our data when we made the request. It seemed a bit awkward for the API not to give us back already-encoded proper URL fragments, but we'll fix that up in our code.

Also, even when we pass a continue parameter, we don't get thumbnail information. I take it it's because of the ~*~ completely undocumented ~*~ "filehidden" response field.

No, it's because the file content of the version in question has been revision deleted.

"filehidden" serves to indicate that the file content of the version in question was revision deleted.

> We simply copy/pasted links from Wikipedia and used the URLs directly for our data when we made the request. It seemed a bit awkward for the API not to give us back already-encoded proper URL fragments, but we'll fix that up in our code.

Except for the part where you might be using the value in a GET query string, a POST application/x-www-form-urlencoded body, or a POST multipart/form-data, and all three of those need to be encoded differently.
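The difference between two of those three transports is easy to see in Python's standard library: query-string-style percent-encoding and form-urlencoded encoding handle spaces differently, while a multipart/form-data part would carry the raw UTF-8 bytes with no percent-encoding at all. A small sketch:

```python
# Two of the three encodings mentioned above, side by side.
from urllib.parse import quote, urlencode

value = "Main page"
print(quote(value, safe=""))         # query/path style: Main%20page
print(urlencode({"titles": value}))  # form-urlencoded:  titles=Main+page
# In multipart/form-data the part body would simply be b"Main page".
```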

So, why would we see a deleted file in our imageinfo request? It seems wrong to me to return a response that includes a deleted file.

Is there documentation for this filemissing stuff, or for the encodings in URLs, at all? I can't find any.

Perhaps you don't want the deleted file, but other people might. Filter it out in your client.
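The suggested client-side filter is a one-liner: drop any imageinfo entry flagged with filehidden (i.e. whose file content was revision deleted). A minimal sketch with invented sample data:

```python
# Keep only imageinfo entries whose file content is not revision-deleted.

def visible_info(page):
    """Return the page's imageinfo entries without a 'filehidden' flag."""
    return [ii for ii in page.get("imageinfo", []) if "filehidden" not in ii]

page = {"title": "File:Example.jpg",
        "imageinfo": [{"filehidden": ""},
                      {"url": "https://example.org/x.jpg"}]}
print(visible_info(page))
```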

There isn't much documentation about the API response keys. If it were somewhere it'd be on https://www.mediawiki.org/wiki/API:Imageinfo, but see also T2001: [DO NOT USE] Documentation is out of date, incomplete (tracking) [superseded by #Documentation]

Encoding in the URL is basic web stuff, see RFC 3986.

filehidden is only possible for old image versions, so if you are getting that field something may be going wrong (because you only want the current images).

The problem is that iilimit is per image, not per request. That means you only need the default of iilimit=1, but a continue parameter is still provided to continue over the old image versions as well, which seems like a bad case here.
This is not the case when a page has more than one image; see https://en.wikipedia.org/w/api.php?redirects=&titles=Main%20page&generator=images&iiurlwidth=800&action=query&gimlimit=max&iiprop=url|dimensions|user&prop=imageinfo where no iistart is provided as a continue parameter.

Is there a way of not fetching old images, and only fetching current images? We only want the images that would be displayed on the page.

I'm not sure what iilimit is, exactly. All we want is the images that would be on a given Wikipedia page.

> Is there a way of not fetching old images, and only fetching current images?
> I'm not sure what iilimit is, exactly. All we want is the images that would be on a given Wikipedia page.

@JasperStPierre: https://www.mediawiki.org/wiki/API:Imageinfo says "iilimit: How many image revisions to return (1 by default)" but not sure if that helps?