Investigate API usage for Jio Set Top Box
Open, Needs Triage, Public

Description

Issue

The Jio Set Top Box is using images from Wikimedia Commons as screensaver images. The images are retrieved via the Action API. However, some images are of poor quality and/or of an undesirable size. Investigate what can be done to improve this.

Caveat: the analysis below involves parts of our stack that I haven't looked closely at before. I could be wrong about some things. Please correct me if you spot errors.

Analysis

Jio is using an API call of the following form to retrieve Commons images. Feel free to open it to see the returned JSON; it doesn't require authentication.

https://commons.wikimedia.org/w/api.php?action=query&generator=categorymembers&format=json&gcmtype=file&gcmtitle=Category:Featured_pictures_on_Wikimedia_Commons&prop=imageinfo&gcmlimit=50&iiprop=url|extmetadata&iiurlwidth=1920&continue=&gcmcontinue=
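
For anyone who wants to reproduce the request from code rather than a browser, here is a minimal sketch that assembles the same URL with Python's standard library. The function name `build_query_url` and its arguments are made up for illustration; the parameter names and values are copied from the URL above.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def build_query_url(width=1920, limit=50, gcmcontinue=""):
    """Assemble the category query shown above as a URL string.

    Hypothetical helper: only the parameter names and values come from
    the original request.
    """
    params = {
        "action": "query",
        "generator": "categorymembers",
        "format": "json",
        "gcmtype": "file",
        "gcmtitle": "Category:Featured_pictures_on_Wikimedia_Commons",
        "prop": "imageinfo",
        "gcmlimit": limit,
        "iiprop": "url|extmetadata",
        "iiurlwidth": width,
        "continue": "",
        "gcmcontinue": gcmcontinue,
    }
    # urlencode percent-encodes characters such as "|" and ":".
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url())
```

The printed URL differs from the one above only in percent-encoding (for example `%7C` in place of `|`).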

Breaking down some notable parameters:

prop=imageinfo

This tells the Action API to invoke the ApiQueryImageInfo module. Looks good.

generator=categorymembers
gcmtitle=Category:Featured_pictures_on_Wikimedia_Commons

The above two parameters restrict the search to this category. I suspect that was done in order to get interesting, non-controversial images of good quality. That seems like a good decision.

iiurlwidth=1920

This parameter is intended to cause the API to return images that are 1920 pixels wide, to fit the needs of the set top box. Based on the documentation, that's the right thing to do. The docs say this:

iiurlwidth: If iiprop=url is set, a URL to an image scaled to this width will be returned. For performance reasons if this option is used, no more than 50 scaled images will be returned.

That all looks good. iiprop=url is being set, as is gcmlimit=50. However, we see that the results include images that are not 1920 pixels wide. When I executed the call, here are two images that were returned (you may get somewhat different results, as the content on Commons is always changing):

https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/02.Trinidad_%2859%29.JPG/1920px-02.Trinidad_%2859%29.JPG
https://upload.wikimedia.org/wikipedia/commons/a/a2/04-09-12-Schaupfl%C3%BCgen-Fahrenwalde-RalfR-IMG_1232.jpg

The first of those images is 1920x1440 (good), but the second is 1600x1200 (bad, although I like the image and the guy looks like he's working really hard).

So why the discrepancy?

Let's look at the source images (which you can get from the descriptionurl value in the json response):

https://commons.wikimedia.org/wiki/File:02.Trinidad_(59).JPG
https://commons.wikimedia.org/wiki/File:04-09-12-Schaupfl%C3%BCgen-Fahrenwalde-RalfR-IMG_1232.jpg

The first, the "Trinidad" image, is originally 3,264 × 2,448 pixels. The second has an original file size of 1,600 × 1,200 pixels. Digging through the code, I found this little snippet, whose comment says:

Don't make an image bigger than the source

Well, rats. We can use this API endpoint to shrink images that are too wide down to 1920, but we can't use it to grow images that are too small up to 1920. This appears to me to be the root of the problem.
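
To make the effect concrete, here is a simplified model of a "never upscale" rule applied to the two images above. This is a sketch of the observed behavior for these two files, not MediaWiki's actual code; the function name `scaled_size` is hypothetical.

```python
def scaled_size(source_width, source_height, requested_width):
    """Return the (width, height) of a thumbnail under a "never upscale" rule.

    Simplified illustration only; not taken from the MediaWiki source.
    """
    # Clamp the requested width to the source width: shrink, never grow.
    width = min(requested_width, source_width)
    # Keep the aspect ratio when deriving the height.
    height = round(source_height * width / source_width)
    return width, height


print(scaled_size(3264, 2448, 1920))  # Trinidad image -> (1920, 1440)
print(scaled_size(1600, 1200, 1920))  # Fahrenwalde image -> (1600, 1200)
```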

Suggestions

So, what do we do about this?

OPTION ONE: modify the client (set top box) side to skip over any images in the response that are less than 1920 pixels wide. This would mean each page of results might produce fewer than 50 images of sufficient width. It is theoretically possible that a page of results might produce no images of sufficient width. However, a casual look through images in that category suggests the vast majority are wide enough, so a page with no sufficiently wide images would be unlikely.

OPTION TWO: modify the API endpoint to allow filtering by a minimum size. I suspect the way in which images are stored makes this impractical: I don't see a good way to construct an efficient database query that would give the desired results. I mostly listed this option in case someone can tell me I'm wrong.

OPTION THREE: allow smaller images to be widened. Again, I suspect we intentionally disallow this based on limitations of our image manipulation library, the fact that enlarging images reduces their quality, or both. However, I confess that this takes me into a part of our stack that I know very little about. I doubt this option is feasible, and I mostly listed it in case someone can tell me I'm wrong.

OPTION FOUR: use a different API, maybe MediaSearch, that gives happier results. I didn't see any options that would help after a look through the MediaSearch code, but I'll ask around to see if I'm missing anything.

OPTION FIVE: do something else that I haven't thought of yet.

I realize those options aren't terribly helpful yet. I'm going to ask around and see if developers more familiar with Commons image searches have further suggestions. I'll update this task with the results of those conversations.
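
As an illustration of OPTION ONE, here is a minimal client-side sketch in Python. It assumes each imageinfo entry in the response carries `thumburl` and `thumbwidth`, which the API reports when iiurlwidth is set. The function name `wide_enough` and the abbreviated `sample` response are made up for the example.

```python
def wide_enough(response, min_width=1920):
    """Return thumbnail URLs from one page of results that meet min_width."""
    # With format=json, "pages" is an object keyed by page ID.
    pages = response.get("query", {}).get("pages", {})
    urls = []
    for page in pages.values():
        for info in page.get("imageinfo", []):
            # Skip images the API could not deliver at the requested width.
            if info.get("thumbwidth", 0) >= min_width:
                urls.append(info["thumburl"])
    return urls


# Hand-written, heavily abbreviated response for illustration only.
sample = {
    "query": {
        "pages": {
            "1": {"imageinfo": [{
                "thumburl": "https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/02.Trinidad_%2859%29.JPG/1920px-02.Trinidad_%2859%29.JPG",
                "thumbwidth": 1920,
            }]},
            "2": {"imageinfo": [{
                "thumburl": "https://upload.wikimedia.org/wikipedia/commons/a/a2/04-09-12-Schaupfl%C3%BCgen-Fahrenwalde-RalfR-IMG_1232.jpg",
                "thumbwidth": 1600,
            }]},
        }
    }
}

for url in wide_enough(sample):
    print(url)
```

With this sample, only the Trinidad URL is printed; the 1600-pixel image is skipped.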

Note on tagging and subscribers: I tagged this as API Platform. Feel free to add other tags that may be relevant. I also subscribed the people from the email thread that I could find a Phab account for. Hopefully I got that right; please adjust as needed. I also subscribed Virginia, our new API Platform Product Manager, for visibility, but as it is her first week at WMF it would be unfair to expect her to do anything but follow along.

Event Timeline

I want to filter for images that are 1920x1080. Is there any API implementation that would allow only 1920x1080 and discard other variations, e.g. 1920x800, 1200x1080, 1920x940? Essentially, what I need is to get only images that are 16:9 or close to that ratio, so that they fit almost perfectly on widescreen displays.

@Vineed_Gangadharan, as far as I have been able to determine, the answer to both is no.

One possibility for getting only images of the desired size is to create a separate service, on Toolforge or elsewhere, that uses data dumps to correlate the necessary data and pregenerate a list of images of the desired size. There are then many ways this list could be made available to the set top boxes, including the service exposing an API. The service would not need to store or serve the actual images, just their URLs on Commons. Explicit images would still need to be manually identified. In this model, the service would maintain its own list of approved/excluded images in its own storage.
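
Whatever form such a service takes, its core test would be a size and aspect-ratio check. Here is a minimal sketch in Python, assuming the service knows each image's original width and height (from dumps, or from the API with `size` added to iiprop). The function name, the 5% tolerance, and the 1920-pixel minimum are illustrative assumptions, not agreed requirements.

```python
def is_near_ratio(width, height, target=16 / 9, tolerance=0.05, min_width=1920):
    """Return True if an image is wide enough and close to the target ratio.

    tolerance is the allowed relative deviation from target (0.05 = 5%).
    All defaults are illustrative assumptions.
    """
    if width < min_width or height <= 0:
        return False
    ratio = width / height
    return abs(ratio - target) / target <= tolerance


# Sizes mentioned in this thread, plus one larger 16:9 example.
for size in [(1920, 1080), (1920, 800), (1200, 1080), (1920, 940), (3840, 2160)]:
    print(size, is_near_ratio(*size))
```

Under these assumptions, 1920x1080 and 3840x2160 pass, while 1920x800, 1200x1080 and 1920x940 are rejected.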

Another alternative is to create a category and add desired images to it directly on Commons. The API does allow searching for images by category. However, I am not sure what conventions surround creating such a category - that would be a question for the community that governs Commons. And tagging images would be a largely manual process (although it would be possible to construct a helper list/tool using images from dumps).

Technical links:
Developer portal: https://developer.wikimedia.org/
(This site points to other documentation. It is a great starting point for finding technical docs if you're not sure what you need)

MediaWiki site: https://www.mediawiki.org/wiki/MediaWiki
(This is the master site for the MediaWiki software itself. There's a lot of info there. Probably too much; it is sometimes hard to find exactly what you're after.)

Wikitech: https://wikitech.wikimedia.org/wiki/Main_Page
(This site covers Wikimedia Foundation technical infrastructure)

Toolforge: https://wikitech.wikimedia.org/wiki/Portal:Toolforge
(A hosting environment for projects related to Wikimedia data/infrastructure. This might be a place to run an external service if that turns out to be part of the solution. Be aware that anything posted here is open source and forkable.)

Dumps: https://wikitech.wikimedia.org/wiki/Dumps
(Massive amounts of data dumped from WMF wikis, including Commons)

Commons API Sandbox: https://commons.wikimedia.org/wiki/Special:ApiSandbox
(Sandbox page for Action API on Commons)

Content/Community links:
Commons community portal: https://commons.wikimedia.org/wiki/Commons:Community_portal
(Good starting point, but we can probably also put Jio people directly in touch with community members if needed)

Commons help page: https://commons.wikimedia.org/wiki/Help:Contents
(wide range of info about Commons)

I'll mention one more thing about this suggestion I made in a previous comment:

Another alternative is to create a category and add desired images to it directly on Commons. The API does allow searching for images by category. However, I am not sure what conventions surround creating such a category - that would be a question for the community that governs Commons. And tagging images would be a largely manual process (although it would be possible to construct a helper list/tool using images from dumps).

Anything like that, once created, would be accessible to anyone. That could be considered a positive (anyone on the planet could help you categorize things) or a negative (they may do so using criteria that you disagree with). Just be aware that's the nature of our sites. If you need something proprietary and tightly controlled, you'll probably want to do it off-wiki. There's no reason an external service running on AWS, Jio's servers, or wherever, can't make API calls to Commons or import WMF's data dumps.

I hope that helps. I'm happy to provide more specific links and/or answer more specific questions. If I don't know an answer or I'm not the right person to talk to, I'm happy to help find the right person.

Per email discussion, I put together a rough little script that iterates over Commons images in the desired category and outputs a file of the images with the desired aspect ratio:
https://github.com/bpirkle/commons-screensaver-images

WARNING: I put this together late at night and didn't test it very well. I *think* it does what I claimed, but there could be errors/issues.

The script found 574 images of suitable size: https://gist.github.com/bpirkle/c508a2b732caf69f348e10a6596446ae

Nothing I did checks whether the images contain objectionable content. I'm not sure there's a way to do that automatically. But at least we have a list of candidate images that a human could review.
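
For readers who want to build something similar, here is a minimal sketch of the page-by-page iteration such a script needs. It is not the code from the repository linked above. It relies on the Action API returning a `continue` object whose values are merged into the next request. The names `iter_pages` and `fake_fetch` are hypothetical, and the real HTTP call is left to a caller-supplied function.

```python
def iter_pages(fetch, base_params):
    """Yield each response page, following the API's "continue" object.

    fetch is a caller-supplied function that takes a dict of request
    parameters and returns the decoded JSON response as a dict.
    """
    params = dict(base_params)
    while True:
        response = fetch(params)
        yield response
        cont = response.get("continue")
        if not cont:
            break  # No "continue" object: this was the last page.
        # Merge the continuation values into the original parameters.
        params = {**base_params, **cont}


def fake_fetch(params):
    """Stand-in for a real HTTP request: returns two canned pages."""
    if "gcmcontinue" not in params:
        return {
            "continue": {"gcmcontinue": "file|NEXT", "continue": "gcmcontinue||"},
            "query": {"pages": {"1": {}}},
        }
    return {"query": {"pages": {"2": {}}}}


for page in iter_pages(fake_fetch, {"action": "query"}):
    print(sorted(page["query"]["pages"]))
```

In real use, `fake_fetch` would be replaced by a function that performs the HTTP request and decodes the JSON body.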

I'm out of office this week, but I'll check messages when I can to see if that was helpful and/or if there's anything else I can do.