Page MenuHomePhabricator

Update RESTBase page summary endpoint to return canonical "image" info for each "thumbnail"
Closed, ResolvedPublic

Description

Background:

The iOS (and Android!) apps display various icons in the feed.

For example, see the "Chandra Bose" cell below:

Screen Shot 2016-12-01 at 4.52.04 PM.png (1×862 px, 237 KB)

( From: https://en.wikipedia.org/api/rest_v1/feed/featured/2016/12/01 )

The problem is the feed only gives us the following url for the Bose icon:

"thumbnail": {
     "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Jagadish_Chandra_Bose_1926.jpg/240px-Jagadish_Chandra_Bose_1926.jpg",
     "width": 240,
     "height": 320
},

^ Note the "240px" width.

However, the iOS app is only *displaying* the Bose icon at 40x40 pixels!

This means we are fetching the following 17K 240px icon:

Jagadish_Chandra_Bose_1926.jpg (320×240 px, 16 KB)

But we really only need to be fetching the following 1K 40px icon:

Jagadish_Chandra_Bose_1926-1.jpg (53×40 px, 1 KB)

Fetching 17K when 1K would suffice may not seem like much, but most of the smaller icons in the iOS feed have the same issue, so the effect is compounded.

Solution:

The feed has an image section which returns not only a "thumbnail" but also the canonical "image" details:

"image": {
     "title": "File:Düsseldorf, Rheinturm -- 2015 -- 8127.jpg",
     "thumbnail": {
          "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/D%C3%BCsseldorf%2C_Rheinturm_--_2015_--_8127.jpg/640px-D%C3%BCsseldorf%2C_Rheinturm_--_2015_--_8127.jpg",
          "width": 640,
          "height": 640
     },
     "image": {
          "source": "https://upload.wikimedia.org/wikipedia/commons/5/5c/D%C3%BCsseldorf%2C_Rheinturm_--_2015_--_8127.jpg",
          "width": 3648,
          "height": 3648
     },
...
},

Having both thumbnail and the canonical image details means we can easily request any width variant we need because we know, in the case above, that the max width is 3648.

So the proposed fix is to update the feed so "tfa", "mostread", and "news" sections also include canonical "image" data everywhere they return "thumbnail" data - exactly as the current "image" section above does. Then any data client can safely request exactly needed resolutions for their presentation requirements.

( In this particular case I realize we could safely choose a lower resolution and, but I'd still prefer we consistently always return canonical info so we can safely choose lower and higher resolutions, as needed. )

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Mhurd updated the task description. (Show Details)
Mhurd updated the task description. (Show Details)
Mhurd updated the task description. (Show Details)
Mhurd updated the task description. (Show Details)

I just took a closer look at the query we use for the image section and I think we can do this by using an imageinfo rather than a pageimage request when compiling the RESTBase page summary response. (And, I think you're right that we might even get the width-limited rather than width-or-height-limited sizing for free...)

Mholloway renamed this task from Update Explore feed endpoint to return canonical "image" info for each "thumbnail" to Update RESTBase summary endpoint to return canonical "image" info for each "thumbnail" .Dec 2 2016, 4:26 PM
Mholloway renamed this task from Update RESTBase summary endpoint to return canonical "image" info for each "thumbnail" to Update RESTBase page summary endpoint to return canonical "image" info for each "thumbnail" .Dec 2 2016, 4:30 PM

I would advise sticking with the "default" set of most common widths, which are currently pre-rendered at upload time:

https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L17535

Picking an unusual width for the request can drastically increase the risk of a complete miss. Which means rendering the thumbnail on the fly, going to origin in eqiad, etc.

So, yes, downloading an image closer to what you actually need is better for bandwidth usage, but if you pick something uncommon you'd be increasing latency.

That being said, the need for icon-sized images is unprecedented as far as I know, we could introduce a new tiny size to the set of pre-rendered sizes. Be aware, though, that this won't be backfilled and that for most images, it will trigger on-the-fly rendering (since obviously, the majority of images are already uploaded). If you go down that path, you should use the same width in Android and iOS, and pick something future-proof.

@Gilles Totally understand. Even though I said we could request any size we want, we'd actually be picking from these "bucket" sizes:

/**
 *  The smallest image width we will show, e.g. in search cell thumbnails.
 *
 *  There's no guarantee about image aspect ratio, so we fetch a little more and use aspect fill.
 */
WMFImageWidthExtraSmall = 60,
/**
 *  The next-smallest thumbnail we'll show, e.g. in nearby cell thumbnails.
 */
WMFImageWidthSmall = 120,
/**
 *  A medium width, e.g. POTD & lead images.
 */
WMFImageWidthMedium = 320,
/**
 *  A slightly larger width, e.g. modal gallery.
 */
WMFImageWidthLarge = 640

The problem we have now is we can't safely jump *between* these bucket sizes as needed because the feed isn't telling us the ceiling width.

While 320 and 640 are great picks for the reasons I've mentioned above, 120 and 60 particularly might have low performance that can easily go unnoticed. When you guys are testing things, you're probably visiting popular articles, or always the same selection. And therefore you're seeing images that are well cached, because someone browsed them through the app before.

Where performance really takes a hit when requesting unusual thumbnail sizes is the long tail of less visited articles. Articles where the person using the app may be the first one to consult on the app. That's when they get a complete miss that goes to origin and is rendered on the fly.

I would advise that you collect that information to figure out the ratio of misses you're getting in the wild, as it could inform either whether we need to "warm up" the image caches by rendering the small sizes the apps need for the existing corpus of images, or whether you could pick a more popular size. Or use 120 for both the 60 and 120 display. The response headers contain that information in the varnish fields.

Example of a complete miss:

age:0
x-cache-status:miss

The combination of both those headers indicates that the thumbnail had to be rendered. Age would be greater than zero if the image existed in Swift already, which is still an origin hit but at least the thumbnail wouldn't have been rendered on the fly as well.

And that was the 60px thumbnail for the current POTD on the Commons front page. Which goes to show that even an image that gets a lot of traffic can be a complete miss for the apps, because 60px is very unusual outside of the apps.

We could probably also extract the hit/miss ratio for those sizes from backend data, assuming that the iOS and Android apps have recognizable user agent strings.

I believe both the apps have been using 120px and 60px for years, so I'm not sure there's any issue there.

@Gilles are there other small and medium-ish icon sizes that you'd recommend? We can switch super easily if so.

So, to summarize, it appears we have 2 related issues.

  1. We need to know the canonical image size so we can safely/freely choose between bucket sizes (this ticket)
  2. We need to pick "good" bucket sizes which play nice with the caching infrastructure (per @Gilles)

Given that we need #1 to be able to do #2, I'd probably recommend we address #2 a separate follow-on ticket (or two - one for Android and one for iOS), if possible.

I believe both the apps have been using 120px and 60px for years, so I'm not sure there's any issue there.

It's an issue of breadth of traffic. The apps have low traffic in the grand scheme of things. But stats will give us the answer.

@Gilles are there other small and medium-ish icon sizes that you'd recommend? We can switch super easily if so.

There might not be anything better to suggest for very small stuff. I'll also look at that when investigating the hit/miss ratio and swift presence of 60 and 120. Like I've said, we might want to consider adding those to the pre-render list and backpopulating existing images to a degree.

Change 325044 had a related patch set uploaded (by Mholloway):
API: Surface dimensions when requesting original image info

https://gerrit.wikimedia.org/r/325044

@Gilles

Like I've said, we might want to consider adding those to the pre-render list and backpopulating existing images to a degree.

Makes sense. Hmm what size does the desktop portal use for it's search result thumbnails?

@Mholloway Are you sure you want to add "originalwidth" and "originalheight"? The reason to perhaps not do it that way is it is inconsistent with how the "image" section presents canonical image data - i.e. with separate "thumbnail" and "image" (canonical) keys.

@Mhurd That's so as not to break backward compatibility for existing PageImages API consumers. We can take that response and easily massage it in RESTBase and/or MCS so that the output we'd be consuming in the apps would match the one in the description here.

Edit: Disregard the comment below - Michael clarified the change I was commenting on isn't the one which actually controls the eventual endpoint output format.

@Mholloway Adding an "image" key everywhere there is a "thumbnail" key wouldn't break backward compatibility... unless I'm not understanding something... every consumer using the "thumbnail" key (and its url, width and height) could continue to do so, but if needed there would also be an "image" key (also with url, width and height) which they could check if they need info about the canonical image. (Just like the "image" section's "thumbnail" and "image".)

@Mholloway Adding an "image" key everywhere there is a "thumbnail" key wouldn't break backward compatibility... unless I'm not understanding something... every consumer using the "thumbnail" key (and its url, width and height) could continue to do so, but if needed there would also be an "image" key (also with url, width and height) which they could check if they need info about the canonical image. (Just like the "image" section's "thumbnail" and "image".)

For posterity, @Mhurd and I chatted about this on a Hangout. The structure the pageimages query emits wouldn't be the one ultimately reproduced in the feed response. I'm fine with any response format in the pageimages result that (a) is backward-compatible and (b) is agreeable to whomever owns PageImages (the Reading Web team, probably?).

Storing image data in thumbnail array concerns me a bit. It looks like we did some shortcuts some time and I'm afraid if we follow that path we might create bigger technical debt. Thumbnail is a simple object, it shouldn't be associated with original media. I really liked the initial idea where we had different keys for thumbnail and for image but from code I see we're already passing the "orignal" key and from comments I assumed that we're already expecting that property in other projects. Because of that have to follow the dirty way to be backward-compatible, even it adds more "not-object-related" properties to object definition.

I would introduce the original key:

"original": {
     "source": "https://upload.wikimedia.org/wikipedia/commons/5/5c/D%C3%BCsseldorf%2C_Rheinturm_--_2015_--_8127.jpg",
     "width": 3648,
     "height": 3648
},

For now I would leave the old original key inside thumbnail to be backwards compatible, just leave a notice that this is deprecated and annotation what to use from now. Then in some time we could remove the original from thumbnail and make everyone happy again.
Also as a small clean up I would initialize $vals['thumbnail'] = []; just after if ( $file ) {. The vals['thumbnail'] property will be created anyway (if we enter that if https://github.com/wikimedia/mediawiki-extensions-PageImages/blob/master/includes/ApiQueryPageImages.php#L189 thumbnail prop will be created or in first if, or in second if/else). Current solution creates uncessary complexity

Why I call it original not image -> because maybe in future we might serve thumbs for video/other media, then image would be confusing [also project name PageImages could be confusing but this is different story]. My idea is to have an single thumbnail interface we could use in future for everything/all media.

Thanks for the review, @pmiazga! I'll make changes as you've outlined. I agree that having an "original" property appear within the "thumbnail" object is probably not the choice I would have made initially.

Also, just to be clear, I'm not specifically aware of anyone using the "original" property, and for one reason or another it's not mentioned in the extension documentation. It's just been my experience that if it exists, someone, somewhere is relying on it...

+1 to having a separate object for the original image dimensions.

The property name original may be fine for the level of abstraction of PageImages. On the abstraction level of the Mobile Content Service, where we expose thumbnails for lead images[1] it would make more sense to call it image or originalImage since it's unclear what just original would refer to. In either case we're talking about dimensions of something visible that a thumbnail can be made of, whether it's an image or a video. If this is about an audio file then I think there should be no PageImage results unless there is album art associated with it (should be quite rare for Commons).
I just wanted to mention this so there would not be any surprises down the road.

[1] for the hero image of the article or a smaller thumbnail to be shown when the page shows up in a list: history, search, trending, news, ...

Audio files also might have a thumbnail -> a waveform

An important issue to consider when introducing data about original media dimensions, is that for multipage documents (PDF, DJVU, some TIFFs), original dimensions can vary per page. When needing to figure out the maximum width for a specific page on a PDF, knowing the cover page's dimensions is of no use, as it might be unrelated to the page you want to generate the thumbnail for.

Since multipage documents can be hundreds of pages long, this can make the description of all original dimensions quite large for those documents. That being said, in your case, I think you'd only be interested in a different size of the same thumbnail, i.e. dealing with the same page of a multipage document. So maybe returning original dimensions of the same page as the thumbnail is sufficient.

Change 325044 merged by jenkins-bot:
API: Surface dimensions when requesting original image info

https://gerrit.wikimedia.org/r/325044

@Dbrant @bearND can you sign this off (e.g. resolve) if you now consider this fixed?

There's still work to be done in RESTBase to get this info into the summaries once the PageImages change goes live next week, so we shouldn't close it just yet.

When is PageImages API change going to be deployed? This probably requires a change in RESTBase's summary endpoint to include the original property, too.

@bearND According to the release tag on 2017-01-24

To expand a bit on the deploy timing, per the regular deployment cadence the change will go out to the testwikis on 1/24 (Tues), to Hebrew and Catalan WP on 1/25 (Wed), and to the rest of the Wikipedias on 1/26 (Thurs).

@bearND any time frame for implementing this functionality in RESTBase?

The MW API change will be deployed 1/26, we can't implement it in RESTBase before that's deployed. As soon as MW API change lands I will develop the RESTBase part. Assigning to myself to avoid loosing track of it.

Yep, I will prioritize this, so expect it to be done and deployed on 1/26 after the MW train.

The group2 wikis are still running 1.29.0-wmf.8 as the deployment of -wmf.9 was blocked. See T154683: MW-1.29.0-wmf.9 deployment blockers and its subtasks for more detail.

The RESTBase changes has been deployed. The summaries will be regenerated on each request that reaches RESTBase, but first Varnish caches has to expire.

Change 387321 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/extensions/PageImages@master] Hygiene: Remove legacy 'original' property from 'thumbnail' output

https://gerrit.wikimedia.org/r/387321

Change 387321 merged by jenkins-bot:
[mediawiki/extensions/PageImages@master] Hygiene: Remove legacy 'original' property from 'thumbnail' output

https://gerrit.wikimedia.org/r/387321