Define an official thumb API
Open, Normal, Public

Description

MediaWiki has never had an official API for retrieving thumbnailed images. Existing thumb URLs are considered private, and are not designed for use as a public API. Instead, clients are supposed to request a URL individually from the action API. This complicates API design and API response caching whenever the response references a thumb, and makes it difficult to dynamically select image size or quality in a client.

The move to content-based addressing could provide a good opportunity to address this by defining a simple and stable thumb API.

Use cases / problem statement

  • Dynamic client-side thumb size / quality selection without extra round-trips: Clients would like to adapt image size and quality to actual network and device characteristics. Most current techniques for lazy-loading and dynamic resource selection use JS. Having the ability to select the thumb size and quality in this code without extra round-trips would enable interesting performance optimizations.
  • Caching of API responses referencing thumbnails: Many cached API responses (such as summary or related pages) contain references to thumbnails. Currently, there is no way for clients to sanely select thumb sizes, which means that we either need to fragment caches on thumb size, or try to include several sizes in the response. Good caching of API responses is becoming more pressing as high-traffic features like hovercards (see T70860) are built as direct API consumers. The need to negotiate thumb sizes supported by API end points introduces delays, and reduces the design flexibility in clients.
  • Image info caching: Parsoid and other clients like VisualEditor currently need to request image information in order to render thumbnails. Those requests are very common, making imageinfo one of the most-used API entry points. Currently, the request necessarily contains the desired thumb size, which renders caching of imageinfo responses ineffective. If the response were dimension-independent instead, most of these responses could be served directly from caches. This would reduce response latency and load on the API cluster and related infrastructure.

API requirements

  • Simple selection of thumb size and quality without a need for extra API calls.
  • Avoid cache fragmentation with deterministic URLs.
  • Support for encoding complex options in a uniform and extensible manner, without breaking existing use cases or introducing non-determinism.
  • Optional support for content negotiation (e.g. client hints) in the future.
  • Support migrating to hash-based image identification at a later stage.

API proposal: Use query strings

  • /v1/someimage.jpg: Original. Returned by API end points referencing thumbnails.
  • /v1/someimage.jpg?w=220: JPG thumb 220px wide.
  • /v1/someimage.jpg?p=22&w=500: 500px thumb of page 22 in a multi-page document.
  • /v1/someimage.jpg?t=2m30s&w=220: Thumb of a video at 2m30s.
  • /v1/someimage.jpg?lang=fr&w=220: Thumb of an SVG, rendered to a PNG using French texts. We aren't explicitly mentioning the file format, so the client does not need to know that the original is an SVG, or that it is rendered to a PNG image.
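
As an illustration only, a client-side helper for constructing these URLs might look like the following Python sketch. The function name thumb_url, the KNOWN_PARAMS tuple, and the choice to emit parameters in alphabetical order (which matches the order used in the examples above) are assumptions made for this sketch, not part of the proposal.

```python
from urllib.parse import quote, urlencode

# Hypothetical parameter set, taken only from the examples above.
KNOWN_PARAMS = ("lang", "p", "t", "w")


def thumb_url(name, **params):
    """Build a /v1/ thumb URL with parameters in a fixed (sorted) order."""
    unknown = set(params) - set(KNOWN_PARAMS)
    if unknown:
        raise ValueError(f"unknown thumb parameters: {sorted(unknown)}")
    path = "/v1/" + quote(name, safe="")
    if not params:
        return path  # no parameters: the original file
    # Sorting gives every thumb exactly one URL, whatever order the caller used.
    return f"{path}?{urlencode(sorted(params.items()))}"


print(thumb_url("someimage.jpg", w=500, p=22))
```

Because the parameters are sorted before serialization, thumb_url("someimage.jpg", w=500, p=22) and thumb_url("someimage.jpg", p=22, w=500) yield the same URL, which is the property the caching requirements depend on.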

Server side requirements:

  • Audit and document the existing query string API in thumb.php. T153497
  • Add strict parameter validation. Each thumb should have only a single URL. Don't allow unknown parameters, and (generally) avoid specifying default values explicitly. Exceptions can be made for page & time offset parameters, where little actual fragmentation is expected, and consistency in the use of the parameter is important.
  • Query string order normalization in Varnish (vmod): T138093
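
The validation and normalization requirements could be sketched roughly as follows. This is a Python illustration rather than the Varnish vmod or thumb.php code: the VALIDATORS table, its regular expressions, and the function name canonical_query are all hypothetical, and the real parameter set would come out of the thumb.php audit.

```python
import re
from urllib.parse import parse_qsl, urlencode

# Hypothetical validators; the actual rules would follow the thumb.php audit.
VALIDATORS = {
    "w": re.compile(r"[1-9][0-9]*"),            # width in pixels, no leading zeros
    "p": re.compile(r"[1-9][0-9]*"),            # page number
    "t": re.compile(r"(?:[0-9]+m)?[0-9]+s"),    # time offset such as 2m30s
    "lang": re.compile(r"[a-z]{2,3}(?:-[a-z0-9]+)*"),
}


def canonical_query(raw_query):
    """Validate a thumb query string and return it in canonical (sorted) order."""
    seen = {}
    for key, value in parse_qsl(raw_query, keep_blank_values=True):
        if key not in VALIDATORS:
            raise ValueError(f"unknown parameter: {key}")
        if key in seen:
            raise ValueError(f"duplicate parameter: {key}")
        if not VALIDATORS[key].fullmatch(value):
            raise ValueError(f"invalid value for {key}: {value!r}")
        seen[key] = value
    return urlencode(sorted(seen.items()))


print(canonical_query("w=500&p=22"))
```

Rejecting values such as w=0220 rather than silently normalizing them is one way to keep a single URL per thumb; whether to reject or redirect such requests is a design choice this sketch does not settle.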

Providing original dimensions to clients

In order to accurately calculate and select thumbnail dimensions, clients need to know the original image's dimensions (where applicable). Using this information, clients can then construct a unique URL for a given thumbnail size, independent of the constraints they applied to select this size. They can also avoid content jumping around by updating image dimensions to the exact thumbnail dimensions, before the thumbnail has loaded.

Currently, MediaWiki already provides original dimensions in data-file-{width,height} attributes for MediaViewer's benefit. Some API responses referencing thumbnails include the equivalent information in JSON:

image: {
  src: "/2fd4e1c67a2d28fced849ee1bb76e7391b93eb12",
  width: 640,
  height: 480
}

We can either stick to these different formats, or consider unifying this information in the URL, either in query parameters (/someimage.jpg?oh=768&ow=1024&w=100), or separately in a fragment (/someimage.jpg?w=100#oh=768&ow=1024). The latter avoids sending back purely informational parameters to the server.
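
To make the fragment variant concrete, here is a small Python sketch of how a client might derive the dimensions to reserve before the thumbnail has loaded. It assumes the fragment layout from the example above (oh and ow for the original height and width); the function name thumb_dimensions, the no-upscaling rule, and the rounding behaviour are assumptions for illustration, and a real client would have to match whatever rounding the server uses.

```python
from urllib.parse import parse_qs, urlsplit


def thumb_dimensions(url):
    """Return the (width, height) a client could reserve for this thumb URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    info = parse_qs(parts.fragment)  # the fragment is never sent to the server
    orig_w = int(info["ow"][0])
    orig_h = int(info["oh"][0])
    width = int(query["w"][0]) if "w" in query else orig_w
    # Assumption: the server never upscales beyond the original width.
    width = min(width, orig_w)
    # Assumption: height is the aspect-preserving value rounded to a whole pixel.
    height = max(1, round(orig_h * width / orig_w))
    return width, height


print(thumb_dimensions("/someimage.jpg?w=100#oh=768&ow=1024"))
```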

Pros

  • Familiar query string syntax with wide parsing support.
  • Does not distinguish between advanced & frequently changed properties.

Cons

  • Requires custom ordered query string serialization code for both simple (size, quality) & more complex use cases.
    • The ordering requirement is subtle (requests with unordered parameters still work), so users will often ignore it, causing client-side cache fragmentation.
  • Need for general query string normalization in Varnish. Weak con, as this would be generally useful.

Options for content negotiation and selection

We would like to use modern thumbnail formats where this has a benefit for users, but need to make sure that we don't break older clients with insufficient support in the process.

The two main approaches for this content negotiation process are:

a) Server-side HTTP content negotiation, using client supplied headers like accept, and
b) client-side JS explicitly requesting specific formats.

There are pros & cons to either method, and they aren't mutually exclusive. HTTP content negotiation can work without any client-side effort for bitmap formats (ex: Chrome advertises WebP support). Client-side JS can add explicit parameters for more fine-grained control, but also has only limited information to base such decisions on.

Either way, the status quo and starting point is to serve widely supported formats (JPG and PNG) by default. We don't need to solve this question right away, and the proposed API is leaving all options for content negotiation open.
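
For reference, option a) could be sketched along these lines in Python. This is not part of the proposal: the function name pick_format, the default of image/jpeg, and the single "modern" candidate image/webp are assumptions chosen for illustration, and the Accept parsing is deliberately simplified (wildcards are ignored, so a modern format is only served when it is listed explicitly).

```python
def pick_format(accept_header, default="image/jpeg", modern=("image/webp",)):
    """Pick a response media type from an Accept header (simplified sketch)."""
    accepted = set()
    for item in accept_header.split(","):
        media_type, _, params = item.strip().partition(";")
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        if q > 0:
            accepted.add(media_type.strip().lower())
    for candidate in modern:
        if candidate in accepted:
            return candidate
    return default  # widely supported fallback for older clients


print(pick_format("image/webp,image/apng,image/*,*/*;q=0.8"))
```

Any response selected this way would also need to carry Vary: Accept so that shared caches do not hand a modern format to a client that did not ask for it.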

Deployment strategy

A change like this isn't complete without a strategy that allows us to roll out new-style thumbs gradually. To avoid performance impacts, all (simple) requests for a thumb of a given size should map to the same cache entries using either URL scheme. To this end, we can roll things out in a way that lets us *rewrite* one URL scheme to the other.

  • New to old style:
    • Simple thumbs (only width parameter specified):
      1. Prefix the image name with the width parameter value followed by 'px-'.
      2. Calculate the MD5 of the image name, and prefix the first character and the first two characters of its hex encoding to the path (ex: /7/72/).
    • Complex thumbs: Let PHP code handle the request & cache the response separately.
  • Old to new style:
    • Simple thumbs (need to check if we can determine this with a regex):
      1. Extract & strip width parameter
      2. Send a request with a query string to backend
    • Complex thumbs: Let PHP code handle the request & cache the response separately.
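
The two rewrite directions above can be sketched in Python as follows. The exact old-style path layout used here (/thumb/<x>/<xy>/<name>/<width>px-<name>), the /v1/ prefix for the new style, the function names, and the assumption that the MD5 is taken over the name exactly as it appears in the URL are all illustrative guesses; a production version would live in Varnish and would need to be checked against the real URL formats.

```python
import hashlib
import re
from urllib.parse import parse_qsl

# Assumed old-style layout: /thumb/<x>/<xy>/<name>/<width>px-<name>
OLD_SIMPLE = re.compile(
    r"^/thumb/[0-9a-f]/[0-9a-f]{2}/(?P<name>[^/]+)/(?P<w>[1-9][0-9]*)px-(?P<thumb>[^/]+)$"
)


def new_to_old(name, query):
    """Map a simple new-style request (width only) to an old-style path."""
    pairs = parse_qsl(query, keep_blank_values=True)
    if len(pairs) != 1 or pairs[0][0] != "w":
        return None  # complex thumb: leave it to the PHP code
    width = pairs[0][1]
    if not re.fullmatch(r"[1-9][0-9]*", width):
        return None
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"/thumb/{digest[0]}/{digest[:2]}/{name}/{width}px-{name}"


def old_to_new(path):
    """Map a simple old-style path to a new-style URL."""
    m = OLD_SIMPLE.match(path)
    if m is None or m["thumb"] != m["name"]:
        return None  # extra prefixes or suffixes: treat as a complex thumb
    return f"/v1/{m['name']}?w={m['w']}"


old = new_to_old("someimage.jpg", "w=220")
print(old, "->", old_to_new(old))
```

Both functions return None for anything that is not a plain width-only thumb, which corresponds to the "let PHP code handle the request" branch in the list above.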

Users / apps making assumptions about the current thumb URL format

See T153498.

Migration strategies in Varnish

  • Feasibility of rewriting majority of "simple" thumbnails (no key-value parameters) in Varnish.
  • Feasibility of avoiding redirect latency penalty by resolving redirect responses from thumb service in Varnish.

Small MediaWiki installs

Currently, MediaWiki defaults to serving thumbnails directly from an upload directory. This means that there is no PHP code involved in serving thumbnails. This is good for performance (especially without caching), but also means that on-demand generation & a parameter-based API cannot be supported out of the box. There are several options we can pursue:

  1. Start to serve all thumbs through thumb.php (or an API module). The migration to this would generally be easy (especially with the API), but we would add the overhead of serving thumbs through PHP. Authentication would be supported out of the box. Caching could eliminate the performance issue for higher volume installs, or in an appliance container install that includes Varnish.
  2. Create a new way of supporting direct file serving with 404 handler and storage based on encoded query strings. This requires an advanced web server configuration, and might not be possible with less common web servers.

At this point, the default option would be 1), but we can always optimize the setup with 2). We can provide 2) by default in a container-based distribution solution.

Details

Reference
bz64214

Event Timeline

Tgr added a comment. Feb 1 2017, 12:17 AM

One piece of feedback we have been discussing is having a higher level abstraction in addition to specifying actual width.

A similar proposal on the wikitext side is T90914: Provide semantic wiki-configurable styles for media display. IMO it's an interesting issue but too complicated and vaguely understood to include in the current RfC which has largely orthogonal concerns anyway.

Yurik removed a subscriber: Yurik. Feb 1 2017, 12:18 AM
In T66214#2988550, @Tgr wrote:

One piece of feedback we have been discussing is having a higher level abstraction in addition to specifying actual width.

A similar proposal on the wikitext side is T90914: Provide semantic wiki-configurable styles for media display. IMO it's an interesting issue but too complicated and vaguely understood to include in the current RfC which has largely orthogonal concerns anyway.

Agreed. Such a higher-level abstraction could always be implemented in a client side library, which would select the right pixel size based on the desired high-level thumbnail type (and possibly other criteria, like device properties).

Agreed. Such a higher-level abstraction could always be implemented in a client side library, which would select the right pixel size based on the desired high-level thumbnail type (and possibly other criteria, like device properties).

I was one of the engineers that discussed this higher level of abstraction with @Fjalapeno. I do believe that the Reading Web team should pursue this abstraction in the projects they maintain, which are predominantly clients, and continue to discuss pushing the abstraction further down the stack when it's better understood.

In T66214#2988550, @Tgr wrote:

One piece of feedback we have been discussing is having a higher level abstraction in addition to specifying actual width.

A similar proposal on the wikitext side is T90914: Provide semantic wiki-configurable styles for media display. IMO it's an interesting issue but too complicated and vaguely understood to include in the current RfC which has largely orthogonal concerns anyway.

I don't think it's a problem to address this outside of the current RFC.

Agreed. Such a higher-level abstraction could always be implemented in a client side library, which would select the right pixel size based on the desired high-level thumbnail type (and possibly other criteria, like device properties).

I would like to consider not putting this type of logic in a client side library. One of the advantages of sharing an API is to prevent clients from needing to implement the same logic in multiple languages/platforms.

Tgr added a comment. Feb 8 2017, 3:58 AM

I think the third-party MediaWiki concerns are somewhat understated and are not actually third party as we use the same arrangement via Swift (sorry, should have spotted that sooner).

This is the status quo:

  • default MediaWiki setup: thumbnails are only rendered on parse, client requests for arbitrary thumbnail sizes (arbitrary being anything that's not already present in the article) are not possible. Thumbnails are referred to by URLs which mirror the filename (generated by MediaHandler::makeParamString) and directory structure, so setting up the web server is as easy as configuring the image directory on disk to be the webroot for <wiki root URL>/images. This would probably have to be deprecated if we move along with the RfC (there is no point to an API which might or might not work).
  • everything through thumb.php: apart from a little syntactic sugar, this is what the RfC is going for. This is a possible configuration currently although I don't think it is used much. Every image request gets streamed through PHP (or served by some external cache like Varnish).
  • 404 handling: like the default, but requests for non-existent images are sent to thumb.php via the web server's 404 handler, and get converted from filenames back into proper key-value querystrings (via MediaHandler::parseParamString), then saved to disk so the next request to the same URL gets served by the web server directly. This is probably the common setup for non-small wikis.
  • Wikimedia setup: basically same as 404 handling, with two extra layers on top: Varnish and Swift. Varnish caches URLs and is mostly agnostic of how thumbs are served. Swift (a distributed file storage) is used the same way as the disk in the previous example: the thumb URL uses a path/filename which can be represented in a filesystem, and in case of a Varnish miss, the "web server" (in this case a small Python script) looks up the file in Swift using that path; if that fails, it forwards the request to thumb.php which transforms it into a nice query URL, renders the thumbnail, saves it to Swift then streams it.

So the proposal as it is now would duplicate thumbnail storage in Swift and degrade performance in third-party wikis that don't use a reverse proxy. It makes more sense to keep the current parameter-array -> filename-on-disk mapping and document it as standard (it's ugly and ad hoc but there are seven parameters altogether in all known extensions and almost all are mutually exclusive so meh), and reimplement it in Thumbor and the Swift router script.

GWicke added a comment. Edited Feb 8 2017, 9:37 PM

@Tgr, the concerns you raise are primarily about the implementation, and not really about the API. I think it is important to separate the two, and avoid letting relatively minor implementation concerns dictate our API design. Public APIs will stay around & affect a lot of users, while implementations can be replaced transparently.

We can definitely avoid duplication of the vast majority of thumbnails by rewriting "simple" thumbnails in Varnish (see migration strategy section). Similarly, thumbnail serving performance for small third party installs is something we can optimize by mapping requests to file names, for example as part of the container-based distribution project.

Edit: I forgot to mention that client side size selection won't work with purely static file serving in any case. At a minimum, a 404 handler would need to be configured, which is something we can do for either API, but is more complex than using thumb.php.

cscott added a subscriber: cscott. Feb 15 2017, 6:58 PM

I'll pitch T90914: Provide semantic wiki-configurable styles for media display as a better solution to the "name for different thumbnail sizes", as it generalizes that requirement and lets wikis come up with names for arbitrary groups of image properties. But I agree that should be mostly abstracted from the thumbnail service. The one exception is that the current thumbnail code arbitrarily quantizes sizes in order to reduce cache overhead. I'm not sure that is necessary to pull forward into the new service, but presumably the original quantization was added to reduce a particular cache-exhaustion DoS. Limiting the size query to a limited set of "thumbnail sizes" (or "media styles") might be one way of avoiding a blowup in the number of different versions of a thumbnail.

But better would probably be to be semi-intelligent about cache strategy, so that 1,000 different versions of a certain image, each requested once, (a) don't necessarily push other images out of the cache (cache slots are per-image, not shared), and (b) don't necessarily push out the highest-frequency cached thumbnails of that image (i.e., not just an LRU strategy).

GWicke updated the task description. Feb 21 2017, 11:34 PM

Something that's missing in the current plan, however, is the swift sharding information that is currently part of the thumb URL in production.

Do you mean the md5 of the file name? Calculating this on the fly is not that expensive. OpenSSL does 7248416 md5s over 16 byte data per second on my machine, which works out to about 0.14µs per md5.

You might not have access to the file contents when you generate a URL. What I'm saying is that you can't just drop that information from the URL, Swift needs it in some form, to hit the right shard. Because it, too, can't guess the contents of the original when you ask it to look for a thumbnail in a certain shard. And when the thumbnail is missing, it also needs to hit the right shard to fetch the original.

While some wikis don't have sharding for their originals/thumbnails storage and don't need that information, any attempt at defining a new API should take that information into account. Right now none of the new URL examples provided in the task description contain storage sharding information.

Gilles added a comment. Edited Feb 22 2017, 8:27 PM

Ah never mind, it's only the md5 of some form of the filename, not the file contents? I've only been working with the consuming side, not the producing side of that hash.

I remember that the temp path was an exception based on the filename, but didn't realize that the general hash was based on the filename as well.

Not sure why we don't just compute it in Swift's rewrite.py instead of passing it along the whole pipeline. At any rate, you can ignore my previous remark.

Tgr added a comment. Feb 23 2017, 2:54 AM

The one exception is that the current thumbnail code arbitrarily quantizes sizes in order to reduce cache overhead. I'm not sure that is necessary to pull forward into the new service, but presumably the original quantization was added to reduce a particular cache-exhaustion DoS. Limiting the size query to a limited set of "thumbnail sizes" (or "media styles") might be one way of avoiding a blowup in the number of different versions of a thumbnail.

The current API prefers some sizes over others but does not limit clients (IIRC sizes are rounded to the closest multiple of 5px but that's not much of a limitation). That preference manifests in three ways:

  • certain sizes are pre-rendered at upload ($wgUploadThumbnailRenderMap)
  • for huge images, a thumbnail of some intermediary size is used as the source for scaling, not the original (see $wgThumbnailBuckets)
  • thumbnail rendering for non-standard sizes is throttled more aggressively.

Mostly this is meant to improve performance for clients which limit themselves to predefined sizes (MediaWiki HTML pages and MediaViewer); to a smaller extent to help against accidental DoS by a combination of bot uploads of huge files + getting lots of thumbnailing requests in parallel when someone visits a category/gallery of said files. Cache exhaustion attacks were AFAIK not considered.

The cache invalidation in Varnish is currently not intelligent about standard vs. nonstandard sizes (and in Swift there is no invalidation - every thumbnail is kept forever), but if that were feasible it would certainly be nice.

I think the discussion about restricting thumbnail sizes is orthogonal to this RFC. Nothing in this RFC limits our ability to later a) prefer specific sizes in a client side library, or even b) enforce the use of a limited set of sizes by returning a standard size instead of the non-standard requested size.

GWicke updated the task description. Feb 23 2017, 6:39 PM
GWicke added a comment. Edited Feb 23 2017, 6:41 PM

Accept headers and Vary: Accept are missing from the current task description.

I added a section on this, but proposed to defer this question for now. As far as I am aware we don't need content negotiation right away, and the API proposal leaves all options open for a follow-up RFC.

Tgr added a comment. May 11 2017, 8:10 PM

We'll also need a way to display old versions of images. Clients can encounter old versions without expecting to due to FlaggedRevs hiding unreviewed image changes.

In T66214#3256693, @Tgr wrote:

We'll also need a way to display old versions of images. Clients can encounter old versions without expecting to due to FlaggedRevs hiding unreviewed image changes.

T149847: RFC: Use content hash based image / thumb URLs is designed to provide this ability, but it looks like that will still take a while. MediaWiki currently seems to prefix a timestamp to request old revisions, as in 20161203102219%21Kyoto_Station_November_2016_-02.jpg.

Either version will work with this API, since the "name" portion is treated as an opaque identifier.

GWicke moved this task from next to watching on the Services board. Jul 12 2017, 10:37 PM
GWicke edited projects, added Services (watching); removed Services (next).
ssastry moved this task from Backlog to Non-Parsoid Tasks on the Parsoid board. Jan 11 2018, 9:49 PM
kchapman added a subscriber: kchapman.

TechCom is declining because the use case is not current. This needs a new owner and use case.

There is actually one current use case still, which is the idea of having the thumb URLs be versioned instead of “current” (ref. Performance-Team), but we intend to address that at a later point with a different proposal.

The current thumbnail URL scheme could easily start including a revision number or sha1 of the original without changing the format.

TechCom is declining because the use case is not current.

Everything listed in the task's description under "Use cases / problem statement" seems current to me. See also T66214#1842437.

kchapman moved this task from Declined to Inbox on the TechCom-RFC board. Mar 21 2018, 2:14 PM

Thanks @Anomie my information might be old. Moving to TechCom-RFC Inbox for discussion.

kchapman moved this task from Inbox to Last Call on the TechCom-RFC board. Mar 22 2018, 11:45 PM

TechCom discussed this at our last meeting. The problem statement is still valid, but that doesn't mean it needs to be kept open as there is currently no resourcing for this. If there is still interest in this issue it could be used as material for a new RFC but the new RFC should contain one problem statement. Note: T149847: RFC: Use content hash based image / thumb URLs has already been broken out into a single issue.

We are moving this to last call to be declined, closing on 2018-03-29 at 1 pm PST (21:00 UTC, 22:00 CET).

@kchapman we are interested in picking this up in Reading Infrastructure, but haven't been able to get to it. We would still like to do this if we can find some time… I should know more in Q4 about feasibility/timing.

For context: We have lots of client code with workarounds for getting the right sizes of images. So much duplication and so many bugs. We want to get rid of this code in the client and instead use this much more flexible proposed API.

Tgr added a comment. Mar 23 2018, 6:42 PM

The problem here is that TechCom is using Phabricator in a different way from the rest of the movement.

The normal way is that you create one task for one concept / task, and multiple groups share that task and do their workflow management in such a way that it does not conflict with that of other groups. That means using projects or workboards (since there can be any number of those but there is only one task status). Declining a task means that it was decided that it should not be done (ie. people should be actively prevented from doing it) because it is a bad idea. Most tasks are not resourced but kept open (or stalled) nevertheless.

So if TechCom insists on their current workflow, it should make it clear that it prefers a different workflow, and for every idea that goes through the TechCom process there should be a separate idea task and a separate RfC task so that the TechCom can decline the RfC task when there is no resourcing, while the idea task can be kept open as something that's potentially still valid; and the existing RfC tasks should be split in two. Or TechCom should change their workflow to match that of everyone else on Phabricator, and use a workboard column or the removal of the TechCom tag or something similar for tracking "rejected without prejudice".

In either case please keep this task open, whether it gets resourced in the near future or not. The use cases described here are still valid, the solution proposed here still captures our best understanding of the solution space, declining it would be confusing and hamper the use of Phabricator as a technical knowledge management system.

@Tgr perhaps I was not as clear as I could have been. The other issue we see is there should be multiple RFCs broken out for that. Perhaps that means this is not an RFC, but an overall task that has RFCs linked to it.

I will bring up the process in the next TechCom meeting.

Tgr added a comment. Mar 27 2018, 10:52 PM

The other issue we see is there should be multiple RFCs broken out for that. Perhaps that means this is not an RFC, but an overall task that has RFCs linked to it.

I would disagree, I don't think this RfC can be meaningfully broken up. There are a few ideas it mentions but does not actually propose to do (hash-based identification, content negotiation) and those could be removed for clarity, but that's all I can see.

But regardless, even if the task is not a real RfC, that's not a good reason to decline it as a Phabricator task. Probably the RfC project tag should just be removed in that case.

kchapman moved this task from Last Call to Declined on the TechCom-RFC board. Mar 29 2018, 1:08 PM

@Tgr we are just putting it in the Declined TechCom-RFC workboard, not in Phabricator as a whole. For reference, this is how we approach declining RFCs now: https://phabricator.wikimedia.org/T184653

TechCom is declining at this time but will be more than happy to discuss further RFCs on this topic in the future (noting that @Fjalapeno mentioned interest in picking this up).

Tgr added a comment. Mar 30 2018, 2:08 AM

Ugh, I am really sorry, I don't know how I could misread that so badly :( I guess my eye is trained to react to the word "Declined" in Phabricator emails, without reading them properly.

@Jdrewniak shared a pointer to the https://cloudinary.com API which I thought could be a source of inspiration for this task. Be sure to check out their 57 second demo.

MaxSem removed a project: Zero. Jan 3 2019, 11:42 PM
Krinkle updated the task description.