MediaWiki has never had an official API for retrieving thumbnailed images. Existing thumb URLs are considered private and are not designed for use as a public API. Instead, clients are supposed to request each URL individually from the action API. This complicates API design and API response caching whenever the response references a thumb, and makes it difficult to dynamically select image size or quality in a client.
The move to content-based addressing could provide a good opportunity to address this by defining a simple and stable thumb API.
Use cases / problem statement
- Dynamic client-side thumb size / quality selection without extra round-trips: Clients would like to adapt image size and quality to actual network and device characteristics. Most current techniques for lazy-loading and dynamic resource selection use JS. The ability to select the thumb size and quality in this code without extra round-trips would enable interesting performance optimizations.
- Caching of API responses referencing thumbnails: Many cached API responses (such as summary or related pages) contain references to thumbnails. Currently, there is no way for clients to sanely select thumb sizes, which means that we either need to fragment caches on thumb size, or try to include several sizes in the response. Good caching of API responses is becoming more pressing as high-traffic features like hovercards (see T70860) are built as direct API consumers. The need to negotiate thumb sizes supported by API end points introduces delays, and reduces the design flexibility in clients.
- Image info caching: Parsoid and other clients like VisualEditor currently need to request image information in order to render thumbnails. Those requests are very common, making imageinfo one of the most-used API entry points. Currently, the request necessarily contains the desired thumb size, which renders caching of imageinfo responses ineffective. If the response were dimension-independent instead, most of these responses could be served directly from caches. This would reduce response latency and load on the API cluster and related infrastructure.
API requirements
- Simple selection of thumb size and quality without a need for extra API calls.
- Avoid cache fragmentation with deterministic URLs.
- Support for encoding complex options in a uniform and extensible manner, without breaking existing use cases or introducing non-determinism.
- Optional support for content negotiation (ex: client hints) in the future.
- Support migrating to hash-based image identification in a later stage.
API proposal: Use query strings
- /v1/someimage.jpg: Original. Returned by API end points referencing thumbnails.
- /v1/someimage.jpg?w=220: JPG thumb 220px wide.
- /v1/someimage.jpg?p=22&w=500: 500px thumb of page 22 in a multi-page document.
- /v1/someimage.jpg?t=2m30s&w=220: Thumb of a video at 2m30s.
- /v1/someimage.jpg?lang=fr&w=220: Thumb of an SVG, rendered to a PNG using French texts. We aren't explicitly mentioning the file format, so the client does not need to know that the original is an SVG, or that it is rendered to a PNG image.
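The example URLs above can be produced by a small client-side helper. The following is a minimal sketch, not part of the proposal itself: the function name `thumb_url` is hypothetical, and sorting parameters alphabetically by key is an assumption (it happens to match the order used in the examples above, but the proposal does not fix a canonical order).

```python
from urllib.parse import urlencode


def thumb_url(base: str, **params) -> str:
    """Build a thumb URL with query parameters in a deterministic order.

    Assumption: parameters are sorted alphabetically by key, so that the
    same set of options always yields the same URL.
    """
    # Drop unset options so that defaults are never spelled out explicitly.
    items = sorted((key, str(value)) for key, value in params.items()
                   if value is not None)
    if not items:
        # No options: this is the URL of the original.
        return base
    return base + "?" + urlencode(items)


print(thumb_url("/v1/someimage.jpg"))
print(thumb_url("/v1/someimage.jpg", w=220))
print(thumb_url("/v1/someimage.jpg", w=500, p=22))
print(thumb_url("/v1/someimage.jpg", w=220, t="2m30s"))
```

Because the helper always serializes in the same order, two clients asking for the same thumb with the same options produce byte-identical URLs, which is what keeps caches from fragmenting.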
Server side requirements:
- Audit and document the existing query string API in thumb.php. T153497
- Add strict parameter validation. Each thumb should have only a single URL. Don't allow unknown parameters, and (generally) avoid specifying default values explicitly. Exceptions can be made for page & time offset parameters, where little actual fragmentation is expected, and consistency in the use of the parameter is important.
- Query string order normalization in Varnish (vmod): T138093
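To illustrate what strict validation plus order normalization could look like, here is a rough sketch in Python. It is not the thumb.php or Varnish implementation: the allowed parameter set is taken only from the examples in this proposal, and the alphabetical canonical order and the rejection of leading zeros are assumptions made to keep one URL per thumb.

```python
from urllib.parse import parse_qsl, urlencode

# Assumption: only the parameters used in this proposal's examples exist.
ALLOWED = {"w", "p", "t", "lang"}
INTEGER_PARAMS = {"w", "p"}


def normalize_query(query: str) -> str:
    """Validate a thumb query string and return it in canonical order.

    Raises ValueError for anything that could give one thumb two URLs.
    """
    pairs = parse_qsl(query, keep_blank_values=True)
    seen = set()
    for key, value in pairs:
        if key not in ALLOWED:
            raise ValueError(f"unknown parameter: {key!r}")
        if key in seen:
            raise ValueError(f"duplicate parameter: {key!r}")
        seen.add(key)
        if value == "":
            raise ValueError(f"empty value for parameter: {key!r}")
        if key in INTEGER_PARAMS:
            # Reject "0", "0220", "+220" etc.: only one spelling per number.
            is_plain_int = value.isascii() and value.isdigit()
            if not is_plain_int or str(int(value)) != value or int(value) == 0:
                raise ValueError(f"invalid integer for {key!r}: {value!r}")
    # Assumption: canonical order is alphabetical by parameter name.
    return urlencode(sorted(pairs))
```

A cache layer could use the normalized string as the cache key, while the strict checks ensure that requests which differ only in spelling are rejected instead of being cached under a second key.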
Providing original dimensions to clients
In order to accurately calculate and select thumbnail dimensions, clients need to know the original image's dimensions (where applicable). Using this information, clients can then construct a unique URL for a given thumbnail size, independent of the constraints they applied to select this size. They can also keep content from jumping around by setting image dimensions to the exact thumbnail dimensions before the thumbnail has loaded.
Currently, MediaWiki already provides original dimensions in data-file-{width,height} attributes for MediaViewer's benefit. Some API responses referencing thumbnails include the equivalent information in JSON:
image: { src: "/2fd4e1c67a2d28fced849ee1bb76e7391b93eb12", width: 640, height: 480 }
We can either stick to these different formats, or consider unifying this information in the URL, either in query parameters (/someimage.jpg?oh=768&ow=1024&w=100), or separately in a fragment (/someimage.jpg?w=100#oh=768&ow=1024). The latter avoids sending back purely informational parameters to the server.
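As a sketch of how a client might use the fragment variant, the following Python snippet reads the requested width from the query and the original dimensions from the fragment, then derives the thumbnail height it should reserve in the layout. The function name `thumb_size` is hypothetical, and rounding to the nearest integer is an assumption; the server's actual rounding rule would need to be documented for clients to match it exactly.

```python
from urllib.parse import parse_qs, urlsplit


def thumb_size(url: str) -> tuple[int, int]:
    """Return (width, height) of the thumb described by a URL such as
    /someimage.jpg?w=100#oh=768&ow=1024.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # The fragment is never sent to the server; it is purely informational.
    info = parse_qs(parts.fragment)
    width = int(query["w"][0])
    orig_width = int(info["ow"][0])
    orig_height = int(info["oh"][0])
    # Assumption: height scales proportionally and is rounded to nearest.
    height = round(orig_height * width / orig_width)
    return width, height


print(thumb_size("/someimage.jpg?w=100#oh=768&ow=1024"))  # (100, 75)
```

With the dimensions known up front, the client can set the image element's width and height before the thumb has loaded.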
Pros
- Familiar query string syntax with wide parsing support.
- Does not distinguish between advanced & frequently changed properties.
Cons
- Requires custom ordered query string serialization code for both simple (size, quality) & more complex use cases.
- The ordering requirement is subtle: requests with unordered parameters still work, so users will often ignore it, causing client-side cache fragmentation.
- Need for general query string normalization in Varnish. Weak con, as this would be generally useful.
Options for content negotiation and selection
We would like to use modern thumbnail formats where this has a benefit for users, but need to make sure that we don't break older clients with insufficient support in the process.
The two main approaches for this content negotiation process are:
a) Server-side HTTP content negotiation, using client-supplied headers like Accept, and
b) client-side JS explicitly requesting specific formats.
There are pros & cons to either method, and they aren't mutually exclusive. HTTP content negotiation can work without any client-side effort for bitmap formats (ex: Chrome advertises WebP support). Client-side JS can add explicit parameters for more fine-grained control, but also has only limited information to base such decisions on.
Either way, the status quo and starting point is to serve widely supported formats (JPG and PNG) by default. We don't need to solve this question right away, and the proposed API is leaving all options for content negotiation open.
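For option a), a server could inspect the Accept header and fall back to the widely supported default when no modern format is explicitly listed. The sketch below is only an illustration under assumptions: the names `MODERN_FORMATS` and `pick_format` are hypothetical, the preference list is made up, and wildcards such as image/* are deliberately not treated as proof of support. Any deployment of this approach would also need caches to vary on the Accept header.

```python
# Assumption: a hypothetical server-side preference list, best first.
MODERN_FORMATS = ("image/webp",)


def accepted_types(accept_header: str) -> set:
    """Return the media types explicitly listed with a non-zero q value."""
    types = set()
    for item in accept_header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type = parts[0].lower()
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # Treat malformed q values as "not accepted".
        if media_type and quality > 0:
            types.add(media_type)
    return types


def pick_format(accept_header: str, default: str = "image/jpeg") -> str:
    """Pick a modern format only if the client names it explicitly."""
    explicit = accepted_types(accept_header)
    for candidate in MODERN_FORMATS:
        if candidate in explicit:
            return candidate
    return default
```

Older clients that send only wildcards keep receiving the default format, which matches the stated goal of not breaking clients with insufficient support.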
Deployment strategy
A change like this isn't complete without a strategy that allows us to roll out new-style thumbs gradually. To avoid performance impacts, all (simple) requests for a thumb of a given size should map to the same cache entries using either URL scheme. To this end, we can roll things out in a way that lets us *rewrite* one URL scheme to the other.
- New to old style:
  - Simple thumbs (only width parameter specified):
    - Prefix the image name with the width parameter value followed by 'px-'.
    - Calculate the MD5 of the image name, and prefix first & first two chars of hex encoding to path (ex: /7/72/).
  - Complex thumbs: Let PHP code handle the request & cache the response separately.
- Old to new style:
  - Simple thumbs (need to check if we can determine this with a regex):
    - Extract & strip width parameter
    - Send a request with a query string to backend
  - Complex thumbs: Let PHP code handle the request & cache the response separately.
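The two rewrite directions for simple thumbs can be sketched as follows. This is a Python illustration of the steps listed above, not the Varnish or PHP code that would actually perform the rewrite. The function names and the regex are hypothetical, the /v1/ prefix is taken from the examples in this proposal, and any additional path segments used by the deployed old-style layout are omitted because the steps above do not spell them out.

```python
import hashlib
import re
from urllib.parse import parse_qsl

# Assumption: an old-style simple thumb looks like /7/72/220px-Name.jpg.
OLD_SIMPLE = re.compile(r"^/([0-9a-f])/(\1[0-9a-f])/([1-9]\d*)px-([^/]+)$")


def new_to_old(path: str, query: str):
    """Rewrite a new-style simple thumb request, or return None if complex."""
    params = parse_qsl(query, keep_blank_values=True)
    if len(params) != 1 or params[0][0] != "w" or not OLD_SIMPLE_WIDTH.match(params[0][1]):
        return None  # Complex thumb: leave it to the PHP code.
    width = params[0][1]
    name = path.rsplit("/", 1)[-1]
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    # First char and first two chars of the hex digest become path prefixes.
    return f"/{digest[0]}/{digest[:2]}/{width}px-{name}"


def old_to_new(path: str):
    """Rewrite an old-style simple thumb path, or return None if complex."""
    match = OLD_SIMPLE.match(path)
    if match is None:
        return None  # Complex thumb: leave it to the PHP code.
    width, name = match.group(3), match.group(4)
    return f"/v1/{name}?w={width}"


# Widths must be plain positive integers without leading zeros.
OLD_SIMPLE_WIDTH = re.compile(r"^[1-9]\d*$")
```

Whether a regex is enough to separate simple from complex old-style thumbs is still the open question noted in the list; the pattern here only shows the shape such a check could take.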
Users / apps making assumptions about the current thumb URL format
See T153498.
Migration strategies in Varnish
- Feasibility of rewriting majority of "simple" thumbnails (no key-value parameters) in Varnish.
- Feasibility of avoiding redirect latency penalty by resolving redirect responses from thumb service in Varnish.
Small MediaWiki installs
Currently, MediaWiki defaults to serving thumbnails directly from an upload directory. This means that there is no PHP code involved in serving thumbnails. This is good for performance (especially without caching), but also means that on-demand generation & a parameter-based API cannot be supported out of the box. There are several options we can pursue:
- Start to serve all thumbs through thumb.php (or an API module). The migration to this would generally be easy (especially with the API), but we would add the overhead of serving thumbs through PHP. Authentication would be supported out of the box. Caching could eliminate the performance issue for higher volume installs, or in an appliance container install that includes Varnish.
- Create a new way of supporting direct file serving with 404 handler and storage based on encoded query strings. This requires an advanced web server configuration, and might not be possible with less common web servers.
At this point, the default would be the first option, but we can always optimize the setup with the second. We can provide the second option by default in a container-based distribution solution.