
Pregenerate Wikisource-useful thumbnail sizes for multi-page media
Open, Needs Triage · Public · Feature

Description

cf. T337649

The ProofreadPage extension provides the core editing workflow for the Wikisources by presenting the thumbnail of each page in a multi-page media file (DjVu, PDF, TIFF, etc.) alongside an associated wikipage, letting contributors transcribe and format the text of that page image (typically a book page) as a wikipage. Groups of these wikipages are later transcluded into a chapter or similarly sized unit for display to readers.

Thumbor currently pregenerates a few (four?) thumbnail sizes automatically for all media files, to optimize performance and reduce hot-path load on the Thumbor+Swift stack. These are all relatively small sizes, suitable for viewing as an inline embedded image in a Wikipedia article. Thumbnail pregeneration is also hard-capped at the first 50 pages of multi-page media.

This approach makes sense for Wikipedia: that project's main unit of work is the article, its performance-critical need is small inline images for readers, and the vast majority of its multi-page media requests are likely to be for covers, title pages, frontispieces, and other content typically found somewhere in the first fifty or so pages of a book.

However, this does not serve the use case on the Wikisources at all. There, the performance-critical need for thumbnails comes from editors (Wikisource contributors), not readers, and thumbnails are critical to the main editing workflow. The thumbnail size needed is relatively large (ideally the full native resolution of the media for that page) in order to enable transcription, but only one size is needed, and every page of the multi-page media will be accessed sequentially (i.e. most accesses will be to pages beyond the first 50).

To work around this, several band-aids have been tried in various parts of the stack. On-wiki gadgets (multiple) try to guess the next image and pre-request it in JavaScript in order to force Thumbor to warm its cache. The ProofreadPage extension tried emitting prefetch headers to make the UA prefetch the next wikipage and warm its cache (which turned out to mess up the "seen" status for that page in the Watchlist).
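For illustration, here is a minimal sketch of what such a gadget does; the URL rewriting, the `.prp-page-image` selector, and the function names are assumptions for the sketch, not any particular gadget's code:

```typescript
// Sketch of the gadget-style workaround (illustrative, not a specific gadget):
// rewrite the current thumbnail URL to point at the next page and pre-request
// it, so Thumbor renders it into cache before the editor navigates there.
function prefetchNextPageThumb(currentThumbUrl: string): void {
  // Multi-page thumb URLs embed the page number, e.g. ".../page2-1024px-Foo.djvu.jpg"
  const nextUrl = currentThumbUrl.replace(
    /page(\d+)-/,
    (_match, n: string) => `page${Number(n) + 1}-`
  );
  new Image().src = nextUrl; // the request itself warms the Thumbor/Swift cache
}

// e.g. on a Page: edit view (the selector is an assumption):
const img = document.querySelector<HTMLImageElement>('.prp-page-image img');
if (img) prefetchNextPageThumb(img.src);
```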

The correct way to optimize this would be for Thumbor to pregenerate one thumbnail size suitable for proofreading, for all pages in a multi-page media file, at a size that is either predictable or discoverable by ProofreadPage so that it can make sure to request exactly that thumbnail size.
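As a rough sketch of the "predictable" half of that contract (the width and the URL rewriting below are assumptions for illustration, not an existing convention): if the pregenerated proofreading width were fixed site-wide, ProofreadPage could construct the canonical URL directly and every request would be a plain static fetch:

```typescript
// Hypothetical contract: one site-wide pregenerated proofreading width that
// ProofreadPage can rely on. Both the width and the URL rewriting are
// assumptions for illustration only.
const PROOFREAD_THUMB_WIDTH = 1024; // assumed pregenerated width

function proofreadThumbUrl(anyThumbUrl: string, page: number): string {
  // Normalize an existing thumb URL to the canonical page + pregenerated size,
  // so the request hits a pre-rendered object instead of a cold Thumbor render.
  return anyThumbUrl
    .replace(/page\d+-/, `page${page}-`)
    .replace(/\d+px-/, `${PROOFREAD_THUMB_WIDTH}px-`);
}
```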

Having the majority of page images be a simple static fetch would significantly improve interactive performance for the core Wikisource workflow.

Event Timeline

> On-wiki gadgets (multiple) try to guess the next image and pre-request it in JavaScript in order to force Thumbor to warm its cache.

@Xover Gadgets shouldn't have to guess; as of late last year, there is the [imageforpage API module](https://en.wikisource.org/wiki/Special:ApiSandbox#action=query&format=json&prop=imageforpage&titles=Page%3AWar%20and%20Peace.djvu%2F2&formatversion=2) that reliably warms up the cache and pre-renders the images that are to be used by OpenSeadragon (EIS uses this API to do its preloading and prefetching).
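For concreteness, a minimal sketch of calling that module, using only the parameters visible in the linked ApiSandbox URL (the response shape is not assumed here):

```typescript
// Sketch of a call to prop=imageforpage; parameters are taken from the
// ApiSandbox link above, everything else is generic MediaWiki API plumbing.
async function warmPageImages(pageTitle: string): Promise<unknown> {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    formatversion: '2',
    prop: 'imageforpage',
    titles: pageTitle, // e.g. 'Page:War and Peace.djvu/2'
    origin: '*',       // anonymous cross-origin access to the MediaWiki API
  });
  const res = await fetch(`https://en.wikisource.org/w/api.php?${params}`);
  return res.json(); // side effect server-side: the page's images get pre-rendered
}

warmPageImages('Page:War and Peace.djvu/2').then(console.log);
```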

I think overall this would be a good idea, but I personally want to wait until the rollout of Edit-in-sequence and see if that satisfies the current needs (EIS preloads and prefetches all sizes of the next and previous page image, and in general, in my testing, page image loads have been on par with or faster than page text loads).

> I think overall this would be a good idea, but I personally want to wait until the rollout of Edit-in-sequence and see if that satisfies the current needs

Addressing this in EIS is, in architectural terms, a layering violation: solving it in EIS solves it only for those using EIS, while addressing it in the same layer (the multimedia stack) as other pregenerated thumbnails solves it for everyone.

>> I think overall this would be a good idea, but I personally want to wait until the rollout of Edit-in-sequence and see if that satisfies the current needs

> Addressing this in EIS is, in architectural terms, a layering violation: solving it in EIS solves it only for those using EIS, while addressing it in the same layer (the multimedia stack) as other pregenerated thumbnails solves it for everyone.

I think you misunderstood what I meant here :) I want to see if the approach taken for EIS works and fulfills the needs; if so, it might make sense to implement that approach in the multimedia stack / at the ProofreadPage level rather than pre-generating all 50 * 3 images by default (since the latter would be a huge increase in the amount of computation done on Index: page creation).