
Preload the next page's image on proofreading view
Closed, ResolvedPublicFeature

Description

The Polish Wikisource preloads the next page's image while you are proofreading the current one, which speeds up load time significantly, especially on lower bandwidth. It's implemented with a local gadget (not enabled by default for now).
https://wikisource.org/wiki/MediaWiki:Gadget-preload-prp-page-image.js

You could call it over-eager loading or prefetching. Alternatively, if thumbnail generation is indeed the biggest part of the waiting time (especially for larger PDF/DjVu files, and at times of day when the imagescalers are busier), it could be enough to request/generate the thumbnail server-side without loading it on the client, so that it's ready in the cache.

Wikisource users at https://wikimania.wikimedia.org/wiki/2019:Transcription/Wrap-up and https://wikisource.org/wiki/Wikisource:Best_Practices have agreed this would be a feature worth implementing by default.

Event Timeline

For context, the "slowness" (75th percentile of processing wallclock time) varies quite wildly for ghostscript and djvu https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?orgId=1&panelId=11&fullscreen&from=now-7d&to=now : 2-3 s average and 10-15 s max is rather underwhelming.

Incidentally, the correct place to solve this is in ProofreadPage because it has knowledge of what the next page in the sequence is. However, I suspect it would need facilities from core to do this intelligently: standard prefetching requires injecting a <link rel="prefetch" href="…" /> in the page, preferably before the browser sees it. ProofreadPage could of course "cheat" the way this gadget does (good version 1.0 / proof of concept / minimum viable product goal?), but that'd be kinda complicated, inelegant, overly specific, and prone to breaking over time.
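
For illustration, the gadget-style "cheat" boils down to injecting the hint from client-side JavaScript once the next image's URL is known (a minimal sketch, not the actual gadget's code; deriving that URL from the ProofreadPage DOM is the complicated, fragile part):

```js
// Minimal sketch of the client-side "cheat" (illustrative, not the
// gadget's actual code). Deriving the next page's thumbnail URL from
// the ProofreadPage DOM is the brittle, overly specific part.
function prefetchNextPageImage( nextImageUrl ) {
	var link = document.createElement( 'link' );
	link.rel = 'prefetch';
	link.href = nextImageUrl;
	document.head.appendChild( link );
}
```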

On the upside, combining this with async (AJAX) save and preview could potentially give literal orders of magnitude improvement in perceived performance in the critical workflow for Wikisource!

This is a great idea! Adding this prefetch should be indeed fairly easy on the ProofreadPage side.

Change 550568 had a related patch set uploaded (by Tpt; owner: Tpt):
[mediawiki/extensions/ProofreadPage@master] Adds a rel=prefetch for the next Page: page

https://gerrit.wikimedia.org/r/550568

Tpt changed the subtype of this task from "Task" to "Feature Request". Nov 14 2019, 9:19 PM

Change 550568 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Adds a rel=prefetch for the previous and next Page: page

https://gerrit.wikimedia.org/r/550568

Thanks!

For context, the "slowness" (75th percentile of processing wallclock time) varies quite wildly for ghostscript and djvu https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?orgId=1&panelId=11&fullscreen&from=now-7d&to=now : 2-3 s average and 10-15 s max is rather underwhelming.

That graph is being deleted as it has not been updated for over a year, so I'm archiving a screenshot here.

Screenshot_20201130_120944.png (archived screenshot of the Grafana thumbnail-latency graph, 124 KB)

Change 704888 had a related patch set uploaded (by Samwilson; author: Samwilson):

[mediawiki/extensions/ProofreadPage@master] Add prefetch links for prev/next page images

https://gerrit.wikimedia.org/r/704888

I think we should use preload rather than prefetch, because it more reliably triggers an actual request for the image (which kicks off the thumbnail rendering). In my testing, prefetch doesn't always do anything (I think because it waits for browser idle time). I seem to remember we had a conversation about this some years ago, but I can't find it now nor remember the reasoning.

If I remember correctly, preload asks the browser to aggressively fetch content for the current page's rendering, while prefetch asks the browser to load content for future use in navigation.

Yes, indeed, preload will work better for images because they are going to be saved in the browser cache and so reused for the next pages, and we are nearly sure they will be loaded.
However, using it might slow down navigation on lower network speeds. It is also possible that some browsers still preload even when a data-saving mode is turned on. So, it might even be detrimental for people on slow or pricey connections.

So, I would be more in favor of closely following the spec and using prefetch ("preemptively fetching and caching the specified resource is likely to be beneficial") rather than preload ("the user agent must preemptively fetch and cache the specified resource"), to be sure we don't harm the users we want to help in the process.

The main goal here is to perform server-side page thumbnail generation in advance. If this can be done without preloading, that is OK; I am not sure whether prefetch would provide this.
Thumbnail generation has been identified as the most time-consuming part of the "next page" loading process, so making the next thumbnail ready server-side is the main goal.

Other client-side benefits are less important and can likely be neglected. IMO, only users with low-speed network connections gain significantly from loading the next page's image while the current page is being rendered or has already been rendered. Loading the next image should not delay page rendering in the browser (assuming the client has enough resources to render the page and load the next image simultaneously).

Concerning the cost of paid connections: what is the main use case of Page namespace pages? The extra cost is high if people load a single page or a few pages at random. If they load 10 pages in a series, the preloading cost is 10% (11 instead of 10 images loaded). If they load 100 consecutive pages, the cost is 1%.

At the moment the feature might be useless on mobile, as the Page namespace is hardly usable on small screens.

But as I mentioned: the main goal is to generate the thumbnail for the next page and make it ready for download server-side. Anything else is secondary.

@Ankry Thank you! So, I guess that the main goal is to make sure that the image thumbnail is already generated when a Page: page is displayed.

The File::transform method allows setting the option File::RENDER_NOW. It might be relevant to set it when rendering a Page: page; this way, we would make sure that the image rendering is started even before the HTML is returned to the client. But File::RENDER_NOW seems to generate the image in the current thread, so it would significantly slow down HTML generation if we naively ran it when rendering the HTML. A possible alternative is to register a ThumbnailRenderJob so that a rendering job is launched early. But that might add quite a lot of weight to the job queue without much gain if the job queue has more than a few seconds of lag.

If the thumbnail generation cannot be requested asynchronously, then (IMO), to avoid delaying presentation of the current page's HTML to the user (and to use the time while the page is being downloaded/rendered by the browser for thumbnail generation), it is better to request the next thumbnail's generation from some JavaScript code invoked after the page is already displayed in the browser. This is the approach the current pl.ws/it.ws gadgets seem to take.

If you think that prefetch may also give this result in some cases, that would be OK too. But both only if we do not need to worry about race conditions in thumbnail generation (or, e.g., the risk of being blocked due to too many requests).

Note that an HTTP HEAD request should be enough to generate (and, crucially, cache) the thumbnail; the client doesn't have to download the image data.
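
As a concrete sketch (thumbUrl is an assumed variable holding the next page's thumbnail URL, not something from the task):

```js
// Warm the server-side thumbnail cache without transferring the image
// body: a HEAD request makes Thumbor render and cache the thumbnail,
// while only the response headers come back to the client.
fetch( thumbUrl, { method: 'HEAD' } ).catch( function () {
	// Best-effort optimisation: ignore network errors.
} );
```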

I think we're overcomplicating this.

Page: namespace in edit mode isn't usable on anything smaller than a tablet (so 99.9% of cellphone users are irrelevant, and the 0.1% left have more pressing problems than the extra thumbnail). That eliminates most scenarios where outlier slow connection speeds or metered connections are factors. As Ankry notes above, except in pathological cases, at most one image will be loaded unnecessarily. And the typical size of a page thumbnail is so tiny that if you load the front page of a single news site you'll download many times that in JavaScript alone.

Heck, even compared to the weight of a single wiki page the thumbnails ain't all that. A typical 1024px thumbnail is maybe 300–400kB, but a quick check suggests we pull down 450kB of CSS and JS on every single page load, even on pages that aren't heavy with templates and TemplateStyles.

Meanwhile, preloading the page image warms the caches of both Thumbor and the web browser, reducing latency in the hot path for interactive use by orders of magnitude (seconds to milliseconds) for however many pages they proofread in a session. In particular, it potentially eliminates users having to wait for the pathological cases where Ghostscript takes 20+ seconds to render a single thumbnail.

Absent a concrete real-world example of a problem created by preloading the next page image, I don't think it's worth worrying about.

I agree in principle: thumbnail downloading is not a major cause for concern in the context. But JS caching does actually mean that the total JS transferred on navigating "next page" is roughly zero. For an un-logged-in user (i.e. one without a user JS full of AJAX requests), there are only three uncached requests on loading the next page: the main GET (~10KB), 2kB of "intake-analytics" and the image (generally 100-400kB).

That said, MediaWiki:Gadget-Preload_Page_Images.js does indeed download the entire image to the client, which, as you say, has the benefit of warming the client image cache too and "front-loading" the network transfer into the slack time while the user is faffing about on the current page. (An HTTP HEAD will warm the Thumbor cache, but the client still needs to request, transfer, and render the image later.)

If there is truly a concern here, gate it behind a mobile-skin check and let mobile users (if there even are any in Page NS) eat the thumbnail+network+client render delay. That's trivial enough on both server and client side. But mobile users won't really be on this page until we make Page NS functional on mobile, so ¯\_(ツ)_/¯.

I think I agree with you all! I'm still in favour of preload rather than prefetch, because I think it has the most benefit for the most users. Some users will waste data, perhaps, but in the scheme of the modern web I think an average of under half a MB is nearly nothing. If the width param of an Index page is large, of course, this could increase. It feels like we could implement this, and see how it fares in real-world use, and iterate as required.

I had a play with sending a HEAD request for the next image, and that does seem to work (i.e. is quick to do, and subsequent requests are quicker than they otherwise would be), but ultimately I think it's not a good idea to write our own code to do this when browsers are supposed to support this exact use case.

One question: do we want to preload the previous page's image as well? We're already prefetching the previous page. (Personally, I like to work backwards in a work, so find it useful, but I'm perhaps in a minority.)

> @Ankry Thank you! So, I guess that the main goal is to make sure that the image thumbnail is already generated when a Page: page is displayed.
>
> The File::transform method allows setting the option File::RENDER_NOW. It might be relevant to set it when rendering a Page: page; this way, we would make sure that the image rendering is started even before the HTML is returned to the client. But File::RENDER_NOW seems to generate the image in the current thread, so it would significantly slow down HTML generation if we naively ran it when rendering the HTML. A possible alternative is to register a ThumbnailRenderJob so that a rendering job is launched early. But that might add quite a lot of weight to the job queue without much gain if the job queue has more than a few seconds of lag.

On Wikimedia sites, thumbnail rendering is handled by Thumbor, not MediaWiki. In this case MediaWiki does not know or care about the thumbnails; it just makes URLs. Thumbnail rendering begins when the user agent requests the image (if the requested thumbnail is not already in the Varnish HTTP cache or the Swift storage).
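
For illustration, a DjVu page thumbnail URL on Wikimedia sites looks roughly like the one below (the pattern is recalled from memory, so treat it as illustrative); the first request for such a URL is what makes Thumbor render it:

```js
// MediaWiki only constructs the URL; Thumbor renders the thumbnail the
// first time something actually requests it (absent a cache hit).
var thumbUrl = 'https://upload.wikimedia.org/wikipedia/commons/thumb/' +
	'a/ab/Example.djvu/page17-1024px-Example.djvu.jpg'; // illustrative
```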

> One question: do we want to preload the previous page's image as well? We're already prefetching the previous page. (Personally, I like to work backwards in a work, so find it useful, but I'm perhaps in a minority.)

There's no real harm in that IMO. Whichever way the user is navigating, the last page they were on will still be in the cache. So they will preload only one extra image per "straight run" of pages:

Editing 2, 3, 4:

1 2 3 4 
| ^ ^ ^
\ "wasted"

Editing 12, 11, 10:

10 11 12 13
 ^  ^  ^  |
          \ "wasted"

Moreover, it's quite likely that image will eventually be used in a session anyway (I know I personally bounce all over the place), so it will help that page be snappier too.

> … an average of under half a MB is nearly nothing.

Half a meg per proofreading session, not per page load, let's be very clear. 0.5 MB actually wasted per page load would be kinda bad (expect the Performance Team Ninja Hit Squad to kick down your door), but since it's only the last preloaded image in a session that is actually wasted, by 10 pages the overhead (≤ 50 kB/page avg.) begins to be negligible, and above that it completely disappears in the random variability of other factors. And this is probably the worst of the realistic scenarios: most page thumbnails are not going to be 500 kB+.

> do we want to preload the previous page's image as well? We're already prefetching the previous page. (Personally, I like to work backwards in a work, so find it useful, but I'm perhaps in a minority.)

My impression is that this is a minority workflow, so if we're worried about the data overhead the cost-benefit might not work out. Personally I think this interaction is so critical on Wikisource that optimising it easily justifies trading off bandwidth efficiency on the order we're talking about here (if I ever get around to stealing Alex's awesome eis I would absolutely set up not just preloading but even prerendering of several pages in both directions).

Thank you @Ankry, @Xover, @Inductiveload! So I guess rel="preload" seems the way to go for the next page image. What about using rel="prefetch" for the previous one? That way, it has a good chance of being loaded too, but with lower priority. What do you think?
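
Expressed client-side for illustration (the extension would emit the equivalent <link> tags server-side; the URLs here are placeholders), the proposed combination would look like:

```js
// Next page: rel=preload is a mandatory, high-priority fetch. Note that
// preload requires an `as` attribute so the browser can prioritise it.
var next = document.createElement( 'link' );
next.rel = 'preload';
next.as = 'image';
next.href = '/thumb/page-18.jpg'; // placeholder URL

// Previous page: rel=prefetch is a low-priority hint that the browser
// may act on at idle time (or not at all).
var prev = document.createElement( 'link' );
prev.rel = 'prefetch';
prev.href = '/thumb/page-16.jpg'; // placeholder URL

document.head.append( next, prev );
```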

> On Wikimedia sites, thumbnail rendering is handled by Thumbor, not MediaWiki. In this case MediaWiki does not know or care about the thumbnails; it just makes URLs. Thumbnail rendering begins when the user agent requests the image (if the requested thumbnail is not already in the Varnish HTTP cache or the Swift storage).

Thank you! On wikis with Thumbor, the ThumbnailRenderJob seems to fire an HTTP request to the image URL in order to make sure it is rendered. This job seems to be used after image upload to fill the cache with some common thumbnails. So it would cover our use case (but with the job-queue lag problems listed previously).

Sounds like a reasonable tradeoff to me.

Yep, IMO just get it done at this point and tweak it up later if we find we've over- or under-cooked it for common workflows.

BTW, do we have analytics of how long our users are taking to load pages? I'd be VERY interested to see the graph change after this change.

> So I guess rel="preload" seems the way to go for the next page image. What about using rel="prefetch" for the previous one? That way, it has a good chance of being loaded too, but with lower priority. What do you think?

Good plan. I've updated the above patch.

Change 704888 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Add prefetch links for prev and next page images

https://gerrit.wikimedia.org/r/704888

@Samwilson Thanks for your help figuring out a viable way to test this feature.
Verification steps:
1. In a browser, open the developer tools and switch to the Network tab.
2. Load the page link, e.g. https://en.wikisource.org/w/index.php?title=Page:The_New_York_Times,_1900-12-01.djvu/17&action=edit&redlink=1
3. Click the Transcribe button.
4. In the Network tab, observe that the next page's image is also fetched and cached.
See screenshot below:

ocr.jpg (screenshot of the developer tools network tab, 1 MB)

NRodriguez subscribed.

Used the browser inspector too.