
Preload the next page's image on proofreading view
Closed, ResolvedPublicFeature

Description

The Polish Wikisource preloads the next page's image while you are proofreading the current one, which speeds up load time significantly, especially on lower bandwidth. It's implemented with a local gadget (not enabled by default for now).
https://wikisource.org/wiki/MediaWiki:Gadget-preload-prp-page-image.js

You could call it over-eager loading or prefetching. Alternatively, if thumbnail generation is indeed the biggest part of the waiting time (especially for larger PDF/DjVu files, and at times of day when the imagescalers are busier), it could be enough to request/generate the thumbnail server-side without loading it on the client, so that it's ready in the cache.

Wikisource users at https://wikimania.wikimedia.org/wiki/2019:Transcription/Wrap-up and https://wikisource.org/wiki/Wikisource:Best_Practices have agreed this would be a feature worth implementing by default.

Event Timeline

For context, the "slowness" (75th percentile of processing wallclock time) varies quite wildly for ghostscript and djvu https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?orgId=1&panelId=11&fullscreen&from=now-7d&to=now : 2-3 s average and 10-15 s max is rather underwhelming.

Incidentally, the correct place to solve this is in ProofreadPage because it has knowledge of what the next page in the sequence is. However, I suspect it would need facilities from core to do this intelligently: standard prefetching requires injecting a <link rel="prefetch" href="…" /> in the page, preferably before the browser sees it. ProofreadPage could of course "cheat" the way this gadget does (good version 1.0 / proof of concept / minimum viable product goal?), but that'd be kinda complicated, inelegant, overly specific, and prone to breaking over time.
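
For illustration, the gadget-style "cheat" boils down to injecting the hint from client-side JavaScript once the next image's URL is known (a minimal sketch, not the actual gadget's code; deriving that URL from the ProofreadPage DOM is the complicated, fragile part):

```js
// Minimal sketch of the client-side "cheat" (illustrative, not the
// gadget's actual code). Deriving the next page's thumbnail URL from
// the ProofreadPage DOM is the brittle, overly specific part.
function prefetchNextPageImage( nextImageUrl ) {
	var link = document.createElement( 'link' );
	link.rel = 'prefetch';
	link.href = nextImageUrl;
	document.head.appendChild( link );
}
```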

On the upside, combining this with async (AJAX) save and preview could potentially give literal orders of magnitude improvement in perceived performance in the critical workflow for Wikisource!

This is a great idea! Adding this prefetch should be indeed fairly easy on the ProofreadPage side.

Change 550568 had a related patch set uploaded (by Tpt; owner: Tpt):
[mediawiki/extensions/ProofreadPage@master] Adds a rel=prefetch for the next Page: page

https://gerrit.wikimedia.org/r/550568

Tpt changed the subtype of this task from "Task" to "Feature Request". Nov 14 2019, 9:19 PM

Change 550568 merged by jenkins-bot:
[mediawiki/extensions/ProofreadPage@master] Adds a rel=prefetch for the previous and next Page: page

https://gerrit.wikimedia.org/r/550568

Thanks!

For context, the "slowness" (75th percentile of processing wallclock time) varies quite wildly for ghostscript and djvu https://grafana.wikimedia.org/d/0fj55kRZz/thumbor?orgId=1&panelId=11&fullscreen&from=now-7d&to=now : 2-3 s average and 10-15 s max is rather underwhelming.

That graph is being deleted as it has not been updated for over a year, so I'm archiving a screenshot here.

Screenshot_20201130_120944.png (archived screenshot of the Grafana thumbnail-latency graph, 124 KB)

Change 704888 had a related patch set uploaded (by Samwilson; author: Samwilson):

[mediawiki/extensions/ProofreadPage@master] Add prefetch links for prev/next page images

https://gerrit.wikimedia.org/r/704888

I think we should use preload rather than prefetch, because it more reliably triggers an actual request for the image (which kicks off the thumbnail rendering). In my testing, prefetch doesn't always do anything (I think because it waits for browser idle time). I seem to remember we had a conversation about this some years ago, but I can't find it now nor remember the reasoning.

If I remember correctly, preload asks the browser to aggressively fetch content for the current page's rendering, while prefetch asks the browser to load content for future use in navigation.

Yes, indeed, preload will work better for images because they are going to be saved in the browser cache and so reused for the next pages, and we are nearly sure they will be loaded.
However, using it might slow down navigation on lower network speeds. It is also possible that some browsers still preload even when a data-saving mode is turned on. So, it might even be detrimental for people on slow or pricey connections.

So, I would be more in favor of closely following the spec and using prefetch ("preemptively fetching and caching the specified resource is likely to be beneficial") rather than preload ("the user agent must preemptively fetch and cache the specified resource"), to be sure we don't harm the users we want to help in the process.

The main goal here is to perform server-side page thumbnail generation in advance. If this can be done without preloading, that is OK; I am not sure whether prefetch would provide this.
Thumbnail generation has been identified as the most time-consuming part of the "next page" loading process, so making the next thumbnail ready server-side is the main goal.

Other client-side benefits are less important and can likely be neglected. IMO, only users with low-speed network connections gain significantly from loading the next page's image while the current page is being rendered or has already been rendered. Loading the next image should not delay page rendering in the browser (assuming the client has enough resources to render the page and load the next image simultaneously).

Concerning the cost of paid connections: what is the main use case of Page namespace pages? The extra cost is high if people load a single page or a few pages at random. If they load 10 pages in a series, the preloading cost is 10% (11 instead of 10 images loaded). If they load 100 consecutive pages, the cost is 1%.

At the moment the feature might be useless on mobile, as the Page namespace is hardly usable on small screens.

But as I mentioned: the main goal is to generate the thumbnail for the next page and make it ready for download server-side. Anything else is secondary.

@Ankry Thank you! So, I guess that the main goal is to make sure that the image thumbnail is already generated when a Page: page is displayed.

The File::transform method allows setting the option File::RENDER_NOW. It might be relevant to set it when rendering a Page: page; this way, we would make sure that the image rendering is started even before the HTML is returned to the client. But File::RENDER_NOW seems to generate the image in the current thread, so it would significantly slow down HTML generation if we naively ran it when rendering the HTML. A possible alternative is to register a ThumbnailRenderJob so that a rendering job is launched early. But that might add quite a lot of weight to the job queue without much gain if the job queue has more than a few seconds of lag.

If the thumbnail generation cannot be requested asynchronously, then (IMO), to avoid delaying presentation of the current page's HTML to the user (and to use the time while the page is being downloaded/rendered by the browser for thumbnail generation), it is better to request the next thumbnail's generation from some JavaScript code invoked after the page is already displayed in the browser. This is the approach the current pl.ws/it.ws gadgets seem to take.

If you think that prefetch may also give this result in some cases, that would be OK too. But both only if we do not need to worry about race conditions in thumbnail generation (or, e.g., the risk of being blocked due to too many requests).

Note that an HTTP HEAD request should be enough to generate (and, crucially, cache) the thumbnail; the client doesn't have to download the image data.
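
As a concrete sketch (thumbUrl is an assumed variable holding the next page's thumbnail URL, not something from the task):

```js
// Warm the server-side thumbnail cache without transferring the image
// body: a HEAD request makes Thumbor render and cache the thumbnail,
// while only the response headers come back to the client.
fetch( thumbUrl, { method: 'HEAD' } ).catch( function () {
	// Best-effort optimisation: ignore network errors.
} );
```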

I think we're overcomplicating this.

Page: namespace in edit mode isn't usable on anything smaller than a tablet (so 99.9% of cellphone users are irrelevant, and the 0.1% left have more pressing problems than the extra thumbnail). That eliminates most scenarios where outlier slow connection speeds or metered connections are factors. As Ankry notes above, except in pathological cases, at most one image will be loaded unnecessarily. And the typical size of a page thumbnail is so tiny that if you load the front page of a single news site you'll download many times that in JavaScript alone.

Heck, even compared to the weight of a single wiki page the thumbnails ain't all that. A typical 1024px thumbnail is maybe 300–400kB, but a quick check suggests we pull down 450kB of CSS and JS on every single page load, even on pages that aren't heavy with templates and TemplateStyles.

Meanwhile, preloading the page image warms the caches of both Thumbor and the web browser, reducing latency in the hot path for interactive use by orders of magnitude (seconds to milliseconds) for however many pages they proofread in a session. In particular, it potentially eliminates users having to wait for the pathological cases where Ghostscript takes 20+ seconds to render a single thumbnail.

Absent a concrete real-world example of a problem created by preloading the next page image, I don't think it's worth worrying about.

I agree in principle: thumbnail downloading is not a major cause for concern in the context. But JS caching does actually mean that the total JS transferred on navigating "next page" is roughly zero. For an un-logged-in user (i.e. one without a user JS full of AJAX requests), there are only three uncached requests on loading the next page: the main GET (~10KB), 2kB of "intake-analytics" and the image (generally 100-400kB).

That said, MediaWiki:Gadget-Preload_Page_Images.js does indeed download the entire image to the client, which, as you say, has the benefit of warming the client image cache too and "front-loading" the network transfer into the slack time while the user is faffing about on the current page. (An HTTP HEAD will warm the Thumbor cache, but the client still needs to request, transfer, and render the image later.)

If there is truly a concern here, gate it behind a mobile-skin check and let mobile users (if there even are any in Page NS) eat the thumbnail+network+client render delay. That's trivial enough on both server and client side. But mobile users won't really be on this page until we make Page NS functional on mobile, so ¯\_(ツ)_/¯.

I think I agree with you all! I'm still in favour of preload rather than prefetch, because I think it has the most benefit for the most users. Some users will waste data, perhaps, but in the scheme of the modern web I think an average of under half a MB is nearly nothing. If the width param of an Index page is large, of course, this could increase. It feels like we could implement this, and see how it fares in real-world use, and iterate as required.

I had a play with sending a HEAD request for the next image, and that does seem to work (i.e. is quick to do, and subsequent requests are quicker than they otherwise would be), but ultimately I think it's not a good idea to write our own code to do this when browsers are supposed to support this exact use case.

One question: do we want to preload the previous page's image as well? We're already prefetching the previous page. (Personally, I like to work backwards in a work, so find it useful, but I'm perhaps in a minority.)

> @Ankry Thank you! So, I guess that the main goal is to make sure that the image thumbnail is already generated when a Page: page is displayed.
>
> The File::transform method allows setting the option File::RENDER_NOW. It might be relevant to set it when rendering a Page: page; this way, we would make sure that the image rendering is started even before the HTML is returned to the client. But File::RENDER_NOW seems to generate the image in the current thread, so it would significantly slow down HTML generation if we naively ran it when rendering the HTML. A possible alternative is to register a ThumbnailRenderJob so that a rendering job is launched early. But that might add quite a lot of weight to the job queue without much gain if the job queue has more than a few seconds of lag.

On Wikimedia sites, thumbnail rendering is handled by Thumbor, not MediaWiki. In this case MediaWiki does not know or care about the thumbnails; it just makes URLs. Thumbnail rendering begins when the user agent requests the image (if the requested thumbnail is not already in the Varnish HTTP cache or the Swift storage).
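
For illustration, a DjVu page thumbnail URL on Wikimedia sites looks roughly like the one below (the pattern is recalled from memory, so treat it as illustrative); the first request for such a URL is what makes Thumbor render it:

```js
// MediaWiki only constructs the URL; Thumbor renders the thumbnail the
// first time something actually requests it (absent a cache hit).
var thumbUrl = 'https://upload.wikimedia.org/wikipedia/commons/thumb/' +
	'a/ab/Example.djvu/page17-1024px-Example.djvu.jpg'; // illustrative
```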

> One question: do we want to preload the previous page's image as well? We're already prefetching the previous page. (Personally, I like to work backwards in a work, so find it useful, but I'm perhaps in a minority.)

There's no real harm in that IMO. Whichever way the user is navigating, the last page they were on will still be in the cache. So they will preload only one extra image per "straight run" of pages:

Editing 2, 3, 4:

1 2 3 4 
| ^ ^ ^
\ "wasted"

Editing 12, 11, 10:

10 11 12 13
 ^  ^  ^  |
          \ "wasted"

Moreover, it's quite likely that image will eventually be used in a session anyway (I know I personally bounce all over the place), so it will help that page be snappier too.

> … an average of under half a MB is nearly nothing.

Half a meg per proofreading session, not per page load, let's be very clear. 0.5 MB actually wasted per page load would be kinda bad (expect the Performance Team Ninja Hit Squad to kick down your door), but since it's only the last preloaded image in a session that is actually wasted, by 10 pages the overhead (≤ 50 kB/page avg.) begins to be negligible, and above that it completely disappears in the random variability of other factors. And this is probably the worst of the realistic scenarios: most page thumbnails are not going to be 500 kB+.

> do we want to preload the previous page's image as well? We're already prefetching the previous page. (Personally, I like to work backwards in a work, so find it useful, but I'm perhaps in a minority.)

My impression is that this is a minority workflow, so if we're worried about the data overhead the cost-benefit might not work out. Personally I think this interaction is so critical on Wikisource that optimising it easily justifies trading off bandwidth efficiency on the order we're talking about here (if I ever get around to stealing Alex's awesome eis I would absolutely set up not just preloading but even prerendering of several pages in both directions).

Thank you @Ankry, @Xover, @Inductiveload! So I guess rel="preload" seems the way to go for the next page image. What about using rel="prefetch" for the previous one? That way, it has a good chance of being loaded too, but with lower priority. What do you think?
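
Expressed client-side for illustration (the extension would emit the equivalent <link> tags server-side; the URLs here are placeholders), the proposed combination would look like:

```js
// Next page: rel=preload is a mandatory, high-priority fetch. Note that
// preload requires an `as` attribute so the browser can prioritise it.
var next = document.createElement( 'link' );
next.rel = 'preload';
next.as = 'image';
next.href = '/thumb/page-18.jpg'; // placeholder URL

// Previous page: rel=prefetch is a low-priority hint that the browser
// may act on at idle time (or not at all).
var prev = document.createElement( 'link' );
prev.rel = 'prefetch';
prev.href = '/thumb/page-16.jpg'; // placeholder URL

document.head.append( next, prev );
```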

> On Wikimedia sites, thumbnail rendering is handled by Thumbor, not MediaWiki. In this case MediaWiki does not know or care about the thumbnails; it just makes URLs. Thumbnail rendering begins when the user agent requests the image (if the requested thumbnail is not already in the Varnish HTTP cache or the Swift storage).

Thank you! On wikis with Thumbor, the ThumbnailRenderJob seems to fire an HTTP request to the image URL in order to make sure it is rendered. This job seems to be used after image upload to fill the cache with some common thumbnails. So it would cover our use case (but with the job-queue lag problems listed previously).

Sounds like a reasonable tradeoff to me.

Yep, IMO just get it done at this point and tweak it up later if we find we've over- or under-cooked it for common workflows.

BTW, do we have analytics of how long our users are taking to load pages? I'd be VERY interested to see the graph change after this change.

> So I guess rel="preload" seems the way to go for the next page image. What about using rel="prefetch" for the previous one? That way, it has a good chance of being loaded too, but with lower priority. What do you think?

Good plan. I've updated the above patch.

Change 704888 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Add prefetch links for prev and next page images

https://gerrit.wikimedia.org/r/704888

@Samwilson Thanks for your help figuring out a viable way to test this feature.
Verification steps:
1. In a browser, open the developer tools and switch to the Network tab.
2. Load the page link, e.g. https://en.wikisource.org/w/index.php?title=Page:The_New_York_Times,_1900-12-01.djvu/17&action=edit&redlink=1
3. Click the Transcribe button.
4. In the Network tab, observe that the next page's image is also fetched and cached.
See screenshot below:

ocr.jpg (screenshot of the developer tools network tab, 1 MB)

NRodriguez subscribed.

Used the browser inspector too.