
ProofreadPage frontend makes requests to the previous and next pages on every page view
Closed, Resolved, Public

Description

Open https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/52 and watch the requests in your browser's developer tools.

If you check the HTML document requests, you'll see that it also loads https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/51 and https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/53. If the frontend code needs some information about the pages before and after, it should use the API instead. As it stands, this triggers a parse whose output doesn't even get shown to the user.
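For illustration, here is roughly what such an API request could look like, assuming the frontend only needs existence-style metadata about the adjacent pages (what it actually needs is not spelled out here); a minimal TypeScript sketch:

```typescript
// Sketch: fetch metadata about the previous and next Page: titles in one
// Action API round trip, instead of loading the fully rendered pages.
const titles = [
  'Page:Popular_Science_Monthly_Volume_2.djvu/51',
  'Page:Popular_Science_Monthly_Volume_2.djvu/53',
].join('|');

const url = 'https://en.wikisource.org/w/api.php?' + new URLSearchParams({
  action: 'query',
  prop: 'info',
  titles,
  format: 'json',
  formatversion: '2',
});

const data = await (await fetch(url)).json();
// With formatversion=2, data.query.pages is an array; a `missing` flag
// tells us whether an adjacent page is still a redlink.
console.log(data.query.pages);
```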

Event Timeline

rel=prefetch hints targeting the previous and next pages were added in T230689 to speed up bulk proofreading.
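For context, such a hint boils down to something like the following (shown as client-side DOM code purely for illustration; ProofreadPage emits the actual markup server-side, and the URLs here are the ones from the task description):

```typescript
// Roughly what the rel=prefetch hints amount to: the browser is told to
// fetch the adjacent wikipages in the background at low priority.
for (const href of [
  '/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/51',
  '/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/53',
]) {
  const link = document.createElement('link');
  link.rel = 'prefetch';
  link.href = href;
  document.head.appendChild(link);
}
```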

It is probably worth revisiting, yes, but note that interactive performance here depends on dynamically generated "thumbnail" images, where the generation involves shelling out to ghostscript (for PDFs) and ddjvu (for DjVu) to extract a given page from a possibly ~1GB multi-page document and render it to a JPEG. Ghostscript, in particular, can (anecdotally) take ~20 seconds to do this.
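To make the cost concrete, the shell-outs look roughly like this. The gs and ddjvu flags are the standard ones for those tools; the wrapper functions are illustrative, not the actual thumbnailing code, and Node-flavoured TypeScript is used only to keep all examples in one language:

```typescript
// Illustrative single-page extraction, not the real thumbnail pipeline.
import { execFileSync } from 'node:child_process';

// DjVu: extract one page of a multi-page document as TIFF (a later step
// would scale/convert it to JPEG).
function renderDjvuPage(src: string, page: number, out: string): void {
  execFileSync('ddjvu', ['-format=tiff', `-page=${page}`, src, out]);
}

// PDF: have Ghostscript render exactly one page straight to JPEG. This is
// the step that can anecdotally take ~20 seconds on a large document.
function renderPdfPage(src: string, page: number, out: string): void {
  execFileSync('gs', [
    '-sDEVICE=jpeg',
    `-dFirstPage=${page}`,
    `-dLastPage=${page}`,
    '-o', out,
    src,
  ]);
}
```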

From an interactive performance perspective, per-page thumbnails would ideally be pre-generated so that only the normal image loading latency was a factor (and even then it might be desirable to prime the client-side cache). If one is willing to trade off the disk space, pre-generating all pages would be good; but even pre-generating 5–10 pages before and after the current one would be a massive improvement. The current prefetching is a compromise: it tries to induce the client to make a request that causes the backend to pre-generate at least ±1 page, which is as far as you could reasonably go within the scope of PRP (or at least I think you'd need to touch Thumbor to take it further).
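As a sketch of that compromise extended to ±N pages: the client could fire low-priority requests for the adjacent page images so the backend generates (and caches) them ahead of time. The thumbnail URL scheme below is purely hypothetical; the real one differs:

```typescript
// Hypothetical thumbnail URL builder; the real URL scheme is different.
function thumbUrl(file: string, page: number, width: number): string {
  return 'https://en.wikisource.org/w/thumb.php?'
    + new URLSearchParams({ f: file, page: String(page), w: String(width) });
}

// Warm the thumbnail cache for the N pages on either side of the current
// one. Generating the image is the expensive part, so merely issuing the
// request is what matters; the responses can be discarded.
async function warmThumbnails(file: string, current: number, n = 5): Promise<void> {
  const pages: number[] = [];
  for (let i = 1; i <= n; i++) {
    pages.push(current - i, current + i);
  }
  await Promise.allSettled(
    pages.filter((p) => p >= 1).map((p) => fetch(thumbUrl(file, p, 1024))),
  );
}
```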

Keep in mind that the ProofreadPage/Wikisource workflow is inherently sequential. Once a Wikipedia article is loaded you just scroll vertically as needed, as fast as your client device can manage, but in the context of Wikisource and PRP, page loads happen essentially at the paragraph level (not even a section, much less an entire article).

Oh, and the prefetching only happens in the Page: namespace, so it's only relevant for contributors (active logged-in Wikisource users). Anyone just browsing will never see pages in the Page: namespace.

> It is probably worth revisiting, yes, but note that interactive performance here depends on dynamically generated "thumbnail" images, where the generation involves shelling out to ghostscript (for PDFs) and ddjvu (for DjVu) to extract a given page from a possibly ~1GB multi-page document and render it to a JPEG. Ghostscript, in particular, can (anecdotally) take ~20 seconds to do this.

Then it could make a request to the thumbnail instead of the page. I think that would at least reduce the pressure on the appservers.

> Then it could make a request to the thumbnail instead of the page. I think that would at least reduce the pressure on the appservers.

Sure. The page image ("thumbnail") is the most critical, since that is far and away what takes longest to load. Without having measured it, I suspect that for the rest the fetch latency mostly disappears behind CSS and JS rendering (and I'm pretty sure JS is causing multiple repaints, so that's where the biggest bang for the buck is). The "next" wikipage is also a redlink in the majority of cases, so there's no real "parse" involved: it's just fetching the UI, UI assets, the page image, and the text layer (from image metadata).
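A sketch of that narrower idea, prefetching just the adjacent page images rather than the whole wikipages (the image URLs would in practice be read from the rendered page; none are assumed here):

```typescript
// Prime the browser's image cache (and, as a side effect, the backend
// thumbnail cache) with only the adjacent page images.
function primePageImages(urls: string[]): void {
  for (const url of urls) {
    const img = new Image();
    img.src = url; // assigning src is enough to trigger the fetch
  }
}
```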

But this is still a sequential workflow, and loading the next Page: page is enough of a pain point that there are multiple community Gadgets that preload the next page (see the sketch below). Those will continue to be used unless and until loading the next page becomes effectively instantaneous (think AMP or SPA). If the goal is to reduce the load on the app servers, and PRP actually rises above the noise floor there, then some serious optimisation will have to be undertaken at some point.
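For reference, such a gadget boils down to something like this (the selector for the next-page link is made up here; real community gadgets locate it in their own ways):

```typescript
// Sketch of a "preload the next page" gadget. `mw` is MediaWiki's
// client-side global; the a.prp-next-link selector is hypothetical.
declare const mw: { hook(name: string): { add(fn: () => void): void } };

mw.hook('wikipage.content').add(() => {
  const next = document.querySelector<HTMLAnchorElement>('a.prp-next-link');
  if (next) {
    // Warm the HTTP cache so following the link is near-instant.
    fetch(next.href);
  }
});
```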

(But on the flip side, editing and viewing patterns on Wikisource are so different from, say, Wikipedia's that there is a lot of scope for effective optimization, depending on what you need to optimize for. E.g. mainspace pages are so rarely edited that they could in practice be fully protected and served statically, and Index:/Page:-namespace pages are typically edited in a relatively brief flurry, mostly sequentially, and then never touched again. Only the project namespace, Wikisource:, really follows the normal pattern of e.g. Wikipedia:. A heat map of activity across wikipages would show small hotspots against an otherwise completely cool, calm background. (Hmm. I should actually try to make that heatmap. It'd be an interesting visualisation of the differences between projects.))

matmarex subscribed.

As noted in T320640, this also causes confusing behavior with the watchlist, where if you visit any book page, the previous and next pages are also marked as visited.

Change 844912 had a related patch set uploaded (by Tpt; author: Tpt):

[mediawiki/extensions/ProofreadPage@master] Drops prefetching previous and next pages

https://gerrit.wikimedia.org/r/844912

Change 844912 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Drops prefetching previous and next pages

https://gerrit.wikimedia.org/r/844912

Ladsgroup assigned this task to Tpt.