
ProofreadPage frontend makes requests to the previous and next pages on every page view
Closed, Resolved, Public

Description

Open https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/52 and watch the requests in your browser's developer tools.

If you check the HTML document requests, you'll see that it also loads https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/51 and https://en.wikisource.org/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/53. If the frontend code needs some information about the pages before and after, it should use the API instead. As it stands, this triggers a parse whose output doesn't even get shown to the user.
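For illustration, here is roughly what such an API request could look like, assuming the frontend only needs existence-style metadata about the adjacent pages (what it actually needs is not spelled out here); a minimal TypeScript sketch:

```typescript
// Sketch: fetch metadata about the previous and next Page: titles in one
// Action API round trip, instead of loading the fully rendered pages.
const titles = [
  'Page:Popular_Science_Monthly_Volume_2.djvu/51',
  'Page:Popular_Science_Monthly_Volume_2.djvu/53',
].join('|');

const url = 'https://en.wikisource.org/w/api.php?' + new URLSearchParams({
  action: 'query',
  prop: 'info',
  titles,
  format: 'json',
  formatversion: '2',
});

const data = await (await fetch(url)).json();
// With formatversion=2, data.query.pages is an array; a `missing` flag
// tells us whether an adjacent page is still a redlink.
console.log(data.query.pages);
```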

Event Timeline

rel=prefetch hints targeting the previous and next pages were added in T230689 to speed up bulk proofreading.
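For context, such a hint boils down to something like the following (shown as client-side DOM code purely for illustration; ProofreadPage emits the actual markup server-side, and the URLs here are the ones from the task description):

```typescript
// Roughly what the rel=prefetch hints amount to: the browser is told to
// fetch the adjacent wikipages in the background at low priority.
for (const href of [
  '/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/51',
  '/wiki/Page:Popular_Science_Monthly_Volume_2.djvu/53',
]) {
  const link = document.createElement('link');
  link.rel = 'prefetch';
  link.href = href;
  document.head.appendChild(link);
}
```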

It is probably worth revisiting, yes, but note that interactive performance here depends on dynamically generated "thumbnail" images, where the generation involves shelling out to ghostscript (for PDFs) and ddjvu (for DjVu) to extract a given page from a possibly ~1GB multi-page document and render it to a JPEG. Ghostscript, in particular, can (anecdotally) take ~20 seconds to do this.
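To make the cost concrete, the shell-outs look roughly like this. The gs and ddjvu flags are the standard ones for those tools; the wrapper functions are illustrative, not the actual thumbnailing code, and Node-flavoured TypeScript is used only to keep all examples in one language:

```typescript
// Illustrative single-page extraction, not the real thumbnail pipeline.
import { execFileSync } from 'node:child_process';

// DjVu: extract one page of a multi-page document as TIFF (a later step
// would scale/convert it to JPEG).
function renderDjvuPage(src: string, page: number, out: string): void {
  execFileSync('ddjvu', ['-format=tiff', `-page=${page}`, src, out]);
}

// PDF: have Ghostscript render exactly one page straight to JPEG. This is
// the step that can anecdotally take ~20 seconds on a large document.
function renderPdfPage(src: string, page: number, out: string): void {
  execFileSync('gs', [
    '-sDEVICE=jpeg',
    `-dFirstPage=${page}`,
    `-dLastPage=${page}`,
    '-o', out,
    src,
  ]);
}
```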

From an interactive performance perspective, per-page thumbnails would ideally be pre-generated so that only the normal image loading latency was a factor (and even then it might be desirable to prime the client-side cache). If one is willing to trade off the disk space, pre-generating all pages would be good; but even pre-generating 5–10 pages before and after the current one would be a massive improvement. The current prefetching is a compromise: it tries to induce the client to make a request that causes the backend to pre-generate at least ±1 page, which is as far as you could reasonably go within the scope of PRP (or at least I think you'd need to touch Thumbor to take it further).
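As a sketch of that compromise extended to ±N pages: the client could fire low-priority requests for the adjacent page images so the backend generates (and caches) them ahead of time. The thumbnail URL scheme below is purely hypothetical; the real one differs:

```typescript
// Hypothetical thumbnail URL builder; the real URL scheme is different.
function thumbUrl(file: string, page: number, width: number): string {
  return 'https://en.wikisource.org/w/thumb.php?'
    + new URLSearchParams({ f: file, page: String(page), w: String(width) });
}

// Warm the thumbnail cache for the N pages on either side of the current
// one. Generating the image is the expensive part, so merely issuing the
// request is what matters; the responses can be discarded.
async function warmThumbnails(file: string, current: number, n = 5): Promise<void> {
  const pages: number[] = [];
  for (let i = 1; i <= n; i++) {
    pages.push(current - i, current + i);
  }
  await Promise.allSettled(
    pages.filter((p) => p >= 1).map((p) => fetch(thumbUrl(file, p, 1024))),
  );
}
```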

Keep in mind that the ProofreadPage/Wikisource workflow is inherently sequential. Once a Wikipedia article is loaded you just scroll vertically as needed, as fast as your client device can manage, but in the context of Wikisource and PRP, page loads happen essentially at the paragraph level (not even a section, much less an entire article).

Oh, and the prefetching only happens in the Page: namespace, so it's only relevant for contributors (active logged-in Wikisource users). Anyone just browsing will never see pages in the Page: namespace.

> It is probably worth revisiting, yes, but note that interactive performance here depends on dynamically generated "thumbnail" images, where the generation involves shelling out to ghostscript (for PDFs) and ddjvu (for DjVu) to extract a given page from a possibly ~1GB multi-page document and render it to a JPEG. Ghostscript, in particular, can (anecdotally) take ~20 seconds to do this.

Then it could make a request to the thumbnail instead of the page. I think that would at least reduce the pressure on the appservers.

> Then it could make a request to the thumbnail instead of the page. I think that would at least reduce the pressure on the appservers.

Sure. The page image ("thumbnail") is the most critical, since that is far and away what takes longest to load. Without having measured it, I suspect that for the rest the fetch latency mostly disappears behind CSS and JS rendering (and I'm pretty sure JS is causing multiple repaints, so that's where the biggest bang for the buck is). The "next" wikipage is also a redlink in the majority of cases, so there's no real "parse" involved: it's just fetching the UI, UI assets, the page image, and the text layer (from image metadata).
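A sketch of that narrower idea, prefetching just the adjacent page images rather than the whole wikipages (the image URLs would in practice be read from the rendered page; none are assumed here):

```typescript
// Prime the browser's image cache (and, as a side effect, the backend
// thumbnail cache) with only the adjacent page images.
function primePageImages(urls: string[]): void {
  for (const url of urls) {
    const img = new Image();
    img.src = url; // assigning src is enough to trigger the fetch
  }
}
```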

But this is still a sequential workflow, and loading the next Page: page is enough of a pain point that there are multiple community Gadgets that preload the next page (see the sketch below). Those will continue to be used unless and until loading the next page becomes effectively instantaneous (think AMP or SPA). If the goal is to reduce the load on the app servers, and PRP actually rises above the noise floor there, then some serious optimisation will have to be undertaken at some point.
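For reference, such a gadget boils down to something like this (the selector for the next-page link is made up here; real community gadgets locate it in their own ways):

```typescript
// Sketch of a "preload the next page" gadget. `mw` is MediaWiki's
// client-side global; the a.prp-next-link selector is hypothetical.
declare const mw: { hook(name: string): { add(fn: () => void): void } };

mw.hook('wikipage.content').add(() => {
  const next = document.querySelector<HTMLAnchorElement>('a.prp-next-link');
  if (next) {
    // Warm the HTTP cache so following the link is near-instant.
    fetch(next.href);
  }
});
```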

(But on the flip side, editing and viewing patterns on Wikisource are so different from, say, Wikipedia's that there is a lot of scope for effective optimization, depending on what you need to optimize for. E.g. mainspace pages are so rarely edited that they could in practice be fully protected and served statically, and Index:/Page:-namespace pages are typically edited in a relatively brief flurry, mostly sequentially, and then never touched again. Only the project namespace, Wikisource:, really follows the normal pattern of e.g. Wikipedia:. A heat map of activity across wikipages would show small hotspots against an otherwise completely cool, calm background. (Hmm. I should actually try to make that heatmap. It'd be an interesting visualisation of the differences between projects.))

matmarex subscribed.

As noted in T320640, this also causes confusing behavior with the watchlist, where if you visit any book page, the previous and next pages are also marked as visited.

Change 844912 had a related patch set uploaded (by Tpt; author: Tpt):

[mediawiki/extensions/ProofreadPage@master] Drops prefetching previous and next pages

https://gerrit.wikimedia.org/r/844912

Change 844912 merged by jenkins-bot:

[mediawiki/extensions/ProofreadPage@master] Drops prefetching previous and next pages

https://gerrit.wikimedia.org/r/844912

Ladsgroup assigned this task to Tpt.