Add a better/more capable PDF/DJVU/Multipage file viewer
Open, Needs Triage, Public, Feature

Description

Feature summary (what you would like to be able to do and where): The current multipage file viewing interface is old and basically unmaintained, and it is inadequate for viewing multipage files because the previews are extremely small.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution): Anyone double-checking uploads of PDF/DjVu files on Commons, Wikisource users, and anyone who wishes to read PDF files on Commons.

Benefits (why should this be implemented?):

A) The current viewer is a mess (from a technical POV) and uses techniques that were state of the art 9 years ago. It tries to parse HTML and apparently builds a separate in-memory cache for HTML/images.
B) A newer interface would let contributors view the files properly without having to download them to their own computers.

Notes
This was recently discussed in the Wikisource global Telegram group.

Details

Related Changes in Gerrit:

Event Timeline

Soda updated the task description.
Soda edited subscribers, added: VIGNERON, Bodhisattwa; removed: Ruthven.

It's beyond the scope of this task, but I'm dropping a note about it here as it's a relevant factor to consider in designing and implementing a new modernized multipage viewer. This will appear in a separate Phab task at some point when I've chewed it over sufficiently to make sense. In any case…

Our current approach of wrapping scans in a container like PDF or DjVu is both awkward and unsustainable: PDFs are not designed for our purposes (and it shows in lots of little problems and poor quality), and DjVu, while an excellent file format, has extremely limited upstream support (no ecosystem, no new feature development, limited number of not user-friendly tools, etc.). In addition, architecturally speaking, PDF and DjVu are both derived products from individual image originals (.jp2 mostly at IA, PNG and JPEG at GBooks, etc.).

In addition, non-Wikisource projects usually need a single page (title page, frontispiece, plate, etc.) while the Wikisourcen frequently need to go back to the original images to extract graphical elements or suffer generational loss extracting it (usually in a not very user friendly way) from the PDF/DjVu. Any time we need to reorder pages we have to download, extract, modify, reupload; vs. something as simple as changing a sortkey.

All this points toward a future where we stop uploading DjVu/PDF as the primary / preferred format for scans, and instead upload raw upstream scans as .jp2/.jpg/.png. We'll need MediaWiki changes of some sort for that. Maybe as simple as a category plus a magic word to trigger the page viewer and provide a single endpoint for PRP. Possibly MCR with 1k+ slots, but more likely MCR would be used to store the text layer for each image. Possibly a whole new content model / namespace to provide the container features. This is the bit that's still kinda hazy (feel free to hit me up on my talk if you're interested in this).

In any case… this is way way beyond the scope of this task, but for anyone tackling this task it might be worthwhile to keep in mind the possibility of a future where the multipage file viewer is not backed by a single PDF/DjVu file but rather a "something" that is a thin software layer encapsulating (providing a container for) actually separate page images. I don't think there's any conflict there, but it might make it worthwhile to consider abstracting the source of the images a little more from the current backend.
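To make the "abstract the source of the images" idea a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not a proposal for actual MediaWiki code: `PageSource`, `ContainerSource`, `ImageSetSource`, the `#page=` reference format, and the sort-key scheme are all hypothetical names invented for this example. The point is only that a viewer written against the abstract interface would not care whether pages come from one PDF/DjVu file or from separately uploaded images.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


class PageSource(ABC):
    """Hypothetical abstraction: anything that can hand out ordered page images."""

    @abstractmethod
    def page_count(self) -> int:
        """Number of pages available."""

    @abstractmethod
    def page_ref(self, number: int) -> str:
        """Return an opaque reference to the image for a 1-based page number."""


@dataclass
class ContainerSource(PageSource):
    """Pages live inside a single PDF/DjVu file (roughly today's model)."""
    file_name: str
    pages: int

    def page_count(self) -> int:
        return self.pages

    def page_ref(self, number: int) -> str:
        if not 1 <= number <= self.pages:
            raise IndexError(number)
        # The '#page=' format is made up for illustration only.
        return f"{self.file_name}#page={number}"


@dataclass
class ImageSetSource(PageSource):
    """Pages are separate image files, ordered by a per-file sort key."""
    sortkeys: dict[str, str] = field(default_factory=dict)  # file name -> sort key

    def _ordered(self) -> list[str]:
        # Order by sort key, falling back to file name to break ties.
        return sorted(self.sortkeys, key=lambda name: (self.sortkeys[name], name))

    def page_count(self) -> int:
        return len(self.sortkeys)

    def page_ref(self, number: int) -> str:
        ordered = self._ordered()
        if not 1 <= number <= len(ordered):
            raise IndexError(number)
        return ordered[number - 1]
```

With something along these lines, reordering pages in the image-backed case is just a sort-key change (for example `book.sortkeys["scan_a.jp2"] = "000"`) rather than a download, extract, modify, reupload cycle.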

"All this points toward a future where we stop uploading DjVu/PDF as the primary / preferred format for scans, and instead upload raw upstream scans as .jp2/.jpg/.png."

Hear hear. I've had similar thoughts for years. Especially as we promote more digitization and transcription of manuscripts where we want people to be storing the full original files on Commons, it's going to be necessary to have better ways to support JPEG-backed scans (the part I find most frustrating at the moment is the pagelist creation).

But yeah, as you say, this task can still do a lot to improve the ways that we work with multipage files.

Change 960229 had a related patch set uploaded (by Sohom Datta; author: Sohom Datta):

[mediawiki/core@master] Add thumbnail size as config variables

https://gerrit.wikimedia.org/r/960229

@TheDJ Continuing the conversation from the patch here: it doesn't seem to be possible to pass only the page number to the imageinfo API. This is what I get when I try to request the URL for page 15 of a book :(
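For reference, a rough sketch of the kind of request involved, in Python. As far as I understand the imageinfo module, the page is selected through the handler-specific `iiurlparam` string (of the form `page15-100px`), which has to be used together with, and be consistent with, `iiurlwidth`; that is why the page number cannot be sent on its own. The helper name `imageinfo_query`, the file title, and the 800px width below are made-up illustration values, and the sketch only builds the query URL without sending it.

```python
from urllib.parse import urlencode


def imageinfo_query(api_url: str, file_title: str, page: int, width: int) -> str:
    """Build an imageinfo query URL for one page of a multipage file.

    Assumption: the width appears both in iiurlwidth and inside
    iiurlparam, since the two are expected to agree.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "imageinfo",
        "titles": file_title,
        "iiprop": "url",
        "iiurlwidth": width,
        "iiurlparam": f"page{page}-{width}px",
    }
    return f"{api_url}?{urlencode(params)}"


# Hypothetical file title and width, for illustration only.
url = imageinfo_query(
    "https://commons.wikimedia.org/w/api.php", "File:Example.djvu", 15, 800
)
```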

Change 960229 abandoned by Sohom Datta:

[mediawiki/core@master] Add thumbnail size as config variables

Reason:

Per discussion, this isn't useful for the task.

https://gerrit.wikimedia.org/r/960229