
Provide a way of serving high quality scans on a per-page basis at Wikisource (such as those hosted at an external source)
Open, Needs Triage · Public · Feature

Description

What is the problem or limitation:

English Wikisource makes extensive use of scanned works. These are sourced from various locations including a large number that are hosted as scans by Internet Archive.

The current approach has been to use a provided DjVu (or PDF) for this purpose, uploaded to Commons or English Wikisource, and in most situations this has sufficed for simple works.

However, for some works (and scans) the quality of the PDF or DjVu obtained is necessarily lower than that of the original scans (which are typically in JPEG2000 format), and this quality is further degraded by conversions between the various formats in the toolchain (see T256848 and T224355 for related issues).

What is the functionality you would like:

The ability to specify an IA-style identifier in a suitable field on an Index: page so that, regardless of whether a local file (PDF/DjVu) is hosted, the back end (such as ProofreadPage or Thumbor) dynamically pulls the high-quality JPEG2000 scan for the specified page from the original on the IA server, without the high-quality scans needing to be hosted locally or on Commons, down-converting the scan to conventional JPEG with minimal loss of quality as required. (An image generated from a locally hosted PDF/DjVu would still be available as a fallback.)
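The resolution order described above could be sketched roughly as follows. This is purely illustrative; all function and field names are invented, and the real implementation would live inside ProofreadPage/Thumbor rather than a standalone function:

```python
# Hypothetical sketch of the lookup order described above: prefer a
# high-quality IA page scan when an identifier is set on the Index: page,
# falling back to the page image rendered from a locally hosted PDF/DjVu.
def resolve_page_image(ia_identifier, local_file, page):
    """Return a (source, reference) pair for the requested page."""
    if ia_identifier:
        # IA original scans are conventionally numbered with four digits.
        return ("ia", f"{ia_identifier}_orig_{page:04d}.jp2")
    if local_file:
        # Fall back to a page extracted from the locally hosted file.
        return ("local", f"{local_file}?page={page}")
    return (None, None)

print(resolve_page_image("catalogoftitleen11118libr", "Catalog.pdf", 2))
```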

Archival links for the high quality page scans of a given work at IA are often of the form (example):
https://archive.org/download/catalogoftitleen11118libr/catalogoftitleen11118libr_orig_jp2.tar/catalogoftitleen11118libr_orig_jp2%2Fcatalogoftitleen11118libr_orig_0002.jp2

which looks like it could be generated in code, on a dynamic basis, assuming the relevant information can be obtained from a suitable page (such as a File: or Index:).
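For illustration, the example link above can be reproduced from just the identifier and a page number. This assumes the `_orig_jp2.tar` layout shown above, which may vary between IA items:

```python
# Sketch: derive a per-page JPEG2000 scan URL from an IA identifier and a
# page index, following the path layout of the example link above.
def ia_page_url(identifier, page):
    member = f"{identifier}_orig_jp2.tar"          # tar of original JP2 scans
    jp2 = f"{identifier}_orig_{page:04d}.jp2"      # zero-padded page file
    # The path inside the tar is URL-encoded ("/" becomes "%2F").
    return (f"https://archive.org/download/{identifier}/"
            f"{member}/{identifier}_orig_jp2%2F{jp2}")

print(ia_page_url("catalogoftitleen11118libr", 2))
```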

Having the highest quality scans available would also be useful for OCR purposes, where works do not necessarily contain a 'readable' text layer.

It should be noted that this is intended to be used only for works with reliable sourcing and licensing information, and which would otherwise be compatible if hosted locally on Commons/Wikisource.

An alternative approach would be to host the high quality scans on Commons directly, but that would need an automated task to identify and match existing uploaded DjVu and PDF files to their equivalent high quality scans at IA, and to ensure the correctly paired scan sets were matched up.
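Such a matching task could start from the file listing in an item's metadata (IA exposes this at `https://archive.org/metadata/<identifier>`). A minimal sketch of the selection step, operating on an already-fetched `files` list (the sample entries below are invented for illustration):

```python
# Sketch: given the "files" list from an IA metadata record, find the
# archive member holding the original JPEG2000 scans, if one exists.
def find_jp2_archive(files):
    """Return the name of an *_orig_jp2.tar or *_jp2.zip member, or None."""
    for f in files:
        name = f.get("name", "")
        if name.endswith("_orig_jp2.tar") or name.endswith("_jp2.zip"):
            return name
    return None

# Hypothetical sample of a metadata "files" list.
sample = [
    {"name": "catalogoftitleen11118libr.pdf", "format": "Text PDF"},
    {"name": "catalogoftitleen11118libr_orig_jp2.tar",
     "format": "Single Page Original JP2 Tar"},
]
print(find_jp2_archive(sample))
```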

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 3 2020, 9:04 AM
Xover added subscribers: Tpt, Xover. · Jul 5 2020, 10:29 AM

Wouldn't it be easier and more robust to simply grab all the images directly on Commons? If MW doesn't already support JPEG2000 (.jp2; and I haven't checked) then either adding native support or transcoding to plain JPEG should be within the realm of reason (there are at least two OSS libraries for that), and a heck of a lot more robust than trying to dynamically fetch it from IA. We can create ProofreadPage Index: pages from individual images on Commons now, and if there was value in it I'm sure we could cook up some more fancy support (an index referencing a Category on Commons; or a virtual pseudo-file that just references those images; or built-in support for generating DjVus on demand; or …).

@Tpt Just adding a ping in case any of ^^^^ these ideas take your fancy enough to put on the "once upon a time" wishlist.

If you want to set up a mechanism to pull HUGE quantities of JPEG/JPEG2000/TIFF scans from IA, I'm more than open to the suggestions you mention here. :)

My thought, in terms of doing it locally, was to retain any existing DjVu/PDF but grab the individual scans for the pages under that identifier into a category (based on the IA identifier or the original Commons file name).

Can Index pages be created for a Commons Category, without needing to specify the pages directly? (If not then that would be another "wishlist" item.)

This feels like something that should be implemented client side, probably as a JavaScript gadget, not in MediaWiki or Thumbor. Having Thumbor reach out to arbitrary third-party URLs would pose security problems, as well as probably copyright problems. Implementation of the feature as described wouldn't really make sense in Thumbor either.

Indeed adding support for JPEG2000 on Commons is going to be much easier. Have you tried uploading a PDF with JPEG2000 images inside? You can use https://pypi.org/project/img2pdf/ to make one.

kamholz added a comment. (Edited) · Thu, Aug 13, 8:08 PM

This is related to something PanLex is currently doing in a gadget I've recently ported from Palmleaf.org. There's a community in Bali that's been doing Balinese palm-leaf manuscript transcription there, and that project is in the process of being moved to Wikisource. The manuscript scans all come from IA. I've already batch uploaded them to Commons using PDFs from IA.

The PDFs are OK for general use but are not high enough quality for transcription work, which requires the highest resolution original. In practice both versions are needed, because the high-res version is too large to be the main version, especially on slower connections as in Bali. Our solution has been to use IIIF at an endpoint that IA provides at iiif.archivelab.org. When the user zooms in far enough on a palm-leaf image, our interface will download a high-res version of the zoomed in region from IA over IIIF and overlay it on the main (lower-res) version.
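The region request for that overlay follows the IIIF Image API URL pattern (`{identifier}/{region}/{size}/{rotation}/{quality}.{format}`). A hedged sketch of how such a URL could be built; the exact path prefix and identifier scheme used by iiif.archivelab.org are assumptions here, not verified against that service:

```python
# Sketch: build an IIIF Image API request for a high-res crop of a zoomed
# region. region is x,y,w,h in source-image pixels; size "{width}," scales
# to the given width preserving aspect ratio; 0 = no rotation.
def iiif_region_url(identifier, x, y, w, h, width=1000):
    return (f"https://iiif.archivelab.org/iiif/{identifier}/"
            f"{x},{y},{w},{h}/{width},/0/default.jpg")

print(iiif_region_url("exampleitem", 100, 200, 800, 600))
```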

If you want to try it out, you need to activate this user script by adding the following line to your common.js on wikisource.org (Multilingual Wikisource):

mw.loader.load('https://wikisource.org/w/index.php?title=User:Lautgesetz/common.js&action=raw&ctype=text/javascript');

Then, try it out by going to this page. Our interface should appear on top of the ProofreadPage interface and you can try zooming in and panning.

The main code for the gadget is here. In principle this functionality could be added to ProofreadPage, but porting it from React (or porting ProofreadPage to React) would be a fair amount of work.