
Add support for JP2 files
Open, Low, Public

Description

I was recently told that Commons might find it useful to allow uploads of jp2 files, because some GLAMs really like this format.

It would be very easy to add support for this to MW, since ImageMagick supports jp2. In fact, we already support this in a sense: PDFs often have embedded jp2 files, which we already decode.

The only real question here, I think, is how closely we have to validate that jp2 files are really jp2. The internet suggests that jpx (aka JPEG2000 part 2) files are sometimes mislabeled as JP2 (aka JPEG2000 part 1) files. JPX has less support than JP2 (although ImageMagick claims to support both, I'm unsure how true that claim is), and there is more uncertainty about the patent situation with JPX. This naturally raises the question of whether we should allow jpx and, if we don't, whether we need to verify that jp2 files are really jp2 and not jpx. If we do need to verify that, there's a Python script that does so at https://github.com/openpreserve/jpylyzer
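
As a rough illustration of the cheapest form of that check, the brand a file declares in its File Type box can be read without decoding the image. This is a minimal Python sketch under stated assumptions, not what jpylyzer does: `classify_jpeg2000` is a hypothetical helper, and it only trusts the brand the file declares for itself (`jp2 ` for part 1, `jpx ` for part 2), so it would not catch a file that declares `jp2 ` but uses part 2 features.

```python
import struct

# 12-byte JP2 signature box that opens both JP2 and JPX files.
JP2_SIGNATURE = bytes.fromhex("0000000C6A5020200D0A870A")


def classify_jpeg2000(header: bytes) -> str:
    """Return 'jp2', 'jpx' or 'unknown' from the declared File Type brand.

    `header` is the first few dozen bytes of the file. Only the declared
    brand is inspected; the codestream itself is not validated.
    """
    if not header.startswith(JP2_SIGNATURE) or len(header) < 24:
        return "unknown"
    # The File Type box follows the signature: length, type, then the brand.
    box_len, box_type = struct.unpack_from(">I4s", header, 12)
    if box_type != b"ftyp" or box_len < 16:
        return "unknown"
    brand = header[20:24]
    if brand == b"jp2 ":
        return "jp2"
    if brand == b"jpx ":
        return "jpx"
    return "unknown"
```

A file named `something.jp2` for which this returns `'jpx'` would be one of the mislabeled cases described above. Whether such files are rejected or accepted is still the policy question.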

Event Timeline

Something we should get clear: ingest only or hosting, raw serving or thumbnail conversion?

I believe thumbnail conversion is what is wanted (Commons doesn't like formats they can't actually preview for vandalism. Most (all?) browsers do not have native jpeg2000 support).

Ultimately, the question has to be put before the Commons community in the usual fashion.

dr0ptp4kt moved this task from Untriaged to Triaged on the Multimedia board.
dr0ptp4kt added a subscriber: dr0ptp4kt.

We're triaging bugs this morning. Not pressing at the moment, but noting we've seen it.


How can we heat this up? Would kicking off a Commons proposal make any difference?

I'm asking because I have been sitting on a mass upload of highly valuable images from the Biodiversity Heritage Library. I've left it so long that the infrastructure of Commons and BHL has changed, and I'll have to rewrite my scripts to get it working again. If I know jp2 is being implemented, I'll invest my volunteer time in sorting it out; otherwise, it may as well stay parked rather than accepting slightly inferior versions of images as jpegs on Commons, when users can find the better jp2 versions on the Internet Archive.

Yes. In addition, some books are getting bounced at IAuploader for this reason (until IA starts converting to DjVu again).

This has come up again at https://commons.wikimedia.org/wiki/Commons:Village_pump/Proposals#Commons_should_support_JP2_file_format

From the Thumbor perspective, this seems fairly simple as it is already supported by imagemagick.

This is absolutely critical for Wikisource because the millions of books on the Internet Archive have their original files encoded in jp2. Without this, we have to rely on overly compressed pdf or djvu files that are illegible at times and are too poor quality for the crop tool. Please make this happen.

https://www.loc.gov/preservation/digital/formats/fdd/fdd000143.shtml
Magic numbers: 00 00 00 0C 6A 50 20 20 0D 0A 87 0A 00 00 00 14 66 74 79 70 6A 70 32
Mime types: image/jp2, image/jpeg2000, image/jpeg2000-image, image/x-jpeg2000-image

And follow steps from here: https://www.mediawiki.org/wiki/Manual:Adding_support_for_new_filetypes#Support_for_uploads
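
To make the magic numbers and MIME types above concrete, here is a minimal Python sketch of content sniffing plus alias normalisation. It does not reproduce MediaWiki's own MIME detection. The names `sniff_jp2` and `normalise_mime` are hypothetical, and treating `image/jp2` as the canonical type is an assumption. The quoted prefix includes a File Type box length of `00 00 00 14`, so a file with a longer compatibility list would not match it exactly.

```python
from typing import Optional

# Magic prefix exactly as quoted above; whitespace in the hex string is ignored.
JP2_MAGIC = bytes.fromhex(
    "00 00 00 0C 6A 50 20 20 0D 0A 87 0A 00 00 00 14 66 74 79 70 6A 70 32"
)

# Assumption: image/jp2 is the canonical type, the others are aliases for it.
CANONICAL_MIME = "image/jp2"
MIME_ALIASES = {
    "image/jpeg2000",
    "image/jpeg2000-image",
    "image/x-jpeg2000-image",
}


def sniff_jp2(head: bytes) -> Optional[str]:
    """Return the canonical MIME type if `head` starts with the JP2 magic."""
    return CANONICAL_MIME if head.startswith(JP2_MAGIC) else None


def normalise_mime(mime: str) -> str:
    """Map the known JPEG2000 aliases onto the canonical MIME type."""
    mime = mime.strip().lower()
    return CANONICAL_MIME if mime in MIME_ALIASES else mime
```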

JP2 is apparently supported by PHP's media functions, so retrieving the bare minimum of metadata (width and height) won't be a problem either. More advanced metadata processing is probably more difficult, mostly because it's a less familiar format and because in theory a file can contain multiple images, but I guess that is not required for critical functionality.
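
For a sense of what "width and height" involves at the byte level, here is a minimal Python sketch. It is not the PHP route mentioned above. It assumes the JP2 box layout, in which a `jp2h` superbox contains an `ihdr` box whose payload starts with the height and then the width as 32-bit big-endian integers. `iter_boxes` and `jp2_dimensions` are hypothetical names, and the sketch ignores raw codestreams and files with multiple images.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, start: int, end: int) -> Iterator[Tuple[bytes, int, int]]:
    """Yield (type, payload_start, payload_end) for each box in data[start:end]."""
    pos = start
    while pos + 8 <= end:
        length, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if length == 1:  # a 64-bit extended length follows the type
            if pos + 16 > end:
                return
            (length,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif length == 0:  # the box runs to the end of the enclosing span
            length = end - pos
        if length < header or pos + length > end:
            return  # malformed or truncated box; stop walking
        yield box_type, pos + header, pos + length
        pos += length


def jp2_dimensions(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the ihdr box, or None if it is not found."""
    for box_type, start, end in iter_boxes(data, 0, len(data)):
        if box_type != b"jp2h":
            continue
        for inner_type, inner_start, inner_end in iter_boxes(data, start, end):
            if inner_type == b"ihdr" and inner_end - inner_start >= 8:
                # ihdr stores height first, then width.
                height, width = struct.unpack_from(">II", data, inner_start)
                return width, height
    return None
```

Because `jp2h` comes before the codestream box, passing only the first part of a file is usually enough for this lookup.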

I'd say take a good look at the WebPHandler and repeat that but for JP2.

Sounds like a grant-worthy project, easy to scope. If someone is interested in getting paid to build this, let me know and I can figure out who to pitch it to at the Foundation.

This is absolutely a grant-worthy project. Thank you for agreeing to pitch it. It's badly needed. Recently, at English Wikisource, one of the Administrators, Inductiveload, had to write a JS script to allow loading images from outside of Wikisource because the PDF images are illegible. https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Increasing_the_resolution_of_a_PDF_in_edit_mode

However, if we are to implement this, then there is more to do than just support JP2. I don't think any of it is too hard, but it will need to be done to make the JP2 support usable on Wikisource.

Subtask 1: Importer for IA zip files
Rationale: The Internet Archive packages books in zip files; it makes no sense to unzip the files and ask patrollers to review hundreds of pages that are all derived from the same source.
Further Rationale: Adding books is currently a nightmare requiring the use of Wikisource, a toolforge site, and Wikicommons. A better upload tool has been a long-standing community request.

  1. The tool should take an IA id or url and walk the user through the steps of adding a book. It should also have the potential for future expansion to support other sites. It should also have an advanced setting to allow the user to select the zip file, because some books have multiple zips, one of which may be better than the others in special cases.
  2. It should also grab relevant files such as IA_ID_bhlmets.xml, IA_ID_dc.xml, IA_ID_marc.xml, IA_ID_meta.xml, and IA_ID_toc.xml and parse them. It might also make sense to pull the Open Library record instead of the raw MARC from IA.
  3. Uploading the jp2 archive will take time, and the user should be notified when it is completed.
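
The file lookup in step 2 could start from something like the following minimal Python sketch. The file name pattern comes from the list above. The `https://archive.org/download/<id>/<file>` URL layout and the helper name `metadata_urls` are assumptions made for illustration, and not every item necessarily has all five files.

```python
from typing import Dict
from urllib.parse import quote

# Suffixes taken from the file names listed in step 2.
METADATA_SUFFIXES = ("bhlmets.xml", "dc.xml", "marc.xml", "meta.xml", "toc.xml")

# Assumption: files of an item are served under this base URL.
DOWNLOAD_BASE = "https://archive.org/download"


def metadata_urls(ia_id: str, base: str = DOWNLOAD_BASE) -> Dict[str, str]:
    """Map each expected metadata file name to a candidate download URL."""
    urls = {}
    for suffix in METADATA_SUFFIXES:
        name = f"{ia_id}_{suffix}"
        urls[name] = f"{base}/{quote(ia_id)}/{quote(name)}"
    return urls
```

A tool would still have to handle missing files, since the mapping only lists candidates.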

Subtask 2: The ability to load the images for a project on the Index ns from the zip file.
Rationale: If we are to upload the JP2 zip from IA there is a need to be able to use it.

Subtask 3: A bot to replace the PDF/DJVU files with the JP2 files
Rationale: Current projects are based on either a PDF or a DJVU file. We'll need a bot to automatically replace the images extracted from the PDF/DJVU with those from the JP2 archive.

Subtask 4: A tool to shift pages
Rationale: Switching from PDF/DJVU to the original JP2 files can cause the page scans to become out of sync with the transcribed text. A tool is needed to allow an administrator or other group to easily shift the transcribed text for an entire work or a selected range.
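
The core of such a shift can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the transcriptions are held in a plain mapping from page number to text, which is not how Wikisource stores them, and `shift_pages` is a hypothetical name.

```python
from typing import Dict


def shift_pages(pages: Dict[int, str], start: int, end: int, offset: int) -> Dict[int, str]:
    """Return a new mapping with pages start..end (inclusive) moved by offset.

    Raises ValueError if a moved page would get a number below 1 or would
    land on a page outside the range that already has text.
    """
    moved = {n + offset: text for n, text in pages.items() if start <= n <= end}
    kept = {n: text for n, text in pages.items() if not start <= n <= end}
    if any(n < 1 for n in moved):
        raise ValueError("shift would move a page before page 1")
    clashes = sorted(moved.keys() & kept.keys())
    if clashes:
        raise ValueError(f"shift would overwrite existing pages: {clashes}")
    return {**kept, **moved}
```

Refusing to overwrite existing pages is a deliberate choice in this sketch, so that a wrong offset fails loudly instead of silently destroying transcribed text.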

Change 671758 had a related patch set uploaded (by TheDJ; owner: TheDJ):
[mediawiki/core@master] Basic JPEG2000 handler

https://gerrit.wikimedia.org/r/671758

Change 671757 had a related patch set uploaded (by TheDJ; owner: TheDJ):
[mediawiki/core@master] JPEG2000 MIME fixes

https://gerrit.wikimedia.org/r/671757

Those are the basics. I'd say we can/should merge the MIME types stuff regardless. The JPEG2000 handler isn't so useful for WMF purposes right now, but it works if you just want to render thumbnails on your standard MediaWiki install.

Change 671757 merged by jenkins-bot:
[mediawiki/core@master] JPEG2000 MIME fixes

https://gerrit.wikimedia.org/r/671757

@Languageseeker it's just a very basic start but I have no real interest in pursuing any of the other parts of this myself. Most of your requests in T161934#6907102 should be filed as separate tickets in Phabricator as they do not directly relate to this particular issue, btw.

And I have significant concerns about uploading full JPEG2000 IA copies of books. The data usage seems excessive for the goals of Wikimedia and might pose significant problems. Wikimedia is not the Internet Archive, and its infrastructure in its current form is not suitable for including very high resolution scans of that many books, in my opinion. Similar to how geo data belongs in OpenStreetMap whenever possible, original scans of so many books should probably just remain in the Internet Archive, as they specialise in that, and competing with specialisation tends to be near impossible and doesn't scale.

@TheDJ I understand that it’s a small step forward, but it’s more progress than we’ve had in years. So, it’s a small victory, but one worth celebrating nonetheless.

I’ll split my comments off into other tickets.

As for your concerns about duplicating the IA, I don’t think anyone is truly trying to do that. Our goal is to take the books, correct the machine-generated OCR, extract the images, format the text, and generate an ebook that matches the source. For that, we need images that are readable and have a high enough resolution for cropping. The DJVU and PDF files are insufficient for image extraction and sometimes even for text reading. Furthermore, sometimes the generated PDF/DJVU files are missing pages present in the zip archive. Although importing the zip files will increase the storage size by a factor of approximately ten, it will not significantly increase the network bandwidth because such images will rarely be accessed. Allowing direct access to the JP2 files will have a significant positive impact on the Wikisource community without significantly taxing the resources of Wikimedia.