
Use Commons (individual files?) as a source for building DjVu files
Open, Normal · Public · 20 Story Points

Description

Currently, importing DjVu files used on Wikisource into Commons requires a two-step process: going through Internet Archive, and then uploading the file with https://tools.wmflabs.org/ia-upload/commons/init

It would be easier and more reliable if IA-Upload produced DjVu files directly, from a set of individual files on Commons or an uploaded zip containing images.

Event Timeline

Yann created this task. Mar 26 2017, 2:19 PM
Restricted Application added a project: Internet-Archive. Mar 26 2017, 2:19 PM
Restricted Application added a subscriber: Aklapper.
Yann added a subscriber: Samwilson. Mar 26 2017, 2:25 PM

Do you mean that IA Upload would handle the upload to IA as well? So someone would go to IA Upload, upload their files, a new item would be created on IA, and then IA Upload would create the DjVu from that item and its OCR and upload the result to Commons?

It's been talked about a little bit, I think. The other way around would be to upload the files to Commons, and then have IA Upload take them from there, send them through IA, and put the result up on Commons. Then the user would never need to leave Commons.

Samwilson moved this task from Backlog to IA Upload on the Wikisource board. Mar 26 2017, 11:22 PM
Yann added a comment. Mar 28 2017, 5:52 PM

No, I mean to bypass IA completely. The tool should create the DjVu and the OCR and upload the result to Commons. This would also improve reliability, as IA is a weak point in the process.

I think the issues with that could be:

  • the OCR from IA is of better quality than we can do on Labs (I think I'm right in saying they use ABBYY FineReader, whereas we use Tesseract or the Google Cloud Vision API);
  • and we want to retain the original full-res scans.

One workflow could be to upload all individual files to Commons, and then derive a DjVu from them (based on a category perhaps, or a file-naming standard). Although, once we have all the files on Commons, the Index file on Wikisource can just be constructed from them instead; no need for the DjVu.
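To make the category-based variant a bit more concrete, here is a rough sketch of how a tool might list the page files in a Commons category and put them in page order. It only builds the MediaWiki API request URL (using the standard `list=categorymembers` query) and sorts titles by the last number in the file name. The category name and the "page number is the last number in the title" convention are assumptions for illustration, not an agreed standard.

```python
import re
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_files_url(category, limit=500):
    """Build a MediaWiki API URL that lists the files in a Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",      # only files, not subcategories or pages
        "cmlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def page_order(titles):
    """Sort file titles by the last number they contain (assumed page number)."""
    def key(title):
        numbers = re.findall(r"\d+", title)
        # Titles without any number sort first; ties fall back to the title.
        return (int(numbers[-1]) if numbers else -1, title)
    return sorted(titles, key=key)


# Hypothetical category and file names, for illustration only.
url = category_files_url("Example book scans")
ordered = page_order(["File:Book p10.jpg", "File:Book p2.jpg", "File:Book p1.jpg"])
```

A real tool would also need to follow the API's continuation parameters for categories with more files than one request returns.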

I think @Aubrey would have better ideas than me of the advantages of sticking with the IA.

Samwilson triaged this task as Normal priority. Mar 29 2017, 1:24 AM

I like the idea of IA because they have quite a simple upload tool, and it's the best and greatest digital library in the world. It's fairly easy to use, it has a great book reader and good navigability, and it offers good OCR and a lot of derived files. It also has an easy API for download and upload. It's a "library platform", and unfortunately Commons is not. You can't really find stuff in Commons, nor read it (have you tried reading a document on it?)

This is why I preach IA to librarians and archivists. It's important on its own, even if we then don't upload that book on Wikisource. If it ends right there, it's still better than not having a public digitization at all. Workflow is really, really important for real outreach and working with GLAMs.

Samwilson edited projects, added IA Upload; removed Wikisource. May 25 2017, 11:29 AM
amritsreekumar closed this task as Resolved. Jul 19 2017, 12:55 PM
Samwilson reopened this task as Open. Jul 19 2017, 10:38 PM
Samwilson added a subscriber: amritsreekumar.

@amritsreekumar I think we only use 'resolved' for things that have been actually fixed. If we're not going to do this feature it should be 'declined'.

@Yann are you okay with this being closed? I think a build-DjVu-from-Commons-files feature would be accepted if someone were to write it.

Yann added a comment. Jul 20 2017, 5:16 PM

The issue still exists. I don't think it should be closed.

Samwilson renamed this task from Creating DjVu files with IA-Upload to Use Commons (individual files?) as a source for building DjVu files. Sep 12 2017, 12:14 PM
Samwilson updated the task description.
Samwilson set the point value for this task to 20.

I've changed the title of this task. Do you think that sounds okay? It's not a massive task, perhaps, but it is a pretty big new feature, I think. It probably needs more discussion about exactly what the goal is here.

Elitre added a subscriber: Elitre. Nov 21 2017, 8:53 AM
Restricted Application added a project: Community-Tech. Nov 21 2017, 8:53 AM
Ankry added a subscriber: Ankry. Dec 7 2017, 8:43 AM

I think that if such a tool is created, it should also be able to support other digital libraries that are JPG-oriented, like CBN Polona (https://polona.pl). Some kind of per-library plugin support?

That's a great idea. A sort of combined BUB, that can use any digital library as a source, even Commons.

Xover added a subscriber: Xover. Mar 28 2019, 7:45 AM

From a technical perspective, the only clear advantage to the status quo would seem to be the OCR engine (ABBYY) that IA has access to but we (currently) do not.

But if we are thinking plugins/modular and are sufficiently ambitious, a theoretical new tool would support multiple OCR engines: Tesseract 3 and 4, the latter with multiple modes, Google Vision API, and possibly even ABBYY Cloud OCR (if they'd be interested in donating access in future as Google does now).

The Wikisources would benefit from the ability to pick the best OCR engine / mode on a per-work and a per-page basis. For example, Tesseract 4 in default mode does the best overall job for the work as a whole, but for certain pages that use diacritics (éàïç) and ligatures (æœ) you need Google Vision or ABBYY. Or Google fails badly at multi-column text, but Tesseract handles it.
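As a rough illustration of the per-work / per-page selection idea, a modular tool could keep a registry of OCR engines and let a per-page override take precedence over the default chosen for the work. This is only a sketch: the engine names are placeholders and the engines themselves are stubs, not real bindings to Tesseract, Google Vision, or ABBYY.

```python
from typing import Callable, Dict, List, Optional

# An OCR engine is modelled as any callable taking image bytes and returning text.
OcrEngine = Callable[[bytes], str]


def ocr_work(
    pages: List[bytes],
    engines: Dict[str, OcrEngine],
    default: str,
    overrides: Optional[Dict[int, str]] = None,
) -> List[str]:
    """OCR every page with the work's default engine unless a page has an override.

    `overrides` maps 1-based page numbers to engine names.
    """
    overrides = overrides or {}
    texts = []
    for number, image in enumerate(pages, start=1):
        engine_name = overrides.get(number, default)
        texts.append(engines[engine_name](image))
    return texts


def stub_engine(name: str) -> OcrEngine:
    """Hypothetical stand-in for a real engine; it just reports which engine ran."""
    return lambda image: f"{name}:{len(image)}"


# Placeholder engine names, for illustration only.
ENGINES = {
    "tesseract4": stub_engine("tesseract4"),
    "google_vision": stub_engine("google_vision"),
}

# Page 2 is assumed to need a different engine than the rest of the work.
result = ocr_work([b"a", b"bb", b"ccc"], ENGINES, default="tesseract4",
                  overrides={2: "google_vision"})
```

Keeping the engines behind a plain callable interface means a new engine (or a new mode of an existing one) is just another registry entry.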

There are also some downsides to relying on IA; mainly, of course, being so dependent on a third party. But they also rely fundamentally on automated processes (where we rely on crowdsourced manual labour), and those processes have a worryingly high failure rate and operate on very low-quality, sparse metadata. And since we are currently reliant on IA, whenever they fail in some way it becomes really hard to fix (I just spent three days coding and compiling utilities/libraries because a .djvu had two bad pages).

From that perspective, the Wikimedia projects really should be self-sufficient, and then simply integrate loosely with IA, OpenLibrary, and other such sources of scans (inputs) and repositories of digital media (outputs).

And a first cut at that, without having to replicate IA's whole workflow and APIs etc., would be the ability to generate DjVus ourselves based on a bunch of JPEGs that happen to live on either IA or Commons or a .zip on a user's drive.
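For what it's worth, the core of that first cut could be fairly small if the tool shells out to DjVuLibre. The sketch below assumes the DjVuLibre command-line tools are installed: `c44` encodes each JPEG as a single-page DjVu and `djvm -c` bundles the pages into one document. The function names and file paths are made up for illustration, and the sketch does not add an OCR text layer.

```python
import subprocess
from pathlib import Path


def djvu_commands(jpegs, output, workdir):
    """Return the DjVuLibre commands that turn a list of JPEGs into one DjVu file."""
    workdir = Path(workdir)
    # One intermediate single-page DjVu per input image, in the given page order.
    pages = [workdir / (Path(jpeg).stem + ".djvu") for jpeg in jpegs]
    commands = [["c44", str(jpeg), str(page)] for jpeg, page in zip(jpegs, pages)]
    # Bundle all single-page files into one multi-page document.
    commands.append(["djvm", "-c", str(output)] + [str(page) for page in pages])
    return commands


def build_djvu(jpegs, output, workdir):
    """Run the commands; raises CalledProcessError if any tool fails."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    for command in djvu_commands(jpegs, output, workdir):
        subprocess.run(command, check=True)


# Hypothetical file names, for illustration only (nothing is executed here).
commands = djvu_commands(["p1.jpg", "p2.jpg"], "book.djvu", "tmp")
```

Separating command construction from execution keeps the page ordering easy to check, and the same list of JPEGs could come from IA, Commons, or an unpacked .zip.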