Use Commons (individual files?) as a source for building DjVu files
Open, Medium · Public · 20 Estimated Story Points

Description

Importing DjVu files used in Wikisource into Commons currently requires a two-step process: going through Internet Archive, and then uploading the file with https://tools.wmflabs.org/ia-upload/commons/init

It would be easier and more reliable if IA-Upload produced DjVu files directly, from a set of individual files on Commons or an uploaded zip containing images.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Do you mean that IA Upload would handle the upload to IA as well? So someone would go to IA Upload, upload their files, a new item would be created on IA, and then IA Upload would create the DjVu from that item and its OCR and upload the result to Commons?

It's been talked about a little bit, I think. The other way around would be to upload the files to Commons, and then have IA Upload take them from there, send them through IA, and put the result up on Commons. Then the user would never need to leave Commons.

No, I mean to bypass IA completely. The tool should create the DjVu and the OCR and upload the result to Commons. This would also improve reliability, as IA is a weak point in the process.

I think the issues with that could be:

  • the OCR from IA is of better quality than we can do on Labs (I think I'm right in saying they use ABBYY FineReader, we use Tesseract or Google Cloud Vision API);
  • and we want to retain the original full-res scans.

One workflow could be to upload all individual files to Commons, and then derive a DjVu from them (based on a category perhaps, or file naming standard). Although, once we have all files on Commons, the Index file on Wikisource can just be constructed with them instead; no need for the DjVu.
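The "category or file naming standard" idea above could look roughly like the following. This is only a sketch: the helper names are made up for illustration, and it assumes page files carry their page number in the file name. The URL uses the standard MediaWiki `list=categorymembers` query; nothing here is existing IA Upload code.

```python
import re
from urllib.parse import urlencode

COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def category_query_url(category, limit=500):
    """Build a MediaWiki API URL listing the files in a Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": limit,
        "format": "json",
    }
    return f"{COMMONS_API}?{urlencode(params)}"


def natural_key(title):
    """Sort key that compares embedded numbers numerically (page 2 before page 10)."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", title)]


def order_pages(titles):
    """Return file titles in reading order, assuming page numbers are in the names."""
    return sorted(titles, key=natural_key)
```

A category with more than 500 files would need the API's continuation mechanism, which is left out here.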

I think @Aubrey would have better ideas than me of the advantages of sticking with the IA.

Samwilson triaged this task as Medium priority. · Mar 29 2017, 1:24 AM

I like the idea of IA because they have quite a simple upload tool, and it's the best and greatest digital library in the world. It's fairly easy to use, it has a great book-reader and good navigability, and it offers good OCR and a lot of derived files. It also has an easy API for download and upload. It's a "library-platform", and unfortunately Commons is not. You can't really find stuff in Commons, nor read it (have you tried reading a document on it?)

This is why I preach IA to librarians and archivists. It's important on its own, even if we then don't upload that book to Wikisource. If it ends right there, it's still better than not having a public digitization at all. Workflow is really, really important for real outreach and working with GLAMs.

Samwilson added a subscriber: amritsreekumar.

@amritsreekumar I think we only use 'resolved' for things that have been actually fixed. If we're not going to do this feature it should be 'declined'.

@Yann are you okay with this being closed? I think a build-DjVu-from-Commons file feature would be accepted if someone were to write it.

The issue still exists. I don't think it should be closed.

Samwilson renamed this task from Creating DjVu files with IA-Upload to Use Commons (individual files?) as a source for building DjVu files. · Sep 12 2017, 12:14 PM
Samwilson updated the task description.
Samwilson set the point value for this task to 20.

I've changed the title of this task — do you think that sounds okay? It's not a massive task, perhaps, but a pretty big new feature I think. Probably needs more discussion about quite what the goal is here.

I think that if such a tool is created, it should also be able to support other digital libraries that are JPG-oriented, like CBN Polona (https://polona.pl). Kind of per-library plugin support?
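One possible shape for per-library plugin support is a small interface that each source implements, plus a registry keyed by name. Everything below is hypothetical (the class names, the registry, and the `library.example` URL pattern); it only illustrates the idea and does not describe any real library's API.

```python
from abc import ABC, abstractmethod

SOURCES = {}  # hypothetical registry: plugin name -> plugin class


class ScanSource(ABC):
    """Hypothetical interface that every library plugin would implement."""

    name = ""

    @abstractmethod
    def page_image_urls(self, item_id):
        """Return the page image URLs of an item, in reading order."""


def register(source_cls):
    """Class decorator that adds a plugin to the registry under its name."""
    SOURCES[source_cls.name] = source_cls
    return source_cls


@register
class ExampleLibrarySource(ScanSource):
    """Placeholder plugin; a real one would query the library's own API."""

    name = "example"

    def page_image_urls(self, item_id):
        # Made-up URL pattern and fixed page count, for illustration only.
        return [f"https://library.example/{item_id}/page-{n}.jpg"
                for n in range(1, 4)]


def get_source(name):
    """Instantiate the plugin registered under the given name."""
    try:
        return SOURCES[name]()
    except KeyError:
        raise ValueError(f"no source plugin registered for {name!r}") from None
```

With this shape, Commons, IA, or Polona would each be one more `ScanSource` subclass, and the DjVu-building code would only ever see a list of image URLs.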

That's a great idea. A sort of combined BUB, that can use any digital library as a source, even Commons.

From a technical perspective, the only clear advantage to the status quo would seem to be the OCR engine (ABBYY) that IA has access to but we (currently) do not.

But if we are thinking plugins/modular and are sufficiently ambitious, a theoretical new tool would support multiple OCR engines: Tesseract 3 and 4, the latter with multiple modes, Google Vision API, and possibly even ABBYY Cloud OCR (if they'd be interested in donating access in future as Google does now).

The Wikisources would benefit from the ability to pick the best OCR engine / mode on a per-page and a per-work basis. For example, Tesseract 4 in default mode does the best overall job for the work as a whole, but for certain pages that use diacritics (éàï) and ligatures (æœç) you need Google Vision or ABBYY. Or Google fails badly at multi-column text, but Tesseract handles it.
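The per-work default with per-page overrides described above can be modelled very simply. This is a minimal sketch; `OcrPlan` and the engine labels are illustrative names, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class OcrPlan:
    """Hypothetical per-work OCR plan: one default engine plus per-page overrides."""

    default_engine: str
    page_overrides: dict = field(default_factory=dict)  # page number -> engine label

    def engine_for(self, page_number):
        """Return the override for this page, or the work-level default."""
        return self.page_overrides.get(page_number, self.default_engine)

    def pages_by_engine(self, page_count):
        """Group pages 1..page_count by engine so each engine can run as one batch."""
        grouped = {}
        for page in range(1, page_count + 1):
            grouped.setdefault(self.engine_for(page), []).append(page)
        return grouped
```

For example, `OcrPlan("tesseract4", {3: "google-vision"})` would send page 3 to Google Vision and every other page to Tesseract 4.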

There are also some downsides to relying on IA; mainly, of course, being so reliant on a third party. But they also rely fundamentally on automated processes (where we rely on crowdsourced manual labour), and those processes have a worryingly high failure rate and operate on very low quality and sparse metadata. And since we currently are reliant on IA, whenever they fail in some way it becomes really hard to fix it (I just spent three days coding and compiling utilities/libraries because a .djvu had two bad pages).

From that perspective, the Wikimedia projects really should be self-sufficient, and then simply loosely integrate with IA, OpenLibrary, and other such sources of scans (inputs) and repositories of digital media (outputs).

And a first cut at that, without having to replicate IA's whole workflow and APIs etc., would be the ability to generate DjVus ourselves based on a bunch of JPEGs that happen to live on either IA or Commons or a .zip on a user's drive.
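As a rough sketch of that first cut, the DjVuLibre command-line tools can already do the core conversion: `c44` encodes a JPEG as a single-page DjVu, and `djvm -c` bundles pages into one document. The helper below only builds the command lists; the function name, the `pages` working directory, and the page file naming are assumptions made for illustration.

```python
def djvu_build_commands(jpeg_paths, output_path, work_dir="pages"):
    """Return DjVuLibre commands that turn ordered JPEGs into one bundled DjVu.

    Nothing is executed here; the caller decides how to run the commands.
    """
    commands = []
    page_files = []
    for index, jpeg in enumerate(jpeg_paths, start=1):
        # Hypothetical naming scheme: one single-page DjVu per input image.
        page = f"{work_dir}/page-{index:04d}.djvu"
        commands.append(["c44", str(jpeg), page])
        page_files.append(page)
    # djvm -c creates a bundled multi-page document from the page files.
    commands.append(["djvm", "-c", str(output_path), *page_files])
    return commands
```

Each command could then be run with `subprocess.run(cmd, check=True)` once the working directory exists. This sketch produces image-only pages; adding an OCR text layer would be a separate step.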

As you may be aware, Internet Archive completed their transition from ABBYY OCR to Tesseract OCR in December of 2020, so since about 2021, the above comments are no longer really true. That said, it would be nice if we could get ABBYY to donate ABBYY Cloud OCR and we could run that on IA Upload or the like.

It makes sense to have separate tools for DjVu creation and for actual fetching from IA and uploading to Commons. Currently IA Upload uses the phetools/pdf_to_djvu service for IA PDF to DjVu conversion (but the API is not particularly nice). I can see how a similar, more generalized DjVu/OCR service would be useful; IA Upload could then focus on handling IA using this service and uploading to Commons. Others could also use the service with other sources besides IA.