Use Commons (individual files?) as a source for building DjVu files
Open, Medium · Public · 20 Estimated Story Points

Assigned To

None

Authored By

	Yann
	Mar 26 2017, 2:19 PM

Description

Currently, importing DjVu files used in Wikisource into Commons requires a two-step process: going through Internet Archive, and then uploading the file with https://tools.wmflabs.org/ia-upload/commons/init

It would be easier and more reliable if IA-Upload would produce DjVu files directly, from a set of individual files on Commons or an uploaded zip containing images.

Related Objects

Mentioned In: T200871: More improvements to ProofreadPage Extension and Wikisource
T73989: Installation of pdf2djvu
T161776: GSoC Proposal: Improvements to ProofreadPage Extension and Wikisource
T128840: Improvements to ProofreadPage Extension and Wikisource

Event Timeline

Yann created this task. Mar 26 2017, 2:19 PM

Restricted Application added a project: Internet-Archive. Mar 26 2017, 2:19 PM

Restricted Application added a subscriber: Aklapper.

Yann mentioned this in T128840: Improvements to ProofreadPage Extension and Wikisource. Mar 26 2017, 2:19 PM

Yann added a subscriber: Samwilson. Mar 26 2017, 2:25 PM

Do you mean that IA Upload would handle the upload to IA as well? So someone would go to IA Upload, upload their files, a new item would be created on IA, and then IA Upload would create the DjVu from that item and its OCR and upload the result to Commons?

It's been talked about a little bit, I think. The other way around would be to upload the files to Commons, and then have IA Upload take them from there, send them through IA, and put the result up on Commons. Then the user would never need to leave Commons.

Samwilson moved this task from Backlog to IA Upload on the All-and-every-Wikisource board. Mar 26 2017, 11:22 PM

No, I mean to bypass IA completely. The tool should create the DjVu and the OCR and upload the result to Commons. This would also improve reliability, as IA is a weak point in the process.

I think the issues with that could be:

the OCR from IA is of better quality than we can do on Labs (I think I'm right in saying they use ABBYY FineReader, we use Tesseract or Google Cloud Vision API);
and we want to retain the original full-res scans.

One workflow could be to upload all individual files to Commons, and then derive a DjVu from them (based on a category perhaps, or file naming standard). Although, once we have all files on Commons, the Index file on Wikisource can just be constructed with them instead; no need for the DjVu.
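A minimal sketch of the "derive from a category" idea, assuming the page images sit in one Commons category and that their file names sort into page order. The category name and file titles below are hypothetical; the parameters are those of the standard MediaWiki `list=categorymembers` query, and no request is actually sent here.

```python
import re


def category_query_params(category, limit=500):
    """Build MediaWiki API parameters listing the files in one category.

    The caller would send these to https://commons.wikimedia.org/w/api.php
    with any HTTP client; this sketch only builds the parameters.
    """
    return {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "file",
        "cmlimit": str(limit),
        "format": "json",
    }


def natural_key(title):
    """Sort key that puts 'page 10' after 'page 9' instead of after 'page 1'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", title)]


def order_pages(titles):
    """Return file titles in page order, assuming a numbered naming standard."""
    return sorted(titles, key=natural_key)


# Hypothetical titles, as they might come back from the category query.
titles = ["File:Book page 10.jpg", "File:Book page 2.jpg", "File:Book page 1.jpg"]
ordered = order_pages(titles)
```

A plain alphabetical sort would put page 10 before page 2, which is why the naming standard and the sort key matter if the category is the only source of ordering.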

I think @Aubrey would have better ideas than me of the advantages of sticking with the IA.

Samwilson triaged this task as Medium priority. Mar 29 2017, 1:24 AM

I like the idea of IA because they have quite a simple upload tool, and it's the best and greatest digital library in the world. It's kinda easy to use, it has a great book reader, it's quite easy to navigate, and it offers good OCR and a lot of derived files. It also has an easy API for download and upload. It's a "library platform", and unfortunately Commons is not. You can't really find stuff on Commons, nor read it (have you tried reading a document on it?)

This is why I preach IA to librarians and archivists. It's important on its own, even if we then don't upload that book to Wikisource. If it ends right there, it's still better than not having a public digitization at all. Workflow is really, really important for real outreach and working with GLAMs.

Scott1006 mentioned this in T161776: GSoC Proposal: Improvements to ProofreadPage Extension and Wikisource. Mar 30 2017, 3:43 AM

Samwilson edited projects, added IA Upload; removed All-and-every-Wikisource. May 25 2017, 11:29 AM

amritsreekumar closed this task as Resolved. Jul 19 2017, 12:55 PM

@amritsreekumar I think we only use 'resolved' for things that have been actually fixed. If we're not going to do this feature it should be 'declined'.

@Yann are you okay with this being closed? I think a build-DjVu-from-Commons file feature would be accepted if someone were to write it.

If no code was fixed it's not resolved. :) See https://www.mediawiki.org/wiki/Bug_management/Bug_report_life_cycle

The issue still exists. I don't think it should be closed.

zhuyifei1999 moved this task from Incoming to Third-party software on the Commons board. Jul 25 2017, 9:41 AM

I've changed the title of this task — do you think that sounds okay? It's not a massive task, perhaps, but a pretty big new feature I think. Probably needs more discussion about quite what the goal is here.

Samwilson mentioned this in T73989: Installation of pdf2djvu. Nov 21 2017, 1:16 AM

Elitre subscribed. Nov 21 2017, 8:53 AM

Restricted Application added a project: Community-Tech. Nov 21 2017, 8:53 AM

Samwilson edited projects, added All-and-every-Wikisource; removed Internet-Archive. Dec 7 2017, 4:47 AM

I think that if such a tool is created, it should also be able to support other digital libraries that are JPG-oriented, like CBN Polona (https://polona.pl). Some kind of per-library plugin support?

That's a great idea. A sort of combined BUB, that can use any digital library as a source, even Commons.
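One way the per-library plugin idea could look is sketched below, purely as an illustration. Every name in it (`ScanSource`, `SOURCES`, `register`, `get_source`, `LocalListSource`) is hypothetical and not part of IA Upload or any existing tool; a real Commons, IA or Polona plugin would fetch its page list over the network instead of from a dict.

```python
from abc import ABC, abstractmethod

# Registry of available source plugins, keyed by a short name.
SOURCES = {}


class ScanSource(ABC):
    """Interface each digital-library plugin would implement (hypothetical)."""

    name = ""

    @abstractmethod
    def list_pages(self, item_id):
        """Return the page image identifiers for one work, in page order."""


def register(cls):
    """Class decorator that adds a plugin to the registry under its name."""
    SOURCES[cls.name] = cls
    return cls


@register
class LocalListSource(ScanSource):
    """Stand-in plugin backed by an in-memory dict, for illustration only."""

    name = "local"

    def __init__(self, items):
        self.items = items

    def list_pages(self, item_id):
        return sorted(self.items.get(item_id, []))


def get_source(name, *args, **kwargs):
    """Look up a plugin by name and instantiate it."""
    if name not in SOURCES:
        raise ValueError(f"unknown scan source: {name}")
    return SOURCES[name](*args, **kwargs)


source = get_source("local", {"book": ["p2.jpg", "p1.jpg"]})
pages = source.list_pages("book")
```

With this shape, the DjVu-building code only ever sees a list of page images, so adding another library would mean adding one more registered class.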

TBolliger removed a project: Community-Tech. Mar 27 2018, 12:38 AM

srishakatux mentioned this in T200871: More improvements to ProofreadPage Extension and Wikisource. Aug 1 2018, 4:53 AM

From a technical perspective, the only clear advantage to the status quo would seem to be the OCR engine (ABBYY) that IA has access to but we (currently) do not.

But if we are thinking plugins/modular and are sufficiently ambitious, a theoretical new tool would support multiple OCR engines: Tesseract 3 and 4, the latter with multiple modes, Google Vision API, and possibly even ABBYY Cloud OCR (if they'd be interested in donating access in future as Google does now).

The Wikisources would benefit from the ability to pick the best OCR engine / mode on a per-page and a per-work basis. For example, Tesseract 4 in default mode does the best overall job for the work as a whole, but for certain pages that use diacritics (éàï) and ligatures (æœç) you need Google Vision or ABBYY. Or Google fails badly at multi-column text, but Tesseract handles it.
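As a rough sketch of per-work and per-page engine selection: one default engine for the whole work, plus overrides for individual pages. The class name `OcrPlan` and the engine labels are hypothetical placeholders, not identifiers from any existing tool, and the actual OCR calls are left out.

```python
from dataclasses import dataclass, field


@dataclass
class OcrPlan:
    """Per-work default OCR engine with optional per-page overrides."""

    default_engine: str
    page_overrides: dict = field(default_factory=dict)

    def engine_for(self, page):
        """Return the engine to use for one (1-based) page number."""
        return self.page_overrides.get(page, self.default_engine)

    def pages_by_engine(self, page_count):
        """Group page numbers by engine so each engine can run in one batch."""
        groups = {}
        for page in range(1, page_count + 1):
            groups.setdefault(self.engine_for(page), []).append(page)
        return groups


# Hypothetical work: page 3 is heavy on diacritics, so it gets another engine.
plan = OcrPlan("tesseract4", {3: "google-vision"})
batches = plan.pages_by_engine(4)
```

Grouping by engine keeps the number of calls to each backend small, which would matter for engines with donated or metered access.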

There are also some downsides to relying on IA; mainly, of course, being so reliant on a third party. But they also rely fundamentally on automated processes (where we rely on crowdsourced manual labour), and those processes have a worryingly high failure rate and operate on very low-quality, sparse metadata. And since we are currently reliant on IA, whenever they fail in some way it becomes really hard to fix (I just spent three days coding and compiling utilities/libraries because a .djvu had two bad pages).

In that perspective, the Wikimedia projects really should be self-sufficient, and then simply loosely integrate with IA, OpenLibrary, and other such sources of scans (inputs) and repositories of digital media (outputs).

And a first cut at that, without having to replicate IA's whole workflow and APIs etc., would be the ability to generate DjVus ourselves based on a bunch of JPEGs that happen to live on either IA or Commons or a .zip on a user's drive.
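A sketch of what that first cut might involve, assuming the DjVuLibre command-line tools are available: `c44` encodes each JPEG as a single-page DjVu, and `djvm -c` bundles the pages into one document. The function name, the working directory, and the file names are hypothetical, and the commands are only built here, not executed.

```python
def djvu_build_commands(jpeg_paths, output, workdir="pages"):
    """Return the DjVuLibre commands that would turn JPEGs into one DjVu.

    Each command is an argument list suitable for subprocess.run().
    The workdir is assumed to exist already when the commands are run.
    """
    commands = []
    page_files = []
    for index, jpeg in enumerate(jpeg_paths, start=1):
        page = f"{workdir}/page_{index:04d}.djvu"
        # c44 encodes one image as a single-page (IW44) DjVu file.
        commands.append(["c44", str(jpeg), page])
        page_files.append(page)
    # djvm -c creates a bundled multi-page document from the page files.
    commands.append(["djvm", "-c", str(output), *page_files])
    return commands


commands = djvu_build_commands(["a.jpg", "b.jpg"], "book.djvu")
```

This covers the images only. Adding an OCR text layer would be a further step (for example with `djvused`), and which encoder settings suit book scans is left open here.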

In T161456#3139116, @Samwilson wrote:

I think the issues with that could be:

the OCR from IA is of better quality than we can do on Labs (I think I'm right in saying they use ABBYY FineReader, we use Tesseract or Google Cloud Vision API);

In T161456#5064849, @Xover wrote:

From a technical perspective, the only clear advantage to the status quo would seem to be the OCR engine (ABBYY) that IA has access to but we (currently) do not.

But if we are thinking plugins/modular and are sufficiently ambitious, a theoretical new tool would support multiple OCR engines: Tesseract 3 and 4, the latter with multiple modes, Google Vision API, and possibly even ABBYY Cloud OCR (if they'd be interested in donating access in future as Google does now).

As you may be aware, Internet Archive completed their transition from ABBYY OCR to Tesseract OCR in December of 2020, so since about 2021, the above comments are no longer really true. That said, it would be nice if we could get ABBYY to donate ABBYY Cloud OCR and we could run that on IA Upload or the like.

It makes sense to have separate tools for DjVu creation and for the actual fetching from IA and uploading to Commons. Currently IA Upload uses the phetools/pdf_to_djvu service for IA PDF-to-DjVu conversion (but the API is not particularly nice). I can see how a similar, more generalized DjVu/OCR service would be useful: IA Upload could focus on handling IA using this service and uploading to Commons. Then others could also use the service with other sources besides IA.
