
Batch Upload for The Strand
Closed, Resolved (Public)

Description

Could you batch up the remaining volumes of The Strand so that we can run the Sherlock Holmes for October MC and have all the volumes on WS? Much appreciated.

Event Timeline


Forgot the vollist in the previous version.

RhinosF1 subscribed.

@Aklapper: This isn't a server side upload as described there.

It's a Wikisource book automatic upload, I believe. Not entirely sure where it was decided or why Phab is best for this.

A task like this can be a message on my talk page at enWS, but I don't mind it here at all (it's a task to track after all, and Phab is a good place for the CSV file anyway). ¯\_(ツ)_/¯

Inductiveload triaged this task as Medium priority.

Here are the remaining volumes for The Strand from the IA. I already uploaded them as PDF before I noticed that the files are missing an OCR layer. Would it be possible to create an OCRed DJVU with uncompressed images so that they can easily be cropped?

I also noticed that Volume 20 has a better source on IA. Would it be possible to replace the existing DJVU?

Would it be possible to create an OCRed DJVU with uncompressed images so that they can easily be cropped?

@Languageseeker No, DJVU doesn't support lossless embedding of JP2 files (they'll be re-encoded with a related, but not identical, wavelet compression scheme). Or at least, if they can be, I don't have a tool to do that.

As always, I recommend https://en.wikisource.org/wiki/User:Inductiveload/jump_to_file to locate a high-res file (it's actually going to give the JPG, but that's nearly always good enough; see T290904 for linking JP2s) and then https://ws-image-uploader.toolforge.org/ for uploading.

Would it be possible to replace the existing DJVU?

It is quite possible, but do please try to tell me this before I generate a DJVU from Hathi because it does take some time and effort.

Of course, take your time. This seems far more complicated than I envisioned. Thank you for doing this.

As for creating the DJVU, I think that you're right. The JP2 will need to be re-encoded. My only concern is that it can be done at high resolution so that the images can be cropped from the DJVU rather than having to go back to the raw JP2. Trying to avoid the potato quality of most IA DJVUs.

I'm sorry about Volume 20. I know that it takes a long time and I feel really bad about wasting yours. I accidentally overlooked that Volume 20 was on IA in a much better quality version. My apologies.

My only concern is that it can be done at high resolution so that the images can be cropped from the DJVU

I don't think this is a particularly practical goal. It'd be much better to just use the upstream files for image extraction (and a more consistent workflow, since even if these DjVus were GB-sized lossless containers, 99% of them are not).

The DjVus I create use a completely different compressor from the one the IA uses (or used to use). I use djvulibre's c44/cjb2 for JPG/bitonal and the IA used a LuraTech MRC compressor, which is why their DjVus are so slim, and why there's such strong foreground/background separation.
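For illustration only, here is a rough sketch of how a per-page c44/cjb2 encode followed by a djvm bundle could be driven from Python. It is not the actual script used for these uploads: the function names, the file-extension rule for picking the encoder, and the 300 dpi default are all assumptions. It only builds the command lines, so it can be read (or tested) without djvulibre installed.

```python
from pathlib import Path


def page_encode_command(image: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build the djvulibre command that would encode one page image.

    Hypothetical helper: choosing the encoder by file extension is an
    assumption made for this sketch, as is the 300 dpi default.
    """
    out = out_dir / (image.stem + ".djvu")
    if image.suffix.lower() == ".pbm":
        # cjb2 encodes bitonal (black and white) pages.
        return ["cjb2", "-dpi", str(dpi), str(image), str(out)]
    # c44 encodes continuous-tone pages with the IW44 wavelet codec.
    return ["c44", "-dpi", str(dpi), str(image), str(out)]


def bundle_command(pages: list[Path], book: Path) -> list[str]:
    """Build the djvm command that would bundle single-page DjVus into one file."""
    return ["djvm", "-c", str(book)] + [str(p) for p in pages]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine with djvulibre on the PATH. Note that this produces no OCR text layer; that would be a separate step.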

It would be better if CropTool could pull the image right off the IA/Hathi/... URL (basically JumpToFile for CropTool).

The other issue is that a straight crop is often not sufficient, so it's not really the thing to optimise a workflow for.

Don't worry about v20, I'll sort it out. Just keep it in mind in future: I'd rather do it later/out of order than do it twice.

It is not that complex, but because the Commons uploads are failing 70% of the time, it's slow going, and I keep hitting new corner cases! Eventually I'll get to a set-and-forget state!

Inductiveload changed the task status from Open to Stalled. Sep 28 2021, 10:15 PM

Following the SSUs and a recent config change that has made it (slightly) easier to upload, this is finally all uploaded.