Installation of pdf2djvu
Closed, Duplicate (Public)

Description

Would someone be so kind as to install pdf2djvu on Labs? Thanks.


Version: unspecified
Severity: normal

Details

Reference
bz71989

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:49 AM)
bzimport added a project: Tools.
bzimport set Reference to bz71989.
bzimport added a subscriber: Unknown Object (MLST).

Providing usecases is welcome. :)

I occasionally need to convert public domain files found in PDF format. I need to trim components, and we have PDFtk on Labs, which suits that purpose. While the files can be uploaded to Commons for Wikisource as PDFs, PDFs are generally inferior at retaining line-by-line text, so they are not as useful for the Wikisources. Having pdf2djvu would let me grab, trim, and convert files on Labs, then push them to Commons.

An example is https://commons.wikimedia.org/wiki/File:Electoral_Disabilities_of_Women.pdf which I have uploaded, though due to the poor PDF rendering I need to OCR it separately (PITA).

At the moment, I am pulling in one or two files a week.

billinghurst, I installed djvudigital in /data/project/phetools to convert PDF to DjVu. The conversion fails in some rare cases of half-broken PDFs, but it is stable enough to use. The script that uses it is https://github.com/phil-el/phetools/blob/master/ocr/pdf_to_djvu.py and I can help you set it up on IRC.

@Billinghurst , is there a need for this still, or is this resolved?

I was able to convert with Phe's tools, though it lost the text layers and I had to OCR each page.

We now have a tool at Labs that allows the import and creation, though it is still quirky in its implementation.

As archive.org no longer creates DjVu files, I think that it is desirable. That said, if it is just problematic, I am comfortable with this task being culled and reopened if it is ever needed in the future.

Elitre subscribed.

I'm at a Wikisource workshop and I hear external tools like http://djvu.org/any2djvu/ recommended to create such files from PDFs. Is this really a step we need to rely on external sites for?

That's what people used to do before Internet Archive became widely used (and now again I guess), see http://en.wikisource.org/wiki/Help:DjVu_files

On which environment should this be installed? It's already on toolforge bastion:

09:50:33 0 ✓ zhuyifei1999@tools-bastion-02: ~$ apt search pdf2djvu
Sorting... Done
Full Text Search... Done
pdf2djvu/trusty,now 0.7.17-3ubuntu2 amd64 [installed]
  PDF to DjVu converter

09:50:51 0 ✓ zhuyifei1999@tools-bastion-02: ~$ which pdf2djvu
/usr/bin/pdf2djvu
09:50:57 0 ✓ zhuyifei1999@tools-bastion-02: ~$
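For a tool that wants to make the same check from a script rather than a shell, a minimal sketch might look like the following. It assumes only that the binary is named pdf2djvu and is on PATH, as the output above shows, and that the -o (output file) option is used; the function names are hypothetical and not taken from any existing tool.

```python
import shutil
import subprocess
from typing import List, Optional


def find_pdf2djvu() -> Optional[str]:
    """Return the full path of pdf2djvu if it is on PATH, else None."""
    return shutil.which("pdf2djvu")


def build_command(pdf_path: str, djvu_path: str,
                  executable: str = "pdf2djvu") -> List[str]:
    """Build the argument list: pdf2djvu -o OUTPUT.djvu INPUT.pdf"""
    return [executable, "-o", djvu_path, pdf_path]


def convert(pdf_path: str, djvu_path: str) -> None:
    """Convert one PDF to DjVu, raising if the tool is missing or fails."""
    executable = find_pdf2djvu()
    if executable is None:
        raise FileNotFoundError("pdf2djvu is not installed on this host")
    # check=True raises CalledProcessError on a non-zero exit status,
    # which is how a half-broken PDF would typically surface.
    subprocess.run(build_command(pdf_path, djvu_path, executable), check=True)
```

Calling convert("in.pdf", "out.djvu") on the bastion would then run the installed binary; on a host without the package it fails early with a clear error instead of a shell "command not found".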

I believe that this is used by ia-upload on tools.wmflabs to create DjVu files as part of its machinations for PDF files from archive.org, since they stopped generating DjVu files.
@Samwilson ?

I note that phe had it installed and I have used it at Tools directly, though my manipulation was less than ideal as I lost text layers per page, so I just backed away slowly due to my incompetence with the tool.

IA Upload doesn't convert PDFs itself; it hands that off to http://tools.wmflabs.org/phetools/pdf_to_djvu_cgi.py (which uses pdf2djvu) and then retrieves the resulting DjVu for upload to Commons.

Can pdf2djvu usage be changed to keep the text layer? Should we move this processing into ia-upload?
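As a rough sketch of what could be tried here: pdf2djvu documents text-layer options (--words, which is its default, --lines, and --no-text), and djvutxt from DjVuLibre prints the hidden text of a DjVu file. The code below assumes both tools are installed; the function names and the "mode" parameter are hypothetical. Note that pdf2djvu only carries over text that is already embedded in the PDF and does not perform OCR, so a scan-only PDF will still come out without a text layer.

```python
import subprocess
from typing import List

# Text-layer options documented in the pdf2djvu manual page.
TEXT_MODES = {
    "words": "--words",    # record the location of every word (default)
    "lines": "--lines",    # record the location of every line instead
    "none": "--no-text",   # do not extract text at all
}


def text_layer_command(pdf_path: str, djvu_path: str,
                       mode: str = "lines") -> List[str]:
    """Build a pdf2djvu command line with an explicit text-layer mode."""
    if mode not in TEXT_MODES:
        raise ValueError("unknown text mode: " + mode)
    return ["pdf2djvu", TEXT_MODES[mode], "-o", djvu_path, pdf_path]


def has_text_layer(djvu_path: str) -> bool:
    """Return True if djvutxt extracts any non-whitespace text."""
    result = subprocess.run(
        ["djvutxt", djvu_path],
        capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())
```

Running has_text_layer() on the converted file right after conversion would show whether the text was lost during conversion or was never present in the source PDF.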

So https://en.wikisource.org/wiki/Help:DjVu_files#Online_.28.5Balmost.5D_all_systems.29 looks outdated, IA doesn't create the djvu anymore, and it's not that one necessarily wants a book from IA - that's to say that if https://tools.wmflabs.org/ia-upload/commons/init works that's great but that's also just one use case. So I dunno if my question was really answered here :)

It's usually encouraged that works be uploaded to IA before being transferred from there to Commons, because in this way we get the benefit of their OCR and are adding to the large library of scanned works there. This should probably be made more explicit in the documentation...

That's just one detail in the page, easily fixed. The point is: yes, Wikisource users have relied on a number of tools for the conversion to DjVu, all of which (as far as I know) are documented on that page.

Installing DjVu-related packages on Toolforge seems a no-brainer to me, since so many tools may need them, and in fact it has already happened as far as I know. If more packages are needed, one can ask for them.