
Issue with PDFs downloaded from Archive.org
Closed, Declined, Public

Description

Author: shijualex

Description:
Hi

I am finding some issues with PDFs downloaded from https://archive.org when we use them with the Proofread extension in Wikisource.

For example, see this file at Archive.org: https://archive.org/details/pazhancholmala_gundert_1845 This file can be viewed properly and downloaded from Archive.org.

I downloaded this file and uploaded it to Commons: https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf If we download the file from Commons, we can also view it properly and read it.

Now there are 2 issues with Commons/Proofread/Mediawiki

  1. Inside Commons itself (for example, https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf) you can see that you cannot view the pages from this file in higher resolution.
  2. When we create an Index file in Wikisource (for example, https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try to work on a page (for example, https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.pdf/7&action=edit&redlink=1) you can see that nothing much can be seen on the scanned page.

The second issue might be a direct consequence of issue 1. Could you please look into this issue?

I suspect the issue is closely related to the PDF generation method at Archive.org. But I am not sure about that either, since the PDF file as a whole is perfectly fine.


Version: unspecified
Severity: minor

Details

Reference
bz57278

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 2:25 AM
bzimport set Reference to bz57278.
bzimport added a subscriber: Unknown Object (MLST).

I suspect that this probably isn't a ProofreadPage issue but one of either the PdfHandler extension or, more likely, the tools that render PDF pages to images on the Wikimedia image scalers, those being ghostscript and imagemagick.

Moving to PdfHandler for the time being.

Software on cluster:
reedy@tin:/a/common$ dpkg -l | grep ghostscript
ii ghostscript 9.05~dfsg-0ubuntu4.2 interpreter for the PostScript language and for PDF
ii gs-cjk-resource 1.20100103-3 Resource files for gs-cjk, ghostscript CJK-TrueType extension
reedy@tin:/a/common$ dpkg -l | grep imagemagick
ii imagemagick 8:6.6.9.7-5ubuntu3.2 image manipulation programs
ii imagemagick-common 8:6.6.9.7-5ubuntu3.2 image manipulation programs -- infrastructure

I note a similar output locally too on my dev wiki

reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ dpkg -l | grep ghostscript
ii ghostscript 9.10~dfsg-0ubuntu2 amd64 interpreter for the PostScript language and for PDF
ii gs-cjk-resource 1.20100103-3 all Resource files for gs-cjk, ghostscript CJK-TrueType extension
reedy@ubuntu64-web-esxi:/var/www/wiki/mediawiki/core$ dpkg -l | grep imagemagick
ii imagemagick 8:6.7.7.10-5ubuntu3 amd64 image manipulation programs
ii imagemagick-common 8:6.7.7.10-5ubuntu3 all image manipulation programs -- infrastructure

Hopefully it can get triaged a little before being dumped onto the WMF image scaler component....

shijualex wrote:

Able to reproduce the issue with another PDF downloaded from Archive.org: https://ml.wikisource.org/w/index.php?title=Page:Dharmaraja_1913.pdf/11&action=edit&redlink=1 Even though, in this case, we are just able to read the content (with some difficulty), it is not good enough for Wikisource digitization efforts.

(In reply to comment #0)

  1. Inside Commons itself (for example, https://commons.wikimedia.org/wiki/File:Pazhancholmala_Gundert_1845.pdf) you can see that you cannot view the pages from this file in higher resolution.

How is this unexpected? The PDF has low resolution (and it's only 2 MB), so it's correctly displayed.

$ pdfinfo Gundert_Pazhancholmala_1845.pdf
Title: Pazhancholmala by Hermann Gundert 1845
Keywords: http://archive.org/details/pazhancholmala_gundert_1845
Author: Hermann Gundert
Creator: Digitized by the Internet Archive
Producer: Recoded by LuraDocument PDF v2.53
CreationDate: Mon Sep 16 16:22:18 2013
ModDate: Mon Sep 16 16:23:29 2013
Tagged: no
Form: none
Pages: 147
Encrypted: no
Page size: 91 x 148 pts
Page rot: 0
File size: 2363482 bytes
Optimized: yes
PDF version: 1.5

https://catalogd.archive.org/log/177773313 tells me:
Source Gundert_Pazhancholmala_1845_images.zip : "Generic Raw Book Zip"
[...]
INFO: Global image dpi: 600

It's possible that the resolution was guessed incorrectly (unless the pages of this book are very small, 147 pages at 600 dpi can't add up to only 35 MB). Please edit the metadata to add the correct resolution at which the images were produced; see the fixed-ppi instructions at https://en.wikisource.org/wiki/Help:DjVu_files#The_Internet_Archive
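As a rough check of the resolution reasoning above, the page size reported by pdfinfo (91 x 148 pts) can be converted to physical and pixel dimensions. This is only a sketch: 72 points per inch is the standard PDF unit, but the function names are made up here, and the 150 dpi alternative is a hypothetical value chosen for illustration, not something taken from the item's metadata.

```python
POINTS_PER_INCH = 72  # PDF default user space: 1 pt = 1/72 inch


def page_inches(width_pts, height_pts):
    """Physical page size in inches implied by a PDF page size in points."""
    return width_pts / POINTS_PER_INCH, height_pts / POINTS_PER_INCH


def pixels_at_dpi(width_pts, height_pts, dpi):
    """Pixel dimensions a scan of that physical size would have at `dpi`."""
    w_in, h_in = page_inches(width_pts, height_pts)
    return round(w_in * dpi), round(h_in * dpi)


def inches_at_true_dpi(width_px, height_px, true_dpi):
    """Physical size the same pixels would cover if scanned at `true_dpi`."""
    return width_px / true_dpi, height_px / true_dpi


if __name__ == "__main__":
    # Values from the pdfinfo output: "Page size: 91 x 148 pts"
    w_in, h_in = page_inches(91, 148)
    print(f"declared page: {w_in:.2f} x {h_in:.2f} in")

    # "Global image dpi: 600" from the derive log
    w_px, h_px = pixels_at_dpi(91, 148, 600)
    print(f"pixels at 600 dpi: {w_px} x {h_px}")

    # Hypothetical: what if the images were really produced at 150 dpi?
    w_alt, h_alt = inches_at_true_dpi(w_px, h_px, 150)
    print(f"page if 150 dpi: {w_alt:.2f} x {h_alt:.2f} in")
```

With these inputs the declared page comes out at roughly 1.26 x 2.06 inches, which would be a very small book, so an incorrectly guessed dpi seems at least plausible.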

  2. When we create Index file in Wikisource (for example, https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try to work on a page (for example, https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.pdf/7&action=edit&redlink=1) you can see that nothing much can be seen on the scanned page.

What is it that you don't see there? The text isn't loaded, but this is expected because, as you know very well, there is no OCR. I also see the image from the PDF correctly, in my case https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Dharmaraja_1913.pdf/page11-500px-Dharmaraja_1913.pdf.jpg which according to wget -S has Last-Modified: Wed, 20 Nov 2013 03:16:52 GMT, so it may have been created when someone else clicked the link in comment 0. Do you still not see an image there? If you don't, is it consistent on all pages?

Shiju Alex: Can you please answer Nemo's questions in comment 3?

What is it that you don't see there? Do you still not see an image there? If you don't, is it consistent on all pages?

Looking for actionable items, I currently only see this:

(In reply to comment #3 by Nemo)

It's possible that the resolution was guessed incorrectly

(In reply to comment #0)

  2. When we create Index file in Wikisource (for example, https://ml.wikisource.org/wiki/Index:Pazhancholmala_Gundert_1845.pdf) and try to work on a page (for example, https://ml.wikisource.org/w/index.php?title=Page:Pazhancholmala_Gundert_1845.pdf/7&action=edit&redlink=1) you can see that nothing much can be seen on the scanned page.

Are you sure that the PDF has an OCR layer (you can test by opening it in a PDF viewer and seeing if you can select/copy text in the document)? PdfHandler seems to think all the pages are blank - https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Pazhancholmala_Gundert_1845.pdf (scroll down to the text property)
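The API check above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, `format=json` is added to the query from the link, and the response shape (a `metadata` list of name/value pairs, with a `text` entry holding one value per page) is assumed rather than confirmed here, so inspect a real response before relying on it.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"


def metadata_query_url(title):
    """Build the imageinfo metadata query used in the comment above."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": title,
        "format": "json",  # assumption: added so the reply is machine-readable
    }
    return API_ENDPOINT + "?" + urlencode(params)


def page_texts(imageinfo):
    """Return the per-page text strings from one imageinfo record.

    Assumed shape: imageinfo["metadata"] is a list of
    {"name": ..., "value": ...} entries, and the entry named "text"
    holds a list of {"name": page_index, "value": page_text}.
    """
    for entry in imageinfo.get("metadata") or []:
        if entry.get("name") == "text":
            return [str(page.get("value", "")) for page in entry.get("value") or []]
    return []


def all_pages_blank(imageinfo):
    """True if no page carries any non-whitespace text (or no text entry exists)."""
    return all(not text.strip() for text in page_texts(imageinfo))


if __name__ == "__main__":
    print(metadata_query_url("File:Pazhancholmala_Gundert_1845.pdf"))
```

Fetching the URL and decoding the JSON is left out on purpose; the helpers only build the request and interpret an already parsed record.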

Closing worksforme.

I downloaded the file, and looked at it with various tools:
*The text layer appears to be empty. It has no OCR data, hence ProofreadPage cannot retrieve the text of the document. (ProofreadPage doesn't do OCR, it only extracts what is embedded in the document)
*The file does have a low resolution. Other PDF tools also display it very small.
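One way to repeat the text-layer check locally is sketched below. It assumes the `pdftotext` command from poppler-utils is installed (the comment above does not say which tools were used), and the function names are made up for this example. `pdftotext` ends each page with a form feed character, which is what the split relies on.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the embedded text ("-" writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def pages_with_text(raw_text):
    """Return 1-based numbers of pages containing non-whitespace text.

    Pages in pdftotext output are separated by form feeds; the empty
    chunk after the final form feed is ignored because it is blank.
    """
    pages = raw_text.split("\f")
    return [number for number, page in enumerate(pages, start=1) if page.strip()]


if __name__ == "__main__":
    import sys

    found = pages_with_text(extract_text(sys.argv[1]))
    if found:
        print("pages with embedded text:", found)
    else:
        print("no embedded text layer found")
```

For a scan without OCR, such as the file discussed here appears to be, this would be expected to report no embedded text.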

(If you think there's still a bug here, please re-open)