
PDF image extraction fails
Closed, ResolvedPublic

Description

Author: lars

Description:
On Wikimedia Commons (i.e. the version running there), the file
File:Finlands Allmänna Tidning 1820-01-03.pdf
doesn't show any page images.
Adding ?action=purge to the URL doesn't help.

No explanation is given.
In an offline reader, the PDF looks fine.

If the images are encoded in a way that MediaWiki can't handle,
the user would be helped by an error message that gives
instructions on which image encodings are supported.


Version: unspecified
Severity: normal
URL: https://commons.wikimedia.org/w/index.php?title=Commons:Bots/Work_requests&oldid=98361382#Fix_some_invalid_PDFs

Details

Reference
bz23326

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:01 PM
bzimport set Reference to bz23326.

lars wrote:

On the Commons:Village_pump I was told how to view the
error message (this was not trivial and leaves room
for improvement). Apparently:

Error creating thumbnail: GPL Ghostscript 8.61: Unrecoverable error, exit code 1
convert: no decode delegate for this image format `/tmp/magick-XXP4reva'.

However, the offline PDF viewer "evince" that comes with
Ubuntu Linux had no problem viewing this PDF (images+text),
and "pdfimages" also succeeds in extracting the images,
so it should be possible with free software.

The error message makes it slightly sound like a font problem maybe(?) since according to the ghostscript faq ( http://pages.cs.wisc.edu/~ghost/doc/gnu/7.05/Issues.htm ):

When CIDFont-CMap pair required by PDF file is not available GS fails with:
/undefinedresource in --findresource--

and there's all sorts of font-related stuff on the operand stack, but I don't know much about PDFs, so that is a wild guess.


Anyways, here's the actual output from ghostscript when run on the command line (page 1 seems to print fine before it all blows up):

Processing pages 1 through 4.
Page 1
Substituting CID font resource/Adobe-Identity for /Arial.
Error: /undefinedresource in findresource
Operand stack:

--nostringval--   --dict:8/17(L)--   FontU   56.41   --dict:6/6(L)--   --dict:6/6(L)--   ArialUnicodeMS-Identity-H   --dict:9/12(ro)(G)--   --nostringval--   --dict:6/6(L)--   --dict:6/6(L)--   Adobe-Identity   CIDFont   Adobe-Identity

Execution stack:

%interp_exit   .runexec2   --nostringval--   --nostringval--   --nostringval--   2   %stopped_push   --nostringval--   --nostringval--   --nostringval--   false   1   %stopped_push   1905   1   3   %oparray_pop   1904   1   3   %oparray_pop   1888   1   3   %oparray_pop   --nostringval--   --nostringval--   2   1   4   --nostringval--   %for_pos_int_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--   false   1   %stopped_push   --nostringval--   %loop_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %array_continue   --nostringval--   --nostringval--   --nostringval--   --nostringval--   --nostringval--   %loop_continue   --nostringval--   1856   13   10   %oparray_pop   findresource   %errorexec_pop   --nostringval--   --nostringval--   --nostringval--

Dictionary stack:

--dict:1151/1684(ro)(G)--   --dict:1/20(G)--   --dict:97/200(L)--   --dict:97/200(L)--   --dict:108/127(ro)(G)--   --dict:275/300(ro)(G)--   --dict:22/25(L)--   --dict:4/6(L)--   --dict:21/40(L)--   --dict:6/8(L)--   --dict:38/40(ro)(G)--

Current allocation mode is local
Last OS error: 2
GPL Ghostscript 8.62: Unrecoverable error, exit code 1

lars wrote:

No fonts should be needed to extract scanned images from a PDF, so maybe the use of Ghostscript is the problem, and we should use pdfimages instead?
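
A minimal sketch of that suggestion, assuming the pdfimages tool (from poppler-utils or xpdf) is installed and on the PATH. The helper names pdfimages_command and extract_images and the "page" file prefix are illustrative only and are not part of MediaWiki. Note that this recovers only the embedded raster images; it does not render text or vector content.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, output_root):
    """Build the argument list for pdfimages.

    -j writes DCT (JPEG) image streams unchanged as .jpg files;
    other images are written in PPM/PBM format.
    """
    return ["pdfimages", "-j", str(pdf_path), str(output_root)]


def extract_images(pdf_path, output_dir):
    """Extract the embedded images and return the files that were written."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages is not installed")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # pdfimages names its output <root>-NNN.<ext>; "page" is a hypothetical root.
    subprocess.run(pdfimages_command(pdf_path, out / "page"), check=True)
    return sorted(out.glob("page-*"))
```

Because no fonts are loaded in this path, a missing CID font like the one in the Ghostscript trace above should not matter here, though that has not been verified against this particular file.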

M8R-udfkkf wrote:

I also have a PDF that isn't thumbnailing at Commons:
File:EAA2 Mississippi River Delta.pdf

When I try to create a thumbnail, it gives
"Error creating thumbnail: convert: no decode delegate for this image format `/tmp/magick-XXKuSImy' @ error/constitute.c/ReadImage/532.
convert: missing an image filename `/mnt/thumbs/wikipedia/commons/thumb/9/92/EAA2_Mississippi_River_Delta.pdf/page1-557px-EAA2_Mississippi_River_Delta.pdf.jpg' @ error/convert.c/ConvertImageCommand/2970."

py wrote:

this is referenced by rt 1175 which is now closed.

this can probably be closed, but needs verification.

Hmm, doesn't seem to be solved by the 8.71 upgrade (bug 26388), and this isn't fixed by 9.04 either ("Error: /syntaxerror in -file-GPL Ghostscript 9.04: Unrecoverable error, exit code 1"), so it doesn't seem likely that 9.05 is going to fix this (bug 36580). Someone should probably test this with the very latest version of Ghostscript, and if it's broken there, too, report a bug upstream (see http://www.ghostscript.com/ )
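
For anyone who wants to repeat that test against a given Ghostscript version, here is a rough sketch that renders one page to JPEG and records the version and exit code. It assumes a gs binary on the PATH; the function names, the 150 dpi default and the output file name are illustrative, and this is not necessarily the exact command line MediaWiki uses.

```python
import subprocess


def gs_render_command(pdf_path, jpeg_path, page=1, dpi=150):
    """Build a Ghostscript command that renders a single page to JPEG."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=jpeg",
        f"-dFirstPage={page}", f"-dLastPage={page}",
        f"-r{dpi}",
        f"-sOutputFile={jpeg_path}",
        str(pdf_path),
    ]


def try_render(pdf_path, jpeg_path, page=1):
    """Return (gs version, exit code, combined output) for one render attempt."""
    version = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True
    ).stdout.strip()
    result = subprocess.run(
        gs_render_command(pdf_path, jpeg_path, page),
        capture_output=True, text=True,
    )
    return version, result.returncode, result.stdout + result.stderr
```

A non-zero exit code together with the captured output is what would go into an upstream report.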

Still there. The PDF opens correctly on my machine and a user successfully converted it to https://commons.wikimedia.org/wiki/File:Finlands_Allm%C3%A4nna_Tidning_1820-01-03.djvu

(In reply to comment #7)

Still there. The PDF opens correctly on my machine and a user successfully
converted it to

Correctly on your machine with ghostscript or using some other program?

(In reply to comment #8)

(In reply to comment #7)

Still there. The PDF opens correctly on my machine and a user successfully
converted it to

Correctly on your machine with ghostscript or using some other program?

I had tried okular, but gs works too.

$ ghostscript Finlands_Allmänna_Tidning_1820-01-03.pdf
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 4.
Page 1
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.

showpage, press <return> to continue<<

Page 2
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.

showpage, press <return> to continue<<

Page 3
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.

showpage, press <return> to continue<<

Page 4
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
Loading a TT font from /usr/share/ghostscript/9.05/Resource/CIDFSubst/DroidSansFallback.ttf to emulate a CID font Adobe-Identity ... Done.

showpage, press <return> to continue<<

I had tried okular, but gs works too.

Ok, that implies that the issue was fixed upstream and an upgrade to ghostscript would fix the issue.

Adding keyword ops.

(In reply to comment #10)

I had tried okular, but gs works too.

Ok, that implies that the issue was fixed upstream and an upgrade to
ghostscript would fix the issue.

I don't think so. We're already on 9.05...

Testcase in Comment 0

Trying https://upload.wikimedia.org/wikipedia/commons/archive/1/19/20121126125750%21Finlands_Allm%C3%A4nna_Tidning_1820-01-03.pdf in Ghostscript 9.06 from 2012-08-08 on a Fedora 18 machine I get:

   **** Warning: File has unbalanced q/Q operators (too many q's)
   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

Hence I don't see any valid bug report here and nothing that could be fixed on Wikimedia's side. => Closing as INVALID.

A bug report should be filed against the tool the PDF was created with (unfortunately not exposed in its metadata).

If anybody thinks that GhostScript should be more forgiving, feel free to report a request at http://bugs.ghostscript.com/ .

Testcase in Comment 4

No problems reproducible, thumbnail shown, no issues in Ghostscript. Might have been a different issue that somehow disappeared.

lars wrote:

Andre, did you read the comments? In comment 12, Marco replaced
the original PDF with a modified PDF. That doesn't remove this bug,
which is that Mediawiki fails to generate thumbnails or a proper
error message for the original PDF. The original PDF still displays
properly in other software, so it is not broken.

(In reply to comment #14)

Andre, did you read the comments? In comment 12, Marco replaced
the original PDF with a modified PDF.

That's why I tested with the old PDF.

That doesn't remove this bug

This report covers a few things. One is the problem that sometimes thumbnails are not created for PDF files that Ghostscript considers to be invalid.

If this report is about the aspect "Provide some error message in the browser" then it is not fixed, indeed, but I consider this aspect ("Expose readable error messages in the browser") to be covered in bug 23831 already.

The original PDF still displays
properly in other software, so it is not broken.

So far the issue was the missing thumbnail, not how the PDF itself displays in other software. Ghostscript says the PDF file is broken and we use Ghostscript.
I can imagine that other software is more forgiving.
If you know that the PDF file is not broken and hence consider the error message in GhostScript wrong or misleading it would be best to discuss this with the GhostScript developers. See the link in comment 13.

lars wrote:

"Ghostscript says the PDF file is broken and we use Ghostscript."
With that logic, you can say "and we use Mediawiki 1.5", and stop
improving anything. Why should we report bugs anymore? Already in
comment 1, I suggested that perhaps we should use pdfimages
(which does work) instead of ghostscript (which is overly picky).

But if the file is indeed broken, then Ghostscript should be used
as a validator during upload and refuse to accept this broken file.
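
A sketch of what such an upload-time check might look like, under several assumptions: gs is on the PATH, the nullpage device is used so that pages are interpreted but nothing is written, fatal problems show up as lines starting with "Error:" (as in the trace earlier in this report), and repairable problems show up as lines prefixed with "****". The names gs_check_command, classify_gs_output and validate_pdf and the three verdict strings are hypothetical.

```python
import subprocess


def gs_check_command(pdf_path):
    """Interpret every page but discard the output (nullpage device)."""
    return ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
            "-sDEVICE=nullpage", str(pdf_path)]


def classify_gs_output(returncode, output):
    """Map a Ghostscript run to ("reject" | "warn" | "ok", messages)."""
    lines = output.splitlines()
    errors = [l.strip() for l in lines if l.lstrip().startswith("Error:")]
    # Drop the leading "****" marker and spaces from warning lines.
    warnings = [l.strip().lstrip("* ").strip()
                for l in lines if l.lstrip().startswith("****")]
    if returncode != 0 or errors:
        return "reject", errors or [f"exit code {returncode}"]
    if warnings:
        return "warn", warnings
    return "ok", []


def validate_pdf(pdf_path):
    """Run the check and classify the result."""
    result = subprocess.run(gs_check_command(pdf_path),
                            capture_output=True, text=True)
    return classify_gs_output(result.returncode,
                              result.stdout + result.stderr)
```

Whether "warn" should block an upload is a policy question; the messages could at least be shown to the uploader instead of a missing thumbnail.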

I think there is a misunderstanding here.

We are responsible for MediaWiki and this is the canonical, "upstream" bugtracker for MediaWiki, so we of course accept reports and fix bugs for it.

So far I have no reason to not believe the output of GhostScript that the specific PDF file is invalid. Again, if you think that GhostScript is wrong, the GhostScript developers need to be contacted "upstream", but I haven't seen any indication that it's a bug in GS so far. We use 3rd party software in many places (like PDF handling) to not reinvent the wheel (the related term is "downstream" - just mentioning the concept here, as I don't know how much open source background you have).

I suggested that perhaps we should use pdfimages (which does work)
instead of ghostscript (which is overly picky).

That's worth a separate enhancement request, please file it in this Bugzilla so it can be considered.

But if the file is indeed broken, then Ghostscript should be used
as a validator during upload and refuse to accept this broken file.

That's another pretty good idea, and worth another separate request. :)

In general only one issue per report should be handled, and this report is about a specific PDF file testcase that does not show a thumbnail, and from all I know so far the reason is that the PDF file is broken, so there's nothing to do server-/software-side (yet) for Wikimedia developers. Hence I closed this as INVALID. This does not mean that things could not be improved in several ways via several involved parties in the long run, but that's out of scope for this specific issue.

We seem to have 3200 affected files in [[Category:PDF files affected by MediaWiki restrictions]]

(In reply to comment #18)

[[Category:PDF files affected by MediaWiki restrictions]]

-> not en, Commons:
https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

lars wrote:

Long, long ago, I was a little interested in getting Wikisource
to work, and so, when I found something that didn't work, I used
to file bug reports like this one (in April 2010, mind you).

However, the tendency of every little problem to become huge and
impossible to solve has removed most of my previous interest.
Comment #17 above is one very typical example of how this happens.

Three years have passed. I leave it to others to try to get
Wikisource to work. I have another project to work on.
Have a good life.

(In reply to comment #20)

Indeed this is very frustrating. As domas said on some other bug, it doesn't matter whose fault it is; what matters is that the site is broken for the users (and readers).

I think it's useful to discover that the problem lies in some PDF error, that might even be something users can "easily" solve themselves without waiting years for a bug fix; if we decide not to work around library restrictions, though, this doesn't make the problem disappear.
In other words, what are users supposed to do in order to fix those PDFs? Are there standard commands to do so? We could for instance run a bot on Commons (this bug would be moved to Wikimedia>General), or at least make the error more useful.

(In reply to comment #21)

I think it's useful to discover that the problem lies in some PDF error, that
might even be something users can "easily" solve themselves without waiting
years for a bug fix;

How?

if we decide not to work around library restrictions,
though, this doesn't make the problem disappear.

Which library restriction would you exactly like to work around here and how?

With which exact incentive was this bug report reopened? We cannot easily fix broken damaged PDF files that were uploaded, so what is the expectation?
(The feature requests in comment 16 should be separate bug reports as I write in comment 17). As I wrote before, a Ghostscript update very likely won't fix the issue in comment 0, and the different issue in comment 4 vanished.

If you are after better error message propagation etc., please make that a different enhancement request. For the testcases in comment 0 and comment 4 on this bug report, I still consider this bug report INVALID.

(In reply to comment #22)

(In reply to comment #21)

I think it's useful to discover that the problem lies in some PDF error, that
might even be something users can "easily" solve themselves without waiting
years for a bug fix;

How?

pdfimages works, apparently.

[editconflict]

I had a look at a non-representative amount of PDF files taken from https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I encountered the following problems:

-> There is no "real" fix for those files. One could raise the limits, but this would have an impact on server performance.

-> Possible fix: Repair those files by bot, or use other software that is less strict when processing PDF files. Though changing the viewer could also introduce more problems or new bugs...
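
One way a repair bot might approach this is to let Ghostscript re-write each file through its pdfwrite device and re-upload the result only if the run succeeds. This is a sketch under assumptions: that re-writing actually cures the defect in a given file (untested for the files in the category), and that the helper names and file paths, which are illustrative, suit the bot's workflow.

```python
import subprocess


def gs_rewrite_command(src_pdf, dst_pdf):
    """Build a command that re-writes a PDF through the pdfwrite device."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst_pdf}",
        str(src_pdf),
    ]


def rewrite_pdf(src_pdf, dst_pdf):
    """Return True if Ghostscript produced a re-written copy without error."""
    result = subprocess.run(gs_rewrite_command(src_pdf, dst_pdf),
                            capture_output=True, text=True)
    return result.returncode == 0
```

Re-writing can change fonts, compression and metadata, so the original upload should be kept in the file history.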

(In reply to comment #23)

pdfimages works, apparently.

$: man pdfimages
Pdfimages saves images from a Portable Document Format...

pdfimages does not save a PDF file as JPEG. It only extracts images from PDF files!?
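
To illustrate the distinction: rendering a whole page to JPEG is the job of a rasterizer such as pdftoppm, which ships alongside pdfimages in poppler-utils. A sketch, assuming the poppler build of pdftoppm (which supports the -jpeg option) is installed; the function names, the 150 dpi default and the output prefix are hypothetical.

```python
import subprocess


def pdftoppm_command(pdf_path, output_root, page=1, dpi=150):
    """Build a pdftoppm command that renders one page as a JPEG file."""
    return [
        "pdftoppm", "-jpeg",
        "-r", str(dpi),
        "-f", str(page), "-l", str(page),
        str(pdf_path), str(output_root),
    ]


def render_page(pdf_path, output_root, page=1, dpi=150):
    """Render one page; the file name is derived from output_root by pdftoppm."""
    subprocess.run(pdftoppm_command(pdf_path, output_root, page, dpi),
                   check=True)
```

So pdfimages would only be a substitute for pure scans, where each page is a single embedded image.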

(In reply to comment #18)

We seem to have 3200 affected files in https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I fixed 99% of all files.

(In reply to comment #26)

(In reply to comment #18)

We seem to have 3200 affected files in https://commons.wikimedia.org/wiki/Category:PDF_files_affected_by_MediaWiki_restrictions

I fixed 99% of all files.

Wonderful, let's consider this bug fixed (you deserve a medal!). There are two more bugs opened for some of the remaining files, which probably hit the resource limitations you mentioned.
Making MediaWiki work with such files by using lpr/CUPS or whatever would also be another request.