
Specific PDF file only displays completely white page previews
Closed, Resolved · Public

Description

I recently uploaded this file: https://commons.wikimedia.org/wiki/File:Report_of_the_Park_Board_1903.pdf

The pages do not appear as previews. I've tried two browsers, both logged in and logged out. The previews don't appear in any case. Another user has confirmed the problem: https://commons.wikimedia.org/wiki/Commons:Help_desk#Newly_uploaded_PDF_did_not_generate_page_previews

Event Timeline

Restricted Application added a subscriber: Aklapper. Feb 23 2018, 6:00 PM
Aklapper renamed this task from Newly uploaded PDF does not display page previews to Specific PDF file only displays completely white page previews. Feb 23 2018, 6:54 PM
Restricted Application added a project: Multimedia. Feb 23 2018, 6:55 PM

The image is not white when trying locally using ImageMagick via

$:acko\> convert Report_of_the_Park_Board_1903.pdf[1] example.png
$:acko\> rpm -q ImageMagick
ImageMagick-6.9.9.27-1.fc27.x86_64

However, note that the Wikimedia server config uses $wgImageMagickConvertCommand = '/usr/local/bin/mediawiki-firejail-convert'; which brings T164145 to mind.

brion added a subscriber: brion. Edited · Feb 23 2018, 7:31 PM

Works locally for me within PdfHandler as well (on Mac), though I think we now run the PDF thumbs through Thumbor rather than directly through MediaWiki+PdfHandler in production?

Manually running the command that Thumbor would run with a local (not firejailed) Ghostscript install gives me working output, but there is a warning from openjpeg:

$ gs -sDEVICE=jpeg -dJPEG=90 -sOutputFile='%stdout' -dFirstPage=1 -dLastPage=1 -r150 -dBATCH -dNOPAUSE -dSAFER -q -fReport_of_the_Park_Board_1903.pdf > test.jpg
openjpeg warning: Non conformant codestream TPsot==TNsot.

brion added a comment. Edited · Feb 23 2018, 7:49 PM

One thing I notice is that the images in this file are all stored as JPEG 2000 (which is what openjpeg decompresses), while various other files I see working use various other PDF-specific compression formats.

You can use pdfimages -raw foo.pdf some-file-prefix to extract the images individually in their original format.
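A quick way to check the JPEG 2000 observation without extracting anything is to look for the PDF filter name `/JPXDecode`, which marks JPEG 2000 image streams. The function below is only a rough sketch: `count_jpx_filters` is a hypothetical helper name, and it assumes the filter name appears literally in the image dictionaries, which is the usual case but not guaranteed for every PDF.

```shell
# Count occurrences of the /JPXDecode filter name in a PDF file.
# Assumption: image dictionaries are stored as plain text in the file.
count_jpx_filters() {
    # -a treats the binary file as text; -o prints each match on its own line.
    grep -a -o '/JPXDecode' "$1" | wc -l | tr -d ' '
}

# Usage (hypothetical invocation):
#   count_jpx_filters Report_of_the_Park_Board_1903.pdf
```

A non-zero count suggests the file carries JPEG 2000 images; `pdfimages` remains the better tool for inspecting the images themselves.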

brion added a comment. Feb 23 2018, 8:08 PM

Here's what I get on a Debian jessie vm:

$ gs --version
9.06
$ gs -sDEVICE=jpeg -dJPEG=90 -sOutputFile='%stdout' -dFirstPage=1 -dLastPage=1 -r150 -dBATCH -dNOPAUSE -dSAFER -q -fReport_of_the_Park_Board_1903.pdf > test.jpg
error: cannot decode code stream
unable to decode JPX image data.

   **** Warning: File has insufficient data for an image.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Adobe Acrobat 8.13 Paper Capture Plug-in <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

And the resulting output file is blank/white.
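Because Ghostscript still writes an output file when decoding fails, a blank thumbnail is easy to miss in a script. One possible check, sketched below, is to capture Ghostscript's messages and look for the two error strings quoted above. `has_jpx_decode_error` and the log file name are hypothetical, and the sketch assumes the messages arrive on stderr, as they appear to in the run above where stdout was redirected to the JPEG.

```shell
# Succeeds if a captured Ghostscript log contains a JPX decode failure.
# Only the error strings quoted in this thread are matched.
has_jpx_decode_error() {
    # $1: file holding the captured Ghostscript messages
    grep -q -e 'unable to decode JPX image data' \
            -e 'cannot decode code stream' "$1"
}

# Usage (hypothetical file names):
#   gs -sDEVICE=jpeg -dJPEG=90 -sOutputFile='%stdout' -dFirstPage=1 -dLastPage=1 \
#      -r150 -dBATCH -dNOPAUSE -dSAFER -q -fReport_of_the_Park_Board_1903.pdf \
#      > test.jpg 2> gs-stderr.log
#   if has_jpx_decode_error gs-stderr.log; then echo "JPX decode failed"; fi
```

The openjpeg warning seen on the Mac run would not trigger this check, which matches the fact that that run produced usable output.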

Thanks for all the efforts. Until a full general fix is worked out, is there a way to get this file working? If the JPEG 2000 source is the issue, is there perhaps a way to convert the pages one-by-one, and then rebuild a PDF or DJVU? I'd be happy to try doing that, but if there's an imminent (possible) general solution, let me know and I'll hold off.

brion added a comment. Feb 28 2018, 7:47 PM

@Peteforsyth I tried converting the PDF to DJVU with the aptly-named pdf2djvu. Give this a whirl: https://brionv.com/misc/Report_of_the_Park_Board_1903.djvu
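For anyone wanting to repeat the workaround locally: pdf2djvu is a command-line tool. The exact invocation used here is not given in the thread, so the following is only a minimal sketch; `djvu_name` and `pdf_to_djvu` are hypothetical helper names, and it assumes pdf2djvu is installed and on the PATH.

```shell
# Derive the output name by swapping a trailing .pdf for .djvu.
djvu_name() {
    # Names without a .pdf suffix simply gain .djvu.
    printf '%s.djvu\n' "${1%.pdf}"
}

# Convert one PDF to a bundled DjVu file next to the input.
pdf_to_djvu() {
    # -o names the output file.
    pdf2djvu -o "$(djvu_name "$1")" "$1"
}

# Usage (hypothetical invocation):
#   pdf_to_djvu Report_of_the_Park_Board_1903.pdf
```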

Thanks @brion -- I'll try it, I suppose by just uploading it at Commons. I take it what you used was a webapp? I should have thought of that. I believe Wikisource prefers DJVU anyway, so it's a helpful workaround in this instance.

Seems to work fine. I'm not marking this "resolved," in case there is still hope of solving the underlying problem...but this file is now good to go! Thanks again @brion.
-Pete

brion added a comment. Mar 5 2018, 9:36 PM

Once the Thumbor machines get updated from Debian Jessie to Stretch (T170817), it should resolve the underlying issue. Glad the workaround djvu file is working for now. :)

Ramsey-WMF moved this task from Untriaged to Tracking on the Multimedia board. Mar 9 2018, 6:45 PM

FYI: the Ghostscript version on our Thumbor servers was upgraded to 9.26, so this might be worth re-testing.
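When re-testing on another machine, it may help to confirm the installed Ghostscript version first. The sketch below compares two version strings; `version_ge` is a hypothetical helper, it relies on the `-V` (version sort) option of GNU `sort`, and 9.26 is simply the version mentioned in this thread, not a verified minimum for the fix.

```shell
# True if version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
    # The smaller of the two versions sorts first; $1 >= $2 when that is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage: gs --version prints a bare version string such as 9.06.
#   if version_ge "$(gs --version)" 9.26; then echo "worth re-testing"; fi
```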

Gilles closed this task as Resolved. Feb 12 2019, 1:58 PM
Gilles claimed this task.
Gilles added a subscriber: Gilles.

The JPX issue was definitely fixed, as seen in other tasks I just verified and closed.

JFTR, the recent Ghostscript update to 9.26 also switched the JPEG2000 library from Jasper to OpenJPEG (current Ghostscript releases no longer support Jasper), so that might have also had an effect.