
Non-rendering of thumbnail of compressed pdf in Commons
Closed, Resolved · Public

Description

The file https://commons.wikimedia.org/wiki/File:Montazem_Naseri.pdf was uploaded with high compression, size 25.5 MB. The file did not display (jpg preview, thumbnail, next page option -- nowhere). Then I decompressed it to 178.68 MB and overwrote the previous file. Now it displays. The thumbnail of the previous file is still not visible.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 12 2018, 1:20 AM

Cannot reproduce; the images are displayed for me. Which specific ones are missing? Links welcome. :)

The preview is visible now after my overwrite. But please see the upload log. The thumbnail of the original version (compressed pdf) is not visible.

Restricted Application added a project: Multimedia. · Jun 12 2018, 10:02 AM

The first uploaded version wasn't displaying thumbnails for any page (I tried and confirmed this on random pages), either directly on the File: page at Commons or when viewed through some of the variations available at the Wikisources.

I have a similar problem with this file: https://commons.wikimedia.org/wiki/File:An_Anglo-Chinese_Vocabulary_of_the_Ningpo_Dialect.pdf
The thumbnail is not displayed even though the file is readable after being downloaded.

Vvjjkkii renamed this task from Non-rendering of thumbnail of compressed pdf in Commons to 67aaaaaaaa. · Jul 1 2018, 1:04 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 67aaaaaaaa to Non-rendering of thumbnail of compressed pdf in Commons. · Jul 2 2018, 2:09 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.

Is it the same problem with https://commons.wikimedia.org/w/index.php?title=File%3AWilhelm_Gesenius_Hebr%C3%A4ische_Grammatik_(umgearbeitet_von_Emil_Kautzsch).pdf&page=328 ?

I'm pretty sure that the file is a valid PDF because it works on my laptop. It also previewed well at the source site: https://archive.org/details/wilhelmgesenius00gese .

Not entirely sure if this is Thumbor territory but as we get some tasks about this (e.g. T203402 might be a dup) could be good to get some attention here.

I have just experienced the same problem again, see https://commons.wikimedia.org/wiki/File:Horse-radish_culture_in_Bohemia.pdf

I really don't know the reason, but that file loads very slowly.

It seems this problem is connected with many (or all?) pdf files downloaded from archive.org, so it would really help if it were solved. It prevents such files from being proofread at Wikisource, as the proofread extension is also unable to render the pdf pages from Commons.

That horse radish culture PDF is timing out when processed with ghostscript. Which means it's taking more than one minute on our production servers. It's an unreasonable amount of time for any thumbnail. The question is what's special about a 393KB PDF that it would take 1+ minute to extract a thumbnail from it.

On my own machine (macOS, gs 9.23) it's fast. I've verified on a production Thumbor machine, thumbor1001, that it's excruciatingly slow there (Debian Jessie, gs 9.06). On a Debian Stretch WMCS machine (gs 9.20) it feels fast as well.

This is most likely a ghostscript bug making processing of that kind of file very slow on the version of ghostscript we're stuck with on Debian Jessie. This should be revisited once the Thumbor cluster has been updated to Debian Stretch and a much newer version of ghostscript.
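For anyone who wants to compare Ghostscript versions on their own machine, here is a minimal sketch that renders one page to JPEG and reports whether it finished within a time limit. It is not the command Thumbor actually runs; the helper names, the output file name, the 150 dpi setting, and the 60-second default (chosen to mirror the one-minute limit mentioned above) are assumptions for illustration. The `gs` options used are standard Ghostscript switches.

```python
import subprocess
import sys
import time


def build_gs_command(pdf_path, out_path, page=1, dpi=150):
    """Build a Ghostscript command that renders a single page to JPEG.

    Hypothetical helper: the flag set is an assumption, not Thumbor's.
    """
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-dQUIET",
        "-sDEVICE=jpeg",
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-r{dpi}",
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def run_with_timeout(cmd, timeout_s=60.0):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok", "error" (non-zero exit), "timeout",
    or "missing" (executable not found).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
        status = "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        status = "timeout"
    except FileNotFoundError:
        status = "missing"
    return status, time.monotonic() - start


if __name__ == "__main__":
    # Default file name is only an example; pass your own path as argv[1].
    pdf = sys.argv[1] if len(sys.argv) > 1 else "Horse-radish_culture_in_Bohemia.pdf"
    status, elapsed = run_with_timeout(build_gs_command(pdf, "page1.jpg"))
    print(f"{status} after {elapsed:.1f}s")
```

Running the same file under different `gs` versions and comparing the elapsed times should show whether the slowness follows the Ghostscript version.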

I have just experienced the same problem also with a djvu file: https://commons.wikimedia.org/wiki/File:Modernczechpoetr00selvialab.djvu

I uploaded another version of the same djvu file, which is fine. The original problematic version (which worked well in my computer but not in Commons) can be found in history.

Until now I have experienced this problem only with files downloaded from archive.org; now, for the first time, I have the same problem with a file downloaded from Hathi Trust Digital Library, see https://commons.wikimedia.org/wiki/File:The_voice_of_an_oppressed_people.pdf . Meanwhile, more people were discussing the problem at en.wikisource, e.g. here: https://en.wikisource.org/w/index.php?title=Wikisource%3AScriptorium%2FHelp&type=revision&diff=8890604&oldid=8890506 or here: https://en.wikisource.org/w/index.php?title=Wikisource%3AScriptorium%2FHelp&type=revision&diff=8935250&oldid=8924491 . It would be nice if someone managed to find a solution. Could this task receive a higher priority?

I made a couple of attempts at re-processing the https://commons.wikimedia.org/wiki/File:The_voice_of_an_oppressed_people.pdf file.
(1) Export all pages as TIFFs (66), then use Acrobat Pro XI to make a new file - slightly smaller file and works OK
(2) Take the original PDF and "Save as Optimized PDF" in Acrobat (it downsamples the images) - file size now one third, and still all OK and reads fine.
I assume it must be an issue with the PDF agent used to create the original file, and/or the images were scanned at too high a bit rate (each TIFF was 4MB in size)

P.S. On the second trial Acrobat did throw a warning of "The PDF document contained image masks that were not downsampled." No idea what that means.

Meanwhile I tried to reprocess it as djvu, which sometimes (not always) helped in the past, but this time it did not: https://commons.wikimedia.org/wiki/File:The_voice_of_an_oppressed_people.djvu (I will nominate it for deletion after a while).

P. S. I am grateful that Ronhjones solved this particular file, but nobody knows how many other users, who are not experienced in both reprocessing the files and complaining at phabricator, are discouraged from downloading PDF books to Commons and Wikisource. Some real solution of the problem is really needed.

It has been found that the complicated workaround of exporting the files to TIFF and then back to PDF flattens the layers and causes considerable loss of picture quality. Also, not everybody has software able to do such a workaround. May I ask if there is any progress ahead?

Gilles closed this task as Resolved. · Feb 12 2019, 1:59 PM
Gilles claimed this task.

Fixed by the ghostscript update (at least for the image in the task description). Try purging affected files and clearing your browser cache.
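If there are many affected files, purging can be scripted instead of done by hand. The following is a minimal sketch, assuming the standard MediaWiki Action API `action=purge` module at the Commons endpoint; the function names and the User-Agent string are hypothetical, and whether a purge alone regenerates every cached thumbnail size is an assumption you should verify on one file first.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_purge_request(titles, user_agent="thumbnail-purge-sketch/0.1 (hypothetical)"):
    """Build a POST request asking the API to purge the given page titles."""
    body = urllib.parse.urlencode({
        "action": "purge",
        "titles": "|".join(titles),  # the API separates multiple titles with "|"
        "format": "json",
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"User-Agent": user_agent},
        method="POST",
    )


def purge(titles):
    """Send the purge request and return the decoded JSON response."""
    with urllib.request.urlopen(build_purge_request(titles), timeout=30) as resp:
        return json.load(resp)
```

For example, `purge(["File:Horse-radish_culture_in_Bohemia.pdf"])` would request a purge of that file page; clearing the browser cache still has to be done separately.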