
Thumbnails of many scanned PDF books show OCR text instead of scanned pages
Closed, Resolved · Public

Event Timeline

Ankry created this task. Jan 20 2019, 7:58 PM
Restricted Application added a subscriber: Aklapper. Jan 20 2019, 7:58 PM
Ankry added a comment. Jan 20 2019, 8:02 PM

Please note: PDF rendering is critical for Wikisources, where users digitize books. Showing the OCR layer instead of the real page images misleads contributors, who may interpret OCR errors as print errors. Please advise on an urgent fix.

Even 404 errors might be better here than "fake" scans.

Matlin added a subscriber: Matlin. Jan 20 2019, 8:17 PM
Aklapper renamed this task from Broken rendering of PDF based books to Thumbnails of many scanned PDF books show OCR text instead of scanned pages. Jan 20 2019, 9:13 PM
Aklapper added a project: Thumbor.
Gilles added a subscriber: Gilles. Jan 21 2019, 11:04 AM

This is another occurrence of the ancient Ghostscript we use in production having issues with specific files. It outputs the following while falling back to the OCR layer:

gilles@thumbor2001:~$ gs -dFirstPage=1 -dLastPage=1 -dNOPAUSE -sDEVICE=jpeg -dJPEG=90 -sOutputFile=foo.jpg -dSAFER -dBATCH -r150 -q Z_pogrzebu_Mickiewicza_na_Wawelu_4go_Lipca_1890_roku.pdf 
unable to decode JPX image data.

   **** Warning: File has insufficient data for an image.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> ABBYY FineReader 11 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

gilles@thumbor2001:~$ gs -version
GPL Ghostscript 9.06 (2012-08-08)
Copyright (C) 2012 Artifex Software, Inc.  All rights reserved.

On the Beta Stretch Thumbor host, the same file converts fine and gives the expected output. This will be solved by the upgrade to Stretch.
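
To spot other affected files before the upgrade, the check above could be scripted roughly as follows. This is only a sketch: the function names `gs_log_has_image_error` and `render_first_page_checked` are hypothetical and not part of Thumbor, the Ghostscript flags are copied verbatim from the command above, and the two matched strings are the ones Ghostscript 9.06 printed for this file, so other versions may word them differently.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not Thumbor code.

# Return 0 if the captured Ghostscript output contains one of the two
# messages seen in this task, 1 otherwise.
gs_log_has_image_error() {
    printf '%s\n' "$1" |
        grep -Eq 'unable to decode JPX image data|insufficient data for an image'
}

# Render page 1 of a PDF with the flags quoted in this task, capture
# stdout and stderr together, and return 1 if the log looks suspect.
# $1 = input PDF, $2 = output JPEG
render_first_page_checked() {
    log=$(gs -dFirstPage=1 -dLastPage=1 -dNOPAUSE -sDEVICE=jpeg -dJPEG=90 \
        -sOutputFile="$2" -dSAFER -dBATCH -r150 -q "$1" 2>&1)
    if gs_log_has_image_error "$log"; then
        echo "suspect render: $1" >&2
        return 1
    fi
    return 0
}
```

Calling `render_first_page_checked some.pdf out.jpg` returns non-zero when either message appears, which is enough to build a list of files worth re-checking on the newer host. A clean log does not prove the thumbnail is correct; it only rules out this particular failure.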

Gilles closed this task as Resolved. Feb 12 2019, 1:56 PM
Gilles claimed this task.

Fixed by the Ghostscript update. Try purging affected files (I purged the ones in the task description).
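
For anyone purging further files by hand, one option is the MediaWiki Action API's `action=purge` module, which must be called with POST. The sketch below is an illustration, not the method used in this task: the helper name `purge_file` and the `CURL` override are hypothetical, and the caller has to supply the `api.php` URL of whichever wiki hosts the file.

```shell
# Hypothetical helper: POST action=purge for one page title.
# $1 = api.php URL of the wiki hosting the file (supplied by the caller)
# $2 = page title, e.g. "File:Example.pdf"
# Set CURL=echo for a dry run that prints the arguments instead of sending them.
purge_file() {
    ${CURL:-curl} -s -X POST "$1" \
        --data-urlencode "action=purge" \
        --data-urlencode "titles=$2" \
        --data-urlencode "format=json"
}
```

Assuming the file lives on Commons, a call might look like `purge_file "https://commons.wikimedia.org/w/api.php" "File:Z_pogrzebu_Mickiewicza_na_Wawelu_4go_Lipca_1890_roku.pdf"`. The same effect is available from the file page's purge action in the web interface.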