
Thumbnails of many scanned PDF books show OCR text instead of scanned pages
Closed, Resolved, Public

Event Timeline

Please note: PDF rendering is critical for Wikisources, where users digitize books. Showing an OCR layer instead of the real page images misleads contributors, who may interpret OCR errors as print errors. Please advise on an urgent fix.

Even 404 errors might be better here than "fake" scans.

Aklapper renamed this task from Broken rendering of PDF based books to Thumbnails of many scanned PDF books show OCR text instead of scanned pages. Jan 20 2019, 9:13 PM
Aklapper added a project: Thumbor.

This is another occurrence of the ancient Ghostscript we use in production having issues with specific files. It outputs the following while falling back to the OCR layer:

gilles@thumbor2001:~$ gs -dFirstPage=1 -dLastPage=1 -dNOPAUSE -sDEVICE=jpeg -dJPEG=90 -sOutputFile=foo.jpg -dSAFER -dBATCH -r150 -q Z_pogrzebu_Mickiewicza_na_Wawelu_4go_Lipca_1890_roku.pdf 
unable to decode JPX image data.

   **** Warning: File has insufficient data for an image.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> ABBYY FineReader 11 <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

gilles@thumbor2001:~$ gs -version
GPL Ghostscript 9.06 (2012-08-08)
Copyright (C) 2012 Artifex Software, Inc.  All rights reserved.

I checked on the Beta Stretch Thumbor host: the file converts fine there and gives the expected output. This will be solved by the upgrade to Stretch.
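Until the upgrade lands, a wrapper could in principle act on the earlier suggestion that an error is better than a "fake" scan. The following is only a rough sketch in Python, not what Thumbor does: the function names are hypothetical, the marker strings are copied from the output quoted above, and treating them as fatal is an assumption. It also assumes the messages may arrive on either stdout or stderr, so it reads both.

```python
import subprocess

# Message fragments taken from the Ghostscript output quoted in this task.
# Treating them as fatal is an assumption of this sketch.
DEGRADED_MARKERS = (
    "unable to decode JPX image data",
    "File has insufficient data for an image",
)


def output_indicates_degraded_render(gs_output: str) -> bool:
    """Return True if Ghostscript's messages suggest a page image was dropped."""
    return any(marker in gs_output for marker in DEGRADED_MARKERS)


def render_first_page(pdf_path: str, out_path: str) -> None:
    """Render page 1 to JPEG, raising instead of keeping a text-only render."""
    cmd = [
        "gs", "-dFirstPage=1", "-dLastPage=1", "-dNOPAUSE", "-dBATCH",
        "-dSAFER", "-sDEVICE=jpeg", "-r150", "-q",
        f"-sOutputFile={out_path}", pdf_path,
    ]
    # Merge stderr into stdout so the check sees every message.
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if result.returncode != 0 or output_indicates_degraded_render(result.stdout):
        raise RuntimeError(
            f"Ghostscript could not render the scan image:\n{result.stdout}"
        )
```

A caller could catch the `RuntimeError` and return an error response rather than serving the OCR-only thumbnail.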

Gilles claimed this task.

Fixed by the Ghostscript update. Try purging affected files (I purged the ones in the task description).