Page MenuHomePhabricator

Thumbnailing fails with HTTP 429 for PDF 1.5 book which renders properly with up-to-date Ghostscript
Closed, InvalidPublicBUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:

  • In the first case, the thumbnails fail to load properly
  • The second case shows that it fails to load with a 429 error specifically

What should have happened instead?:

You should be able to see the actual thumbnails, given that opening the PDF in Firefox or with Evince works fine.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:

There are a bunch of other potentially relevant bugs that share the same theme of "PDFs don't render properly and spit out 429s"; T188885 is the only one I can find that's still open; I'm pretty sure this isn't a PDF/A file, however.

You'll see two revisions of the file, both with the same issue. The first version of the file, when I ran qpdf --check, came up with a warning that the "reported number of objects (11411) is not one plus the highest object number (1160137)". However, after I ran qpdf --linearize in an attempt to fix that issue, the warning went away; the output of that process is the current file revision.

Running gs -sDEVICE=jpeg -dJPEG=90 -sstdout=%stderr -sOutputFile=%stdout -dFirstPage=18 -dLastPage=18 -r150 -dBATCH -dNOPAUSE -dSAFER -q -f"Noted_Negro_Women_(1893)_HathiTrust_scan.pdf" > out.jpg (which as far as I can tell is the command Thumbor runs) on my local machine, with Ghostscript version 9.53.3, successfully produces a JPG with no visible errors. This is taken from the current code, after the fix for T236240.

Event Timeline

Cannot reproduce any problem with the two URL given above.

The thumbnail of the previous revision on https://commons.wikimedia.org/wiki/File:Noted_Negro_Women_(1893)_HathiTrust_scan.pdf is a 503 though.

Vahurzpu closed this task as Resolved.EditedAug 14 2021, 6:46 PM
Vahurzpu claimed this task.

Huh, that's weird; it's started working for me as well. Maybe I just forgot to clear my cache after uploading a new version?

AntiCompositeNumber changed the task status from Resolved to Invalid.Aug 14 2021, 7:13 PM
AntiCompositeNumber removed Vahurzpu as the assignee of this task.

The new version would have continued to 429 even after you uploaded the new version, as the 1-hour failure throttle isn't reset (I've created T288895 for that, though it isn't exactly easy to fix).

Testing the old version locally:

2021-08-14 19:01:43 thumbor:DEBUG [HTTPS] return_contents: /tmp/tmpSDIKsc
2021-08-14 19:01:43 thumbor:DEBUG METRICS: inc: original_image.status.200:1
2021-08-14 19:01:43 thumbor:DEBUG METRICS: inc: original_image.response_bytes:4096
2021-08-14 19:01:43 thumbor:DEBUG [Proxy] Looking for a pdf engine
2021-08-14 19:01:43 thumbor:DEBUG [BWE] Found source file in context
2021-08-14 19:01:43 thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/gs', '-sDEVICE=jpeg', '-dJPEG=90', '-sstdout=%stderr', '-sOutputFile=%stdout', '-dFirstPage=1', '-dLastPage=1', '-r150', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-q', '-f/tmp/tmpSDIKsc']
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Stdout: <too long to display (53292 bytes)>
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Stderr:    **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.

2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Return code: 0
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Duration: 49916.757
2021-08-14 19:02:33 thumbor:DEBUG [IM] Dumping buffer into temp file
2021-08-14 19:02:33 thumbor:DEBUG [ExiftoolRunner] command: ['/usr/bin/exiftool', '-s', '-s', '-ImageSize', '-ProfileDescription', '-ColorType', '-FileType', '-Transparency', '-Artist', '-Copyright', '-ImageDescription', '/tmp/tmpbaWNAs']
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/exiftool', '-s', '-s', '-ImageSize', '-ProfileDescription', '-ColorType', '-FileType', '-Transparency', '-Artist', '-Copyright', '-ImageDescription', '/tmp/tmpbaWNAs']
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Stdout: ImageSize: 1275x1650
ProfileDescription: Artifex Software sRGB ICC Profile
FileType: JPEG

2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Stderr: 
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Return code: 0
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Duration: 91.679
2021-08-14 19:02:33 thumbor:DEBUG [IM] EXIF: {'ProfileDescription': 'Artifex Software sRGB ICC Profile', 'FileType': 'JPEG', 'ImageSize': '1275x1650'}
2021-08-14 19:02:33 thumbor:DEBUG [IM] File has non-sRGB profile
2021-08-14 19:02:33 thumbor:DEBUG [ExiftoolRunner] command: ['/usr/bin/exiftool', '-icc_profile', '-b', '-m', '/tmp/tmpbaWNAs']
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/exiftool', '-icc_profile', '-b', '-m', '/tmp/tmpbaWNAs']
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Stdout: <too long to display (2576 bytes)>
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Stderr: 
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Return code: 0
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Duration: 92.39500000000001
2021-08-14 19:02:33 thumbor:DEBUG [IM] reorientate
2021-08-14 19:02:33 thumbor:DEBUG [IM] resize: 463.0 599.0
2021-08-14 19:02:33 thumbor:DEBUG [IM] jpeg:size hint: '926x1198'
2021-08-14 19:02:33 thumbor:DEBUG [IM] Queued operators: ['-define', 'jpeg:size=926x1198', '-resize', '463x599^', '-gravity', 'center', '-extent', '463x599']
2021-08-14 19:02:33 thumbor:DEBUG No image format specified. Retrieving from the image extension: .jpg.
2021-08-14 19:02:33 thumbor:DEBUG Content Type of image/jpeg detected.
2021-08-14 19:02:33 thumbor:DEBUG [BWE] Defaulting to .jpg
2021-08-14 19:02:33 thumbor:DEBUG [IM] read: .jpg 87
2021-08-14 19:02:33 thumbor:DEBUG [IM] Chroma subsampling: '4:2:0'
2021-08-14 19:02:33 thumbor:DEBUG [IM] Generating image with quality 87
2021-08-14 19:02:33 thumbor:DEBUG [IM] Queued operators: ['-define', 'jpeg:size=926x1198', '-resize', '463x599^', '-gravity', 'center', '-extent', '463x599', '-quality', '87', '-sampling-factor', '4:2:0', '-interlace', 'Plane']
2021-08-14 19:02:33 thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/convert', '-define', 'tiff:exif-properties=no', '-define', 'jpeg:size=926x1198', '-resize', '463x599^', '-gravity', 'center', '-extent', '463x599', '-quality', '87', '-sampling-factor', '4:2:0', '-interlace', 'Plane', '/tmp/tmpbaWNAs[0]', 'jpg:-']
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Stdout: <too long to display (8750 bytes)>
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Stderr: 
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Return code: 0
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Duration: 61.229
2021-08-14 19:02:34 thumbor:DEBUG [IM] Processing EXIF
2021-08-14 19:02:34 thumbor:DEBUG [IM] Putting saved ICC profile into temp file
2021-08-14 19:02:34 thumbor:DEBUG [ExiftoolRunner] command: ['/usr/bin/exiftool', '-m', '-all=', '-icc_profile<=/tmp/tmplpkjKk', '/tmp/tmpQ5Ruuv', '-o', '-']
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Command: ['/usr/bin/timeout', '--foreground', '59', '/usr/bin/exiftool', '-m', '-all=', '-icc_profile<=/tmp/tmplpkjKk', '/tmp/tmpQ5Ruuv', '-o', '-']
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Stdout: <too long to display (8732 bytes)>
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Stderr: 
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Return code: 0
2021-08-14 19:02:34 thumbor:DEBUG [ShellRunner] Duration: 156.601
2021-08-14 19:02:34 thumbor:DEBUG METRICS: timing: engine.processing_time.wikimedia_thumbor.engine.ghostscript:50368
2021-08-14 19:02:34 thumbor:DEBUG METRICS: timing: engine.processing_utime.wikimedia_thumbor.engine.ghostscript:42755
2021-08-14 19:02:34 tornado.access:INFO 200 GET /thumbor/unsafe/463x/filters:page(1)/https://upload.wikimedia.org/wikipedia/commons/archive/c/c6/20210814173111!Noted_Negro_Women_(1893)_HathiTrust_scan.pdf (172.17.0.1) 54302.04ms
2021-08-14 19:02:34 thumbor:DEBUG METRICS: timing: response.time:54301
2021-08-14 19:02:34 thumbor:DEBUG METRICS: timing: response.time.200:54301
2021-08-14 19:02:34 thumbor:DEBUG METRICS: inc: response.status.200:1
2021-08-14 19:02:34 thumbor:DEBUG METRICS: inc: response.format.jpg:1
2021-08-14 19:02:34 thumbor:DEBUG METRICS: timing: response.time.jpg:54301
2021-08-14 19:02:34 thumbor:DEBUG METRICS: inc: response.bytes.jpg:8732
2021-08-14 19:02:34 thumbor:DEBUG [HTTPS] cleanup_temp_file: /tmp/tmpSDIKsc

Ghostscript does recover a workable image, but it takes it quite a while to do so. Wall-clock times on my laptop tend to be slightly faster than in production, so I would not be surprised if the 50 ghostscript execution in this case extended past the 59 second limit. That would cause thumbnail generation to fail (503), do that three times in an hour and you get 429s.

Making the file not invalid/corrupt is the best fix here; that's been done so closing as invalid (not a Thumbor bug).