
Images from PDF displayed with degraded quality ( when background sized smaller than mask and other layers?).
Open, Low, Public

Assigned To
None
Authored By
ShakespeareFan00
Jul 1 2020, 9:44 AM
Referenced Files
F31912999: comparison.png
Jul 1 2020, 6:13 PM
F31912940: 150dpi.png
Jul 1 2020, 6:13 PM
F31912910: 300dpi.jpg
Jul 1 2020, 6:13 PM
F31912908: test.jpg
Jul 1 2020, 6:13 PM
F31912980: Raw01.jpg
Jul 1 2020, 5:54 PM
F31912982: Raw02.jpg
Jul 1 2020, 5:54 PM
Tokens
"Doubloon" token, awarded by Fae.

Description

https://en.wikisource.org/wiki/File:Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1,_Nos._1-26_(IA_catalogoftitleen11118libr).pdf

Pages from this file display with reduced quality in the Commons/Wikisource interface compared to viewing them directly from the PDF in the internal Mozilla viewer.

I suspect this may be due to differences in support for the JPEG variants used internally by the PDF, and it may be worth checking what level of support the underlying tools actually have.

Steps to reproduce the respective images for quality comparison:

(For the Commons/Wikisource side)

  1. https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1%2C_Nos._1-26_%28IA_catalogoftitleen11118libr%29.pdf/page26-1016px-Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1%2C_Nos._1-26_%28IA_catalogoftitleen11118libr%29.pdf.jpg

Direct view:-

  1. https://upload.wikimedia.org/wikipedia/commons/6/68/Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1%2C_Nos._1-26_%28IA_catalogoftitleen11118libr%29.pdf (Using Mozilla Firefox Nightly 79.0.1a)

  2. (Options) Select "No spreads" and scroll down to page 26 of the PDF.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Removing PDF-Rendering as this is not related to PDF styles and functionality, but instead is about thumbnails.
Removing MediaWiki-File-management (see its description) as thumbnails are not in scope.
Adding Thumbor as this is about thumbnails of some file.

@Aklapper : Thanks for the update.

I strongly suspect this is in part due to PDF internals. When I tried to copy an image directly from Acrobat Reader into IrfanView under Windows, it only copied a single layer, not the whole image.

I think some kind of 'flattening' of the images may be needed. What I think is happening is that Thumbor (or some other part of the back-end) is only picking up one basic layer, instead of the whole "flattened" page.

[strike]The viewer code Mozilla Firefox appears to be using is - https://github.com/mozilla/pdf.js (Apache license) . [/strike]

Not as relevant, see subsequent comments.

@ShakespeareFan00: Extracting pages and creating thumbnail files from PDF files (Thumbor) is unrelated to software displaying complete PDF files themselves (Mozilla's pdf.js).

@Aklapper: Internally, pdf.js decodes JPEG images so it can render them in the viewer.

Comparing the approach taken there against the approach taken by Thumbor might prove insightful (although it seems the Thumbor code hands the PDF over to Ghostscript rather than attempting to handle it directly).

(I'm also checking the output in GSview, which is taking for... absolutely... e..v..e..r to render even single pages.)

For comparison:-

Raw01.jpg (1×1 px, 214 KB)
- Output on MediaWiki.

Raw02.jpg (885×649 px, 249 KB)
- Output from GSview 6.0

Even though it's not an exact match, the GSview version is higher quality. So I am wondering where the loss in quality happens, given that GSview should be showing the Ghostscript output.

I see two major quality issues with the rendered JPGs: compression artifacts around the text and missing pixels around the text.

The PDF renders the same on Ghostscript 9.52.

Page 26 has a nominal size of 474 x 646 pts, or 987.5 x 1345.8 px at 150 dpi. MediaWiki rounds that down to 987x1345 px.
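The arithmetic behind those figures can be sketched as follows. This is only an illustration of the point-to-pixel conversion (1 pt = 1/72 inch); the function names are made up for this example and are not MediaWiki or Thumbor code.

```python
import math

POINTS_PER_INCH = 72  # PDF default user space: 1 pt = 1/72 inch


def points_to_pixels(points, dpi):
    """Convert a length in PDF points to pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, dpi):
    """Pixel size after rounding down, matching the rounding described above."""
    return (
        math.floor(points_to_pixels(width_pt, dpi)),
        math.floor(points_to_pixels(height_pt, dpi)),
    )


print(points_to_pixels(474, 150))    # 987.5
print(rendered_size(474, 646, 150))  # (987, 1345)
```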

According to pdfimages, there are two images and a mask on page 26 of the PDF:

$ pdfimages Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1,_Nos._1-26_\(IA_catalogoftitleen11118libr\).pdf -f 26 -l 26 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  26     0 image    1098  1497  rgb     3   8  jpx    no       123  0   167   167 7337B 0.1%
  26     1 image    3292  4490  rgb     3   8  jpx    no       124  0   501   501 49.4K 0.1%
  26     2 mask     3292  4490  -       1   1  jpx    no       124  0   501   501 49.4K 2.7%

Image 0 represents the page itself, while Image 1 combined with the mask contains the text. Both are larger than the page size at 150 dpi.

The same artifacts do not appear in PNG output at 150 dpi, but the masked text is not perfect there either.

Looking closer, the artifacts appear in 8x8 px blocks containing any pixels that came from Image 1 and the mask. Details represented in Image 0 and not Image 1 do not appear to have the same artifacts.

Doubling the output DPI reduces the artifacts around the text.

comparison.png (165×384 px, 40 KB)

This issue appears to be related to how Ghostscript deals with raster masks at lower resolutions. Each point in the PDF corresponds to about 7 pixels from the text image (3292 px across 474 pts), which is a fairly large scaling factor. It appears that GS is scaling the mask layer down first and losing some of the detail in the process. Other PDF viewers are likely operating at a higher DPI (and thus not scaling as much) or scaling after the mask is applied (and retaining more detail).
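A toy one-dimensional example of why the order of operations could matter. This is not Ghostscript's actual algorithm; the majority-vote mask reduction, the factor of 7 and the grey levels are all assumptions chosen purely for illustration.

```python
FACTOR = 7        # roughly 3292 px / 474 pt, from the figures above
BACKGROUND = 255  # white page (assumed grey level)
TEXT = 0          # black text layer (assumed grey level)


def block_chunks(values, factor):
    """Split a row into consecutive blocks of `factor` samples."""
    return [values[i:i + factor] for i in range(0, len(values), factor)]


def mask_first(mask, factor):
    """Reduce the 1-bit mask first (majority vote per block), then composite."""
    small_mask = [sum(b) * 2 > len(b) for b in block_chunks(mask, factor)]
    return [TEXT if bit else BACKGROUND for bit in small_mask]


def composite_first(mask, factor):
    """Composite at full resolution, then average each block down."""
    full = [TEXT if bit else BACKGROUND for bit in mask]
    return [round(sum(b) / len(b)) for b in block_chunks(full, factor)]


# Three output pixels' worth of mask, with a 2-sample-wide stroke in the middle one.
mask = [0] * 7 + [0, 0, 1, 1, 0, 0, 0] + [0] * 7

print(mask_first(mask, FACTOR))       # [255, 255, 255]
print(composite_first(mask, FACTOR))  # [255, 182, 255]
```

In the first case the thin stroke vanishes entirely; in the second it survives as a grey pixel. That would be consistent with the "missing pixels around the text" symptom, but it is only a model of the hypothesis above.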

There are four possible solutions here:

  1. Upload the DjVu from the Internet Archive and use that.
  2. "Flatten" the PDF into one image layer per page.
  3. Change the page size to be closer to the image size. For page 26 at least, that would be 1580 x 2155 pts. This would cause GS to upscale the background image instead of downscaling the text image.
  4. Report the problem upstream, hope that it gets fixed, then wait for a new version of GS to be deployed to Wikimedia. This may be intended behavior or unfixable at the GS level, I don't know. Even if one of the other solutions is chosen, it wouldn't hurt to report this upstream.

  1. Batch uploading the DjVu equivalents for PDF files is feasible if automated (anyone want to write a script?). However, in some instances I am wondering if the DjVu files at IA are generated from the PDF, and thus might inherit related issues.
  2. Can the PDF be 'flattened' using tools available on the WMF servers? Doing it manually for every single page of every PDF uploaded in a recent batch would be very time-consuming for volunteers.
  3. Can this be done with an automated process?
  4. The next question is which repository upstream? (And I will note that GSview didn't display this incorrectly.)

(It reports using Ghostscript 9.19, although the version I have is actually 9.22.) Hmm... Not sure if there is any easy way to quickly test comparative versions of GSview and Ghostscript.
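For the page-size option (option 3 in the list of solutions), the suggested 1580 x 2155 pts figure follows from asking what page size would make the text image map 1:1 onto the output at 150 dpi. A small sketch of that calculation, with a hypothetical helper name:

```python
POINTS_PER_INCH = 72


def page_size_for_image(width_px, height_px, dpi):
    """Page size in points at which the image needs no scaling at `dpi`."""
    return (
        width_px * POINTS_PER_INCH / dpi,
        height_px * POINTS_PER_INCH / dpi,
    )


# Text image (Image 1 and its mask) from the pdfimages listing: 3292 x 4490 px.
width_pt, height_pt = page_size_for_image(3292, 4490, 150)
print(round(width_pt), round(height_pt))  # 1580 2155

# The 1098 px wide background would then be upscaled by roughly this factor.
print(round(3292 / 1098, 2))  # 3.0
```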

Option 5. Provide an option in the image handling in MediaWiki to render at a higher DPI. (I have a strong hunch these are likely to have been scanned at 300/600 dpi at least, if not higher. I wonder what is typical in an archival situation?)

@AntiCompositeNumber : Can you consider some test renderings at 150, 300, 600 and 1200 dpi respectively?

(I doubt 1200 would be needed for general use, but might be needed for some really small print directories or diagram images.)
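For anyone wanting to try those test renderings locally, here is a minimal sketch that builds a Ghostscript command line per DPI. The flags are standard Ghostscript options, but this is not necessarily how Thumbor or MediaWiki invokes it, and `input.pdf` and the output names are placeholders.

```python
def ghostscript_command(pdf_path, page, dpi, output_path):
    """Build a gs command that renders one page to JPEG at the given DPI."""
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE", "-dQUIET",  # run non-interactively
        "-sDEVICE=jpeg",                    # JPEG output device
        f"-r{dpi}",                         # rendering resolution in dpi
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={output_path}",
        pdf_path,
    ]


for dpi in (150, 300, 600, 1200):
    cmd = ghostscript_command("input.pdf", 26, dpi, f"page26-{dpi}dpi.jpg")
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` on a machine with Ghostscript installed; the sketch only prints them.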

ShakespeareFan00 renamed this task from "Images from PDF displayed with degraded quality (or progressive layers not taken into consideration when rendering from PDF." to "Images from PDF displayed with degraded quality ( when background sized smaller than mask and other layers?).". Jul 2 2020, 8:56 AM

From some investigations of DPI levels using IrfanView, I think this 'bug' can be solved by upping the DPI for images generated from PDFs.

Rendering at 300/600 vs 150 dpi seemingly generated the high-quality scans desired :) The next question is how to flag a File: so that the back-end knows to use a different DPI value when generating output.

Question: When it resizes, is Thumbor using a rescale or a resample?

IrfanView is using something it calls Lanczos (I think this is https://en.wikipedia.org/wiki/Lanczos_resampling), but as IrfanView isn't 'free' software I can't link to a repository to check.

The ticket at Ghostscript suggested this: https://bugs.ghostscript.com/show_bug.cgi?id=702531#c1, which is a tweak to the invocation used to render the PDF. It also suggested using something called convert. (On Wikimedia, would this be the Thumbor library?)
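For reference, the textbook Lanczos kernel looks like the sketch below. This is the standard definition from the linked article, not IrfanView's or Thumbor's implementation, and the 1-D resampler with edge clamping is a simplification written for this example.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, otherwise 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def resample_1d(samples, position, a=3):
    """Estimate the signal at a fractional position from nearby samples."""
    base = math.floor(position)
    total = 0.0
    weight_sum = 0.0
    for i in range(base - a + 1, base + a + 1):
        j = min(max(i, 0), len(samples) - 1)  # clamp indices at the edges
        w = lanczos_kernel(position - i, a)
        total += samples[j] * w
        weight_sum += w
    return total / weight_sum  # normalise so flat areas stay flat


row = [255, 255, 0, 0, 255, 255]
print(resample_1d(row, 2.5))
```

The kernel weights each neighbouring sample by its distance, so it is a resample rather than a plain pixel-dropping rescale. Real downscalers usually also widen the kernel by the scale factor, which this sketch leaves out.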