Images from PDF displayed with degraded quality ( when background sized smaller than mask and other layers?).
Open, Low, Public

Assigned To

None

Authored By

	ShakespeareFan00
	Jul 1 2020, 9:44 AM

Description

https://en.wikisource.org/wiki/File:Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1,_Nos._1-26_(IA_catalogoftitleen11118libr).pdf

Pages from this file display with reduced quality in the Commons/Wikisource interface compared to viewing them directly from the PDF in the internal Mozilla viewer.

I suspect this may be due to differences in support for the JPEG variants that the PDF uses internally, and it may be worth checking what level of support the underlying tools actually have.

Steps to reproduce the respective images for quality comparison:

(For the Commons/Wikisource side)

https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1%2C_Nos._1-26_%28IA_catalogoftitleen11118libr%29.pdf/page26-1016px-Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1%2C_Nos._1-26_%28IA_catalogoftitleen11118libr%29.pdf.jpg

Direct view:-

https://upload.wikimedia.org/wikipedia/commons/6/68/Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1%2C_Nos._1-26_%28IA_catalogoftitleen11118libr%29.pdf (Using Mozilla Firefox Nightly 79.0.1a)

2. (Options) Select "No spreads" and scroll down to page 26 of the PDF.

Related Objects

Mentioned In: T224355: Worsening of PDF book scan quality in the Wikisource Page namespace
T278623: Create a Section for Numerically Sequencing Images on Index ns
T257025: Provide a way of serving high quality scans on a per-page basis at Wikisource (such as those hosted at external source)
T256959: Allow PDF's to be rendered at higher (or user specified DPI)
T254459: Large PDF upload issue
Mentioned Here: T224355: Worsening of PDF book scan quality in the Wikisource Page namespace

Event Timeline

ShakespeareFan00 created this task. Jul 1 2020, 9:44 AM

Restricted Application added a project: Commons. Jul 1 2020, 9:44 AM

Restricted Application added a subscriber: Aklapper.

ShakespeareFan00 mentioned this in T254459: Large PDF upload issue. Jul 1 2020, 9:45 AM

Fae awarded a token. Jul 1 2020, 10:09 AM

Fae subscribed.

Removing PDF-Rendering as this is not related to PDF styles and functionality, but instead is about thumbnails.
Removing MediaWiki-File-management (see its description) as thumbnails are not in scope.
Adding Thumbor as this is about thumbnails of some file.

@Aklapper : Thanks for the update.

I strongly suspect this is in part due to PDF internals: when I tried to copy an image directly from Acrobat Reader into IrfanView under Windows, it copied only a single layer, not the whole image.

What may be needed is some kind of 'flattening' of images, since I think Thumbor (or other parts of the back-end) may be picking up only one basic layer instead of the whole "flattened" page.

[strike]The viewer code Mozilla Firefox appears to be using is - https://github.com/mozilla/pdf.js (Apache license) . [/strike]

Not as relevant, see subsequent comments.

@ShakespeareFan00: Extracting pages and creating thumbnail files from PDF files (Thumbor) is unrelated to software displaying complete PDF files themselves (Mozilla's pdf.js).

Aklapper: Internally, pdf.js decodes JPEG images so it can render them in the viewer.

Comparing the approach taken there with the approach taken by Thumbor might prove insightful (although it seems the Thumbor code hands the PDF over to Ghostscript rather than attempting to handle it directly).

(I'm also checking the output in gsview , which is taking for... absolutely... e..v..e..r to render even single pages.)

For comparison:-

- Output on MediaWiki.

- Output from GSview 6.0

Even though it's not an exact match, the GSview version is higher quality... So I am wondering where the loss in quality happens, given that gsview should be showing the Ghostscript output.

I see two major quality issues with the rendered JPGs: compression artifacts around the text and missing pixels around the text.

The PDF renders the same on Ghostscript 9.52.

[Attachment: 150dpi.jpg, 168 KB]

Page 26 has a nominal size of 474 x 646 pts, or 987.5 x 1345.8 px at 150 dpi. MediaWiki rounds that down to 987x1345 px.
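The arithmetic above can be checked with a short sketch. The helper name `points_to_pixels` is hypothetical; the only assumption is the standard PDF unit of 72 points per inch.

```python
import math

POINTS_PER_INCH = 72  # default PDF user space unit: 1 pt = 1/72 inch


def points_to_pixels(points, dpi):
    """Convert a length in PDF points to pixels at the given resolution."""
    return points * dpi / POINTS_PER_INCH


# Nominal page size of page 26, as quoted in the comment above.
width_px = points_to_pixels(474, 150)
height_px = points_to_pixels(646, 150)

# Exact values, then the rounded-down values described above.
print(round(width_px, 1), round(height_px, 1))
print(math.floor(width_px), math.floor(height_px))
```

Changing the `dpi` argument shows how the thumbnail size grows with the rendering resolution.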

According to pdfimages, there are two images and a mask on page 26 of the PDF:

$ pdfimages Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1,_Nos._1-26_\(IA_catalogoftitleen11118libr\).pdf -f 26 -l 26 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
  26     0 image    1098  1497  rgb     3   8  jpx    no       123  0   167   167 7337B 0.1%
  26     1 image    3292  4490  rgb     3   8  jpx    no       124  0   501   501 49.4K 0.1%
  26     2 mask     3292  4490  -       1   1  jpx    no       124  0   501   501 49.4K 2.7%

Image 0 represents the page itself, while Image 1 combined with the mask contains the text. Both are larger than the page size at 150 dpi.

The same artifacts do not appear in PNG output at 150dpi, but the masked text is not exactly perfect either.

[Attachment: 150dpi.png, 488 KB]

Looking closer, the artifacts appear in 8x8 px blocks containing any pixels that came from Image 1 and the mask. Details represented in Image 0 and not Image 1 do not appear to have the same artifacts.

Doubling the output DPI reduces the artifacts around the text.

[Attachment: 300dpi.jpg, 456 KB]

This issue appears to be related to how Ghostscript deals with raster masks at lower resolutions. Each point in the PDF corresponds to about 7 pixels of the text image (3292 px across 474 pts), which is a fairly large scaling factor. It appears that GS is scaling the mask layer down first and losing some of the detail in the process. Other PDF viewers are likely operating at a higher DPI (and thus not scaling as much) or scaling after the mask is applied (and retaining more detail).
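As a rough illustration of the scaling involved, the sketch below computes how many pixels of the text image end up in each output pixel at a few resolutions. It uses only the widths quoted in this task; the function name `downscale_factor` is hypothetical, and it says nothing about how Ghostscript actually resamples.

```python
POINTS_PER_INCH = 72  # default PDF user space unit


def downscale_factor(image_px, page_pts, dpi):
    """Source pixels per output pixel when the image spans the whole page."""
    output_px = page_pts * dpi / POINTS_PER_INCH
    return image_px / output_px


# Text image width (3292 px) over the page width (474 pts).
pixels_per_point = 3292 / 474
print(round(pixels_per_point, 2))

# A factor above 1 means the text image is shrunk; below 1 it is enlarged.
for dpi in (150, 300, 600):
    print(dpi, round(downscale_factor(3292, 474, dpi), 2))
```

At 150 dpi the text layer is shrunk by more than a factor of three, while at 600 dpi it would be slightly enlarged instead, which fits the observation that higher DPI output shows fewer artifacts.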

There are four possible solutions here:

1. Upload the DjVu from the Internet Archive and use that.
2. "Flatten" the PDF into one image layer per page.
3. Change the page size to be closer to the image size. For page 26 at least, that would be 1580 x 2155 pts. This would cause GS to upscale the background image instead of downscaling the text image.
4. Report the problem upstream, hope that it gets fixed, then wait for a new version of GS to be deployed to Wikimedia. This may be intended behavior or unfixable at the GS level, I don't know. Even if one of the other solutions is chosen, it wouldn't hurt to report this upstream.
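The page size suggested for page 26 follows from the text image dimensions and the 150 dpi rendering resolution. A minimal sketch, with a hypothetical helper name and assuming 72 points per inch:

```python
POINTS_PER_INCH = 72  # default PDF user space unit


def page_size_for_image(width_px, height_px, dpi=150):
    """Page size in points at which the image maps 1:1 to pixels at `dpi`."""
    return (width_px * POINTS_PER_INCH / dpi,
            height_px * POINTS_PER_INCH / dpi)


# Dimensions of the text image (image 1) reported by pdfimages.
width_pts, height_pts = page_size_for_image(3292, 4490)
print(round(width_pts), round(height_pts))

# With that page size the 1098 px wide background would be upscaled by:
print(round(3292 / 1098, 2))
```

This only reproduces the numbers; actually rewriting the page boxes of the PDF would need a separate tool.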

AntiCompositeNumber triaged this task as Low priority. Jul 1 2020, 6:14 PM

AntiCompositeNumber moved this task from Incoming to Thumbnail and file renderings on the Commons board.

AntiCompositeNumber moved this task from Backlog to Thumbnail quality on the Thumbor board.

AntiCompositeNumber moved this task from Backlog to To upstream/missing upstream link on the Upstream board.

Batch uploading the DjVu equivalents for PDF files is feasible if automated (anyone want to write a script?). However, in some instances I am wondering if the DjVu files at IA are generated from the PDF, and thus might inherit related issues.

Can the PDF be 'flattened' using tools available on the WMF servers? Doing it manually for every single page of every PDF uploaded in a recent batch would be very time consuming for volunteers.

Can this be done with an automated process?
The next question is which repository upstream? (And I will note that gsview didn't display this incorrectly.)

(It reports using Ghostscript 9.19, although the version I have is actually 9.22.) Hmm... Not sure if there is any easy way to quickly test and compare versions of GSview and Ghostscript.

The next question is which repository upstream?

https://bugs.ghostscript.com/

Logged upstream as https://bugs.ghostscript.com/show_bug.cgi?id=702531

If you have a login there, feel free to expand.

Option 5. Provide an option in the image handling at MediaWiki to render at a higher DPI. (I have a strong hunch these are likely to have been scanned at 300/600 dpi at least, if not higher. I wonder what is typical in an archival situation?)

@AntiCompositeNumber : Can you consider some test renderings at 150, 300, 600 and 1200 dpi respectively?

(I doubt 1200 would be needed for general use, but might be needed for some really small print directories or diagram images.)
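For anyone wanting to try such test renderings locally, the sketch below builds Ghostscript command lines for one page at several resolutions. It is an assumption about a reasonable local invocation, not the command MediaWiki or Thumbor actually runs; `input.pdf`, the output names and the helper name are placeholders, and the commands are only printed, not executed.

```python
def ghostscript_render_command(pdf_path, page, dpi, output_path):
    """Build a Ghostscript command rendering a single page to JPEG."""
    return [
        "gs",
        "-dSAFER", "-dBATCH", "-dNOPAUSE",   # non-interactive batch run
        "-sDEVICE=jpeg",                      # JPEG output device
        f"-r{dpi}",                           # rendering resolution in dpi
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={output_path}",
        pdf_path,
    ]


# One command per resolution mentioned above, for page 26.
commands = [
    ghostscript_render_command("input.pdf", 26, dpi, f"page26-{dpi}dpi.jpg")
    for dpi in (150, 300, 600, 1200)
]

for command in commands:
    print(" ".join(command))
```

Each printed line can be pasted into a shell where `gs` is installed; rendering at 1200 dpi may be slow and memory-hungry for large pages.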

ShakespeareFan00 renamed this task from "Images from PDF displayed with degraded quality (or progressive layers not taken into consideration when rendering from PDF." to "Images from PDF displayed with degraded quality ( when background sized smaller than mask and other layers?).". Jul 2 2020, 8:56 AM

Is this also related to T224355 ?

From some investigations of DPI levels using IrfanView, I think this 'bug' can be solved by upping the DPI for images generated from PDFs.

Rendering at 300/600 vs 150 seemingly generated the high-quality scans desired :) The next question is how to flag a File: so that the backend knows to use a different DPI value when generating output.

ShakespeareFan00 mentioned this in T256959: Allow PDF's to be rendered at higher (or user specified DPI). Jul 2 2020, 11:38 AM

Question: When it resizes, is Thumbor using a rescale or a Resample?

IrfanView is using something it calls Lanczos (I think this is https://en.wikipedia.org/wiki/Lanczos_resampling), but as IrfanView isn't 'free' software I can't link to a repository to check.
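For reference, the Lanczos kernel from that Wikipedia article is short enough to sketch directly. This is the textbook definition with window size a = 3 as an assumed default; it is not a claim about what IrfanView, Thumbor or Ghostscript implement.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, otherwise 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    # sin(pi x)/(pi x) * sin(pi x/a)/(pi x/a), simplified to one fraction.
    return a * math.sin(px) * math.sin(px / a) / (px * px)


# Weights a resampler would give to source pixels at these distances
# from the position of the output sample.
for distance in (0, 0.5, 1, 1.5, 2.5, 3):
    print(distance, round(lanczos_kernel(distance), 4))
```

A resampler weights nearby source pixels by this kernel and normalises the weights, which is what distinguishes it from simply picking or averaging the nearest pixels.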

The ticket at Ghostscript suggested this: https://bugs.ghostscript.com/show_bug.cgi?id=702531#c1, which is a tweak to the invocation used to render the PDF. It also suggested using something called convert. (On Wikimedia, would this be the Thumbor library?)

ShakespeareFan00 mentioned this in T257025: Provide a way of serving high quality scans on a per-page basis at Wikisource (such as those hosted at external source). Jul 3 2020, 9:04 AM

Languageseeker mentioned this in T278623: Create a Section for Numerically Sequencing Images on Index ns. Mar 31 2021, 12:33 PM

Meow moved this task from Thumbnail and file renderings to Incoming on the Commons board. Jul 28 2021, 9:04 AM

Meow moved this task from Incoming to Thumbnail and file renderings on the Commons board. Jul 28 2021, 9:26 AM

ShakespeareFan00 mentioned this in T224355: Worsening of PDF book scan quality in the Wikisource Page namespace. Nov 4 2022, 7:51 PM

	F31912999: comparison.png
	Jul 1 2020, 6:13 PM

	F31912940: 150dpi.png
	Jul 1 2020, 6:13 PM

	F31912910: 300dpi.jpg
	Jul 1 2020, 6:13 PM

	F31912908: test.jpg
	Jul 1 2020, 6:13 PM
