
Worsening of PDF book scan quality in the Wikisource Page namespace
Open, Needs Triage, Public

Description

In the Wikisource Page namespace, the quality of PDF book scans uploaded to Wikimedia Commons is artificially degraded.
Examples:

  1. PDF in Page namespace - PDF on Commons (3rd page)
  2. PDF in Page namespace - PDF on Commons (287th page)

Event Timeline

Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?

> Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?

They are worse in the thumbnails.

It may be related to T184867. Does the first page of the PDF have a smaller size or resolution than the other ones?

> Does the first page of the PDF have a smaller size or resolution than the other ones?

Yes.
This PDF: the first page is 347x524 (22.4 kB), the others about 377x529 (42.3 kB).
This PDF: the first page is 163x253 (2.49 kB), the others about 166x255 (3.90 kB).
The title page usually has a smaller size, because it contains less text.

PS: It's not only the lower resolution in ProofreadPage; the text itself becomes fuzzier than on Commons.

High-resolution thumbnails from the file, such as:
https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf/page7-1834px-%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf.jpg
look poor and exhibit artifacts likely resulting from scaling up a JPG image with lossy compression.

Meanwhile, the PBM images extracted from this PDF file using the pdfimages program from the xpdf package, which are the same resolution:

$ pdfimages -f 7 -l 7 Пушкин._Евгений_Онегин_\(1837\).pdf x
$ identify x-000.pbm
x-000.pbm PBM 1834x2829 1834x2829+0+0 1-bit Bilevel Gray 651KB 0.010u 0:00.010

are of much higher quality.
It seems that, for some reason, the software used here extracts images from the PDF at less than maximum quality. This is unrelated to the first-page quality issue.

The PDF thumbnailing works in two steps: first, Ghostscript extracts a page to JPG at the original resolution of the page; then ImageMagick scales it to the requested size. The pages in this document vary in size around 340x530 ±5 px, which causes some of the quality loss as images are upscaled. (ProofreadPage thinks every page in the file is 339 × 527.)


The PDF says the page is 338x527 px, so Ghostscript writes a JPEG at 338x527. That doesn't sound like a bug in the thumbnailer to me; it sounds like a problem with the PDF.

The problem here is a mismatch between the PDF's page size (in pts) and the image stream size.

Taking the file above (saved as push.pdf) and considering only the x-axis (the same logic holds for the y-axis):

$ pdfimages push.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1813  2815  gray    1   1  ccitt  no      1899  0   800   800 2140B 0.3%

So we see we have an image of width 1813 (units: [px]) with a PPI of 800 (units: [px / in]).

However, the PDF page size is specified in pt, which is defined as 1/72 inch. This leads to a page size of (1813 [px] / 800 [px/in]) * 72 [pt/in] = 163.17 [pt], which is indeed exactly what we see:

$ pdfinfo push.pdf
...
Page size:      163.17 x 253.35 pts
...

This page then appears to be rendered by the thumbnailer at 150 ppi, resulting in 163.17 [pt] / 72 [pt/in] * 150 [px/in] = 339.93 [px], which is truncated to 339 px.

Because the PDF images are specified as being small but "dense" (imagine a physically small book printed on a very high-quality press), rendering this PDF at a DPI like 150 is going to produce very small downscaled images (like taking a photo of said book with a rubbish camera). To make this image better, it should be rendered at a higher DPI, where 800 would produce the original images.
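The arithmetic above can be checked end to end with a short sketch (the numbers are taken from the pdfimages and pdfinfo output for push.pdf; this is an illustration, not the thumbnailer's actual code):

```python
# Reproduce the page-size and render-size arithmetic from the comments above.
# Numbers come from the pdfimages/pdfinfo output for push.pdf.

POINTS_PER_INCH = 72  # PDF user-space points per inch, by definition

image_px = 1813       # width of the embedded image stream, in pixels
image_ppi = 800       # resolution of the image stream, px per inch

# The PDF page size is stated in points:
page_pt = image_px / image_ppi * POINTS_PER_INCH
print(round(page_pt, 2))  # 163.17, matching "Page size: 163.17 x 253.35 pts"

# Rendering that page at the thumbnailer's 150 dpi:
render_px = int(page_pt / POINTS_PER_INCH * 150)
print(render_px)          # 339 px wide, far below the 1813 px in the file

# The DPI needed to recover the embedded image at full resolution:
full_dpi = image_px / page_pt * POINTS_PER_INCH
print(round(full_dpi))    # 800
```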


This calculation can be seen in PdfImage.php, line 99:

$width  = intval( trim( $size[0] ) / 72 * $wgPdfHandlerDpi );

150 dpi is the default for $wgPdfHandlerDpi, per https://www.mediawiki.org/wiki/Extension:PdfHandler.
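For illustration, the same computation sketched in Python (PdfHandler itself is PHP; int() here mimics PHP's intval() truncation):

```python
# Sketch of the PdfImage.php calculation above, using the page size
# pdfinfo reported for push.pdf (163.17 x 253.35 pts) and the default
# $wgPdfHandlerDpi of 150. int() truncates, as PHP's intval() does.

wg_pdf_handler_dpi = 150  # mirrors the $wgPdfHandlerDpi global

def thumb_px(size_pt: float) -> int:
    return int(size_pt / 72 * wg_pdf_handler_dpi)

print(thumb_px(163.17), thumb_px(253.35))  # 339 527 - the size ProofreadPage reports
```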


Normally, PDFs that come from the Internet Archive (IA) have images like this:

$  pdfimages conspiracietrago00chap_0.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     808  1198  rgb     3   8  jpx    no       756  0   167   167 8234B 0.3%
   1     1 image    2422  3593  rgb     3   8  jpx    no       757  0   501   500 15.0K 0.1%
   1     2 mask     2422  3593  -       1   1  jpx    no       757  0   501   500 15.0K 1.4%

Here the "main content" of the image is 500 ppi (sometimes a tiny bit off due to rounding of some sort), and the 167 ppi image is the (low-frequency) background layer of the MRC separation.

This means that the assumed 150 dpi is, in general, too low for all IA PDFs, and much too low for a file that has 800 dpi images.
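As a worked example, take the 501-ppi "main content" layer of the IA page listed above (2422 px wide) and render it at the default 150 dpi:

```python
# How much resolution the default 150 dpi discards for the IA file above.
image_px = 2422   # width of the 501-ppi "main content" image stream
image_ppi = 501

page_pt = image_px / image_ppi * 72    # page width in PDF points
render_px = int(page_pt / 72 * 150)    # width when rendered at 150 dpi
print(render_px)                       # 725 - less than a third of the 2422 px stored
```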

As it stands, the PdfHandler code has no capability to adjust its rendering DPI, since it's always reading the global config variable $wgPdfHandlerDpi.

It's possible that we should bump this up, considering that so many IA PDFs (1 million and counting) are on Commons, and they all have a DPI around 500. And in general, 150 dpi is pretty rubbish anyway: it means even a vector-only A4 page will only ever render at 1240 px across, well under the full width of a 1080p screen.
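A quick check of the A4 figure (assuming the usual 210 mm A4 width):

```python
# An A4 page is 210 mm wide; rendered at 150 dpi that comes to:
A4_WIDTH_MM = 210
MM_PER_INCH = 25.4

print(round(A4_WIDTH_MM / MM_PER_INCH * 150))  # 1240
```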

Another option is that PdfHandler gets surgery to enable it to render at the highest DPI on a given page. Note: pdfimages -l is about 3000 times (!) slower than pdfinfo -l 99999.

Another option is to figure out a way to request a thumbnail at a different DPI, with $wgPdfHandlerDpi used as a default: this is T256959.