
Worsening of PDF book scan quality in the Wikisource Page namespace
Open, Needs Triage, Public

Description

In the Wikisource Page namespace, the quality of PDF book scans uploaded to Wikimedia Commons is artificially degraded.
Examples:

  1. PDF in Page namespace - PDF on Commons (3rd page)
  2. PDF in Page namespace - PDF on Commons (287th page)

Event Timeline

Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?

> Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?

They are worse in the thumbnails.

It may be related to T184867. Does the first page of the PDF have a smaller size or resolution than the other ones?

> Does the first page of the PDF have a smaller size or resolution than the other ones?

Yes.
This PDF: the first page is 347x524 (22.4 kB), the others about 377x529 (42.3 kB).
This PDF: the first page is 163x253 (2.49 kB), the others about 166x255 (3.90 kB).
The title page usually has a smaller size, because it contains less text.

PS: It's not only the lower resolution in ProofreadPage; the text itself becomes fuzzier than on Commons.

High-resolution thumbnails from the file, such as:
https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf/page7-1834px-%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf.jpg
look poor and exhibit artifacts likely resulting from scaling up a JPG image with lossy compression.

Meanwhile, the PBM images extracted from this PDF file using the pdfimages program from the xpdf package, which are the same resolution:

$ pdfimages -f 7 -l 7 Пушкин._Евгений_Онегин_\(1837\).pdf x
$ identify x-000.pbm
x-000.pbm PBM 1834x2829 1834x2829+0+0 1-bit Bilevel Gray 651KB 0.010u 0:00.010

are of much higher quality.
It seems that, for some reason, the software used here extracts images from the PDF at less than maximum quality. This is unrelated to the first-page quality issue.

The PDF thumbnailing works in two steps: first, Ghostscript extracts a page to JPG at the original resolution of the page; then ImageMagick scales it to the requested size. The pages in this document vary in size around 340x530 ±5 px, which causes some of the quality loss as images are upscaled. (ProofreadPage thinks every page in the file is 339 × 527.)


The PDF says the page is 338x527 px, so Ghostscript writes a JPEG at 338x527. That doesn't sound like a bug in the thumbnailer to me; it sounds like a problem with the PDF.

The problem here is a mismatch between the PDF's page size (in pts) and the image stream size.

Taking the file above (saved as push.pdf) and considering only the x-axis (the same logic holds for the y-axis):

$ pdfimages push.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1813  2815  gray    1   1  ccitt  no      1899  0   800   800 2140B 0.3%

So we see we have an image of width 1813 (units: [px]) with a PPI of 800 (units: [px / in]).

However, the PDF page size is specified in pt, which is defined as 1/72 inch. This leads to a page size of (1813 [px] / 800 [px/in]) * 72 [pt/in] = 163.17 [pt], which is indeed exactly what we see:

$ pdfinfo push.pdf
...
Page size:      163.17 x 253.35 pts
...

This page then appears to be rendered by the thumbnailer at 150 ppi, resulting in 163.17 [pt] / 72 [pt/in] * 150 [px/in] = 339.93 [px], which is truncated to 339 px.

Because the PDF images are specified as being small but "dense" (imagine a physically small book printed on a very high-quality press), rendering this PDF at a DPI like 150 is going to produce very small downscaled images (like taking a photo of said book with a rubbish camera). To make this image better, it should be rendered at a higher DPI, where 800 would produce the original images.
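The arithmetic above can be checked end to end with a short sketch (the numbers are taken from the pdfimages and pdfinfo output for push.pdf; this is an illustration, not the thumbnailer's actual code):

```python
# Reproduce the page-size and render-size arithmetic from the comments above.
# Numbers come from the pdfimages/pdfinfo output for push.pdf.

POINTS_PER_INCH = 72  # PDF user-space points per inch, by definition

image_px = 1813       # width of the embedded image stream, in pixels
image_ppi = 800       # resolution of the image stream, px per inch

# The PDF page size is stated in points:
page_pt = image_px / image_ppi * POINTS_PER_INCH
print(round(page_pt, 2))  # 163.17, matching "Page size: 163.17 x 253.35 pts"

# Rendering that page at the thumbnailer's 150 dpi:
render_px = int(page_pt / POINTS_PER_INCH * 150)
print(render_px)          # 339 px wide, far below the 1813 px in the file

# The DPI needed to recover the embedded image at full resolution:
full_dpi = image_px / page_pt * POINTS_PER_INCH
print(round(full_dpi))    # 800
```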


This calculation can be seen in PdfImage.php, line 99:

$width  = intval( trim( $size[0] ) / 72 * $wgPdfHandlerDpi );

150 dpi is the default for $wgPdfHandlerDpi, per https://www.mediawiki.org/wiki/Extension:PdfHandler.
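For illustration, the same computation sketched in Python (PdfHandler itself is PHP; int() here mimics PHP's intval() truncation):

```python
# Sketch of the PdfImage.php calculation above, using the page size
# pdfinfo reported for push.pdf (163.17 x 253.35 pts) and the default
# $wgPdfHandlerDpi of 150. int() truncates, as PHP's intval() does.

wg_pdf_handler_dpi = 150  # mirrors the $wgPdfHandlerDpi global

def thumb_px(size_pt: float) -> int:
    return int(size_pt / 72 * wg_pdf_handler_dpi)

print(thumb_px(163.17), thumb_px(253.35))  # 339 527 - the size ProofreadPage reports
```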


Normally, PDFs that come from the Internet Archive (IA) have images like this:

$  pdfimages conspiracietrago00chap_0.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     808  1198  rgb     3   8  jpx    no       756  0   167   167 8234B 0.3%
   1     1 image    2422  3593  rgb     3   8  jpx    no       757  0   501   500 15.0K 0.1%
   1     2 mask     2422  3593  -       1   1  jpx    no       757  0   501   500 15.0K 1.4%

Here the "main content" of the image is 500 ppi (sometimes a tiny bit off due to rounding of some sort), and the 167 ppi image is the (low-frequency) background layer of the MRC separation.

This means that the assumed 150 dpi is, in general, too low for all IA PDFs, and much too low for a file that has 800 dpi images.
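As a worked example, take the 501-ppi "main content" layer of the IA page listed above (2422 px wide) and render it at the default 150 dpi:

```python
# How much resolution the default 150 dpi discards for the IA file above.
image_px = 2422   # width of the 501-ppi "main content" image stream
image_ppi = 501

page_pt = image_px / image_ppi * 72    # page width in PDF points
render_px = int(page_pt / 72 * 150)    # width when rendered at 150 dpi
print(render_px)                       # 725 - less than a third of the 2422 px stored
```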

As it stands, the PdfHandler code has no capability to adjust its rendering DPI, since it's always reading the global config variable $wgPdfHandlerDpi.

It's possible that we should bump this up, considering that so many IA PDFs (1 million and counting) are on Commons, and they all have a DPI around 500. And in general, 150 dpi is pretty rubbish anyway: it means even a vector-only A4 page will only ever render at 1240 px across, well under the full width of a 1080p screen.
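A quick check of the A4 figure (assuming the usual 210 mm A4 width):

```python
# An A4 page is 210 mm wide; rendered at 150 dpi that comes to:
A4_WIDTH_MM = 210
MM_PER_INCH = 25.4

print(round(A4_WIDTH_MM / MM_PER_INCH * 150))  # 1240
```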

Another option is that PdfHandler gets surgery to enable it to render at the highest DPI on a given page. Note: pdfimages -l is about 3000 times (!) slower than pdfinfo -l 99999.

Another option is to figure out a way to request a thumbnail at a different DPI, with $wgPdfHandlerDpi used as a default: this is T256959.