In the Wikisource Page namespace, the quality of a PDF book scan uploaded to Wikimedia Commons is artificially degraded.
Examples:
Description
Related Objects
- Mentioned In
  - T135313: PDF file lost its resolution on proofreading edit mode
  - T43614: ProofreadPage does not use image's full resolution when zooming in
  - T287653: Add button to Index edit form to prefetch all image thumbs
  - T278623: Create a Section for Numerically Sequencing Images on Index ns
  - T257025: Provide a way of serving high quality scans on a per-page basis at Wikisource (such as those hosted at external source)
  - T256848: Images from PDF displayed with degraded quality ( when background sized smaller than mask and other layers?).
- Mentioned Here
  - T256848: Images from PDF displayed with degraded quality ( when background sized smaller than mask and other layers?).
  - T256959: Allow PDF's to be rendered at higher (or user specified DPI)
  - T184867: Unexpectedly low scan resolution in Page namespace for some DjVu books
Event Timeline
Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?
It may be related to T184867. Does the first page of the PDF have a smaller size or resolution than the others?
Yes.
This PDF: the first page is 347x524 (22.4 kB), the others about 377x529 (42.3 kB)
This PDF: the first page is 163x253 (2.49 kB), the others about 166x255 (3.90 kB)
The title page usually has a smaller file size because it contains less text.
PS. It's not only that the resolution is lower in ProofreadPage; the text itself is fuzzier than on Commons.
High resolution thumbnails from the file, like:
https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf/page7-1834px-%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf.jpg
look poor and exhibit artifacts, likely the result of scaling up a JPEG image with lossy compression.
By contrast, the PBM images extracted from this PDF file with the pdfimages program from the xpdf package, which have the same resolution:
$ pdfimages -f 7 -l 7 Пушкин._Евгений_Онегин_\(1837\).pdf x
$ identify x-000.pbm
x-000.pbm PBM 1834x2829 1834x2829+0+0 1-bit Bilevel Gray 651KB 0.010u 0:00.010
are much higher quality.
It seems that for some reason non-maximum quality images are extracted from the PDF by the software used here. It is unrelated to the first page quality.
The PDF thumbnailing works in two steps: First, Ghostscript extracts a page to JPG at the original resolution of the page. Then ImageMagick scales it to the requested size. The pages in this document vary in size around about 340x530 +/- 5 px, which causes some of the quality loss as images are upscaled. (ProofreadPage thinks every page in the file is 339 × 527).
The PDF says the page is 338x527px, so Ghostscript writes a JPEG at 338x527. That doesn't sound like a bug in the thumbnailer to me; it sounds like a problem with the PDF.
The problem here is a mismatch between the PDF's page size (in pts) and the image stream size.
Taking the file above (saved as push.pdf) and considering only the x-axis (the same logic holds for the y-axis):
$ pdfimages push.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1813  2815  gray    1   1  ccitt  no      1899  0   800   800 2140B 0.3%
So we see we have an image of width 1813 (units: [px]) with a PPI of 800 (units: [px / in]).
However, the PDF page size is specified in pt, which is defined as 1/72 inch. This leads to a page size of (1813 [px] / 800 [px/in]) * 72 [pt/in] = 163.17 [pt]. Which is indeed exactly what we see:
$ pdfinfo push.pdf
...
Page size:      163.17 x 253.35 pts
...
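The page-size arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `page_size_pt` is hypothetical, and the pixel and PPI values are the ones pdfimages reports for page 1.

```python
POINTS_PER_INCH = 72  # a PDF point is defined as 1/72 inch


def page_size_pt(width_px, height_px, x_ppi, y_ppi):
    """Page size in points implied by an image's pixel size and its PPI."""
    width_pt = width_px / x_ppi * POINTS_PER_INCH
    height_pt = height_px / y_ppi * POINTS_PER_INCH
    return width_pt, height_pt


# Values taken from the `pdfimages -list` output for page 1.
width_pt, height_pt = page_size_pt(1813, 2815, 800, 800)
print(f"{width_pt:.2f} x {height_pt:.2f} pts")  # 163.17 x 253.35 pts
```

Both axes reproduce the page size printed by pdfinfo, which supports the reading that the page box was derived directly from the image size and its declared 800 PPI.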
This then appears to be rendered by the thumbnailer at 150 ppi, resulting in 163.17 [pt] / 72 [pt/in] * 150 [px/in] = 339.94 [px], which is truncated to 339 px.
Because the PDF images are specified as being small but "dense" (imagine a physically small book printed on a very high-quality press), rendering this PDF at a DPI like 150 is going to produce very small downscaled images (like taking a photo of said book with a rubbish camera). To make this image better, it should be rendered at a higher DPI, where 800 would produce the original images.
This calculation can be seen in PdfImage.php, line 99:
$width = intval( trim( $size[0] ) / 72 * $wgPdfHandlerDpi );
150 dpi is the default for $wgPdfHandlerDpi per https://www.mediawiki.org/wiki/Extension:PdfHandler
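As a rough illustration, the same computation can be mirrored in Python. This is a sketch, not PdfHandler's code: the function name `rendered_size_px` is hypothetical, and it assumes the height is computed with the same formula as the width line quoted above. Python's `int()` truncates toward zero, like PHP's `intval()` on a float.

```python
POINTS_PER_INCH = 72
DEFAULT_DPI = 150  # documented default of $wgPdfHandlerDpi


def rendered_size_px(width_pt, height_pt, dpi=DEFAULT_DPI):
    """Pixel size of a page rendered at `dpi`, truncated toward zero."""
    width_px = int(width_pt / POINTS_PER_INCH * dpi)
    height_px = int(height_pt / POINTS_PER_INCH * dpi)
    return width_px, height_px


# Page size reported by pdfinfo for page 1 of the file discussed above.
width_px, height_px = rendered_size_px(163.17, 253.35)
print(width_px, height_px)  # 339 527

# Upscaling needed to serve the 1834px-wide thumbnail linked earlier
# from a render that is only 339 px wide.
upscale = 1834 / width_px
print(f"upscale factor: {upscale:.1f}x")  # upscale factor: 5.4x
```

The 339 x 527 result matches the size ProofreadPage reports for every page, and the roughly fivefold upscale is consistent with the fuzzy text and JPEG artifacts seen in the large thumbnails.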
Normally, PDFs that come from the IA have images like this:
$ pdfimages conspiracietrago00chap_0.pdf -f 1 -l 1 -list
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     808  1198  rgb     3   8  jpx    no       756  0   167   167 8234B 0.3%
   1     1 image    2422  3593  rgb     3   8  jpx    no       757  0   501   500 15.0K 0.1%
   1     2 mask     2422  3593  -       1   1  jpx    no       757  0   501   500 15.0K 1.4%
Where the "main content" of the image is 500ppi (sometimes a tiny bit off due to rounding of some sort), and the 167 DPI image is the (low-frequency) background layer of the MRC separation.
This means that assuming 150 dpi is, in general, too low for all IA PDFs, and much too low for a file that has 800dpi images.
As it stands, the PdfHandler code has no capability to adjust its rendering DPI, since it's always reading the global config variable $wgPdfHandlerDpi.
It's possible that we should bump this up, considering that so many (1 million and counting) IA PDFs are at Commons, and they all have a DPI around 500. And in general, 150 dpi is pretty rubbish anyway - it means even a vector-only A4 page will only ever render at 1240 px across, which is only just over half of a 1080p screen.
Another option is that PdfHandler gets surgery to enable it to render at the highest DPI on a given page. Note: pdfimages -l is about 3000 times (!) slower than pdfinfo -l 99999.
Another option is to figure out a way to request a thumbnail at a different DPI, with $wgPdfHandlerDpi used as a default: this is T256959
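For the "render at the highest DPI on a given page" option, a minimal sketch of how the per-page maximum could be picked out of `pdfimages -list` output is shown below. Everything here is an assumption for illustration: the function name `max_ppi_per_page` is hypothetical, the parser relies on the 16-column layout shown in the listings above, and the sample text is the IA listing from this thread rather than live output.

```python
SAMPLE_LIST_OUTPUT = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image     808  1198  rgb     3   8  jpx    no       756  0   167   167 8234B 0.3%
   1     1 image    2422  3593  rgb     3   8  jpx    no       757  0   501   500 15.0K 0.1%
   1     2 mask     2422  3593  -       1   1  jpx    no       757  0   501   500 15.0K 1.4%
"""


def max_ppi_per_page(list_output):
    """Map page number -> highest x/y PPI among that page's images."""
    best = {}
    for line in list_output.splitlines():
        fields = line.split()
        # Data rows have 16 whitespace-separated fields and start with a
        # page number; the header and the dashed rule are skipped.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        page = int(fields[0])
        ppi = max(int(fields[12]), int(fields[13]))  # x-ppi, y-ppi columns
        best[page] = max(best.get(page, 0), ppi)
    return best


result = max_ppi_per_page(SAMPLE_LIST_OUTPUT)
print(result)  # {1: 501}
```

Given how much slower pdfimages is than pdfinfo, a value like this would presumably need to be computed once and cached rather than on every thumbnail request.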
Also cross referencing - T256959 and T256848.
There was a recent patch to allow for specifying a different DPI (https://gerrit.wikimedia.org/r/q/Ib61eb2bd822ce8e6c60fbfc9e7090c9ba17627cb), but I'm not sure if it's fully integrated in terms of API/UI ability to make use of it.