
Worsening of PDF book scan quality in the Wikisource Page namespace
Open, Needs Triage · Public

Description

In the Wikisource Page namespace, the quality of a PDF book scan uploaded to Wikimedia Commons is artificially degraded.
Examples:

  1. PDF in Page namespace - PDF on Commons (3rd page).
  2. PDF in Page namespace - PDF on Commons (287th page).

Event Timeline

Ratte created this task. May 25 2019, 9:58 PM
Restricted Application added a subscriber: Aklapper. May 25 2019, 9:58 PM
Ratte updated the task description. May 25 2019, 10:08 PM
Restricted Application added a project: Multimedia. May 25 2019, 10:28 PM
Reedy added a subscriber: Reedy. May 26 2019, 12:20 AM

Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?

Are they worse on the original (in which case, there is almost nothing we can do about this)? Or just in the thumbnails being displayed?

They are worse in the thumbnails.

Tpt added a subscriber: Tpt. May 26 2019, 3:25 AM

This may be related to T184867. Does the first page of the PDF have a smaller size or resolution than the other ones?

Ratte added a comment. Edited May 26 2019, 9:32 AM

Does the first page of the PDF have a smaller size or resolution than the other ones?

Yes.
This PDF: the first page is 347x524 (22.4 kB), the others are about 377x529 (42.3 kB)
This PDF: the first page is 163x253 (2.49 kB), the others are about 166x255 (3.90 kB)
The title page is usually smaller because it contains less text.

PS. It is not only that the resolution is lower in ProofreadPage; the text itself also becomes fuzzier than on Commons.

Ankry added a subscriber: Ankry. Edited May 26 2019, 10:15 AM

High resolution thumbnails from the file, like:
https://upload.wikimedia.org/wikipedia/commons/thumb/f/f0/%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf/page7-1834px-%D0%9F%D1%83%D1%88%D0%BA%D0%B8%D0%BD._%D0%95%D0%B2%D0%B3%D0%B5%D0%BD%D0%B8%D0%B9_%D0%9E%D0%BD%D0%B5%D0%B3%D0%B8%D0%BD_(1837).pdf.jpg
look poor and exhibit artifacts, likely resulting from scaling up a JPEG image with lossy compression.

By contrast, PBM images extracted from this PDF file with the pdfimages program from the xpdf package, at the same resolution:

$ pdfimages -f 7 -l 7 Пушкин._Евгений_Онегин_\(1837\).pdf x
$ identify x-000.pbm
x-000.pbm PBM 1834x2829 1834x2829+0+0 1-bit Bilevel Gray 651KB 0.010u 0:00.010

are much higher quality.
It seems that, for some reason, the software used here does not extract images from the PDF at their maximum quality. This is unrelated to the first page quality.
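
The size comparison above can be checked mechanically. The following is only a minimal sketch: it assumes thumbnail URLs carry the page number and requested width in a `pageN-WIDTHpx-` segment (as in the URL above) and that `identify` prints the geometry as `WIDTHxHEIGHT` (as in the output above). The helper names `thumb_request`, `identify_size` and `scale_factor` are hypothetical and not part of any MediaWiki or ImageMagick API.

```python
import re


def thumb_request(url):
    """Return (page, width_px) parsed from a '.../pageN-WIDTHpx-...' thumbnail URL."""
    m = re.search(r"/page(\d+)-(\d+)px-", url)
    if m is None:
        raise ValueError("URL does not look like a paged thumbnail URL")
    return int(m.group(1)), int(m.group(2))


def identify_size(line):
    """Return (width, height) from the first 'WxH' token in an `identify` output line."""
    m = re.search(r"\b(\d+)x(\d+)\b", line)
    if m is None:
        raise ValueError("no WIDTHxHEIGHT geometry found")
    return int(m.group(1)), int(m.group(2))


def scale_factor(thumb_width, native_width):
    """Ratio of requested thumbnail width to the width of the embedded image."""
    return thumb_width / native_width
```

With the values from this comment (a 1834px thumbnail of page 7, and an embedded image of 1834x2829), `scale_factor(1834, 1834)` is 1.0, i.e. the thumbnail is requested at the width of the embedded image, so the embedded image already contains enough pixels for that thumbnail.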

Xover added a subscriber: Xover. May 26 2019, 12:59 PM