Allow PDFs to be rendered at a higher (or user-specified) DPI
Open, Stalled, Medium, Public, Feature

Description

What is the problem you are wanting to solve?:

In T256848#6272386 it was noted that page scan images from a PDF had improved quality when rendered at a higher DPI.
(I am therefore thinking that the original scans are at a higher DPI, such as 300 or 600 dpi (typical in imaging and print applications), versus 150 or 96 dpi (typical for screen displays).)

There is currently no way to specify the DPI to be used from the File: page. External tools (like Ghostscript) that render the initial output, which Thumbor later uses to generate scaled images, therefore have no means of getting this information directly from uploaders.

What's the feature you want:-

*The ability to specify a DPI value and have that value taken into account during the generation of thumbnailed or scaled images based on page content in a PDF file.

  • Implement a mechanism that allows Thumbor to retrieve the dpi value, if present as a URL parameter, and use it to render the image via Ghostscript (for example: /War_and_Peace.djvu/page1-1536px-150dpi-War_and_Peace.djvu.jpg should/could render the image with a size of 1536px and a dpi of 150). (This step is identical to what @Vlad.shapik is implementing.)
  • Write code to teach mediawiki-core's DjvuHandler (and also the PdfHandler extension) that the dpi value exists and that it can be used while fetching images.
  • Have ProofreadPage read the dpi value from somewhere and then pass it to MediaWiki core (my instinct would be to use an Index: page parameter for storage, since that is how ProofreadPage already allows users to set the resolution of the image).
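
The first step above can be illustrated with a minimal Python sketch of how a dpi segment might be pulled out of a thumbnail name. This is only a sketch under stated assumptions: the pattern `page<N>-<W>px-<D>dpi-<name>` is taken from the example URL in the list, while the function name, the `ThumbParams` type and the regular expression are hypothetical and are not the actual Thumbor plugin code.

```python
import re
from typing import NamedTuple, Optional


class ThumbParams(NamedTuple):
    page: int
    width: int
    dpi: Optional[int]  # None when the URL carries no dpi segment


# Hypothetical pattern modelled on the example
# "page1-1536px-150dpi-War_and_Peace.djvu.jpg"; the dpi segment is optional.
_THUMB_RE = re.compile(
    r"^page(?P<page>\d+)-(?P<width>\d+)px-(?:(?P<dpi>\d+)dpi-)?(?P<name>.+)$"
)


def parse_thumb_name(thumb_name: str) -> Optional[ThumbParams]:
    """Return page, width and optional dpi, or None if the name does not match."""
    match = _THUMB_RE.match(thumb_name)
    if match is None:
        return None
    dpi = match.group("dpi")
    return ThumbParams(
        page=int(match.group("page")),
        width=int(match.group("width")),
        dpi=int(dpi) if dpi is not None else None,
    )


if __name__ == "__main__":
    print(parse_thumb_name("page1-1536px-150dpi-War_and_Peace.djvu.jpg"))
    print(parse_thumb_name("page1-1536px-War_and_Peace.djvu.jpg"))
```

One caveat with this naming scheme: a file whose own name starts with something like "150dpi-" would be ambiguous, so a real implementation would need a rule for that case.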

Event Timeline

(I am therefore thinking that the original scans are at a higher DPI, such as 300 or 600 dpi (typical in imaging and print applications), versus 150 or 96 dpi (typical for screen displays).)

This is correct: typically an Internet Archive PDF is imaged at 500ppi (though 600 and 800 can happen too, and produce correspondingly worse results).

A quick and dirty comparison of a 150dpi and a 300dpi rendering of this file (p9) shows, as expected, that the image quality at 300dpi is substantially better:

https://commons.wikimedia.org/wiki/File:Catalog_of_Title_Entries_of_Books_Etc._July_1-July_11_1891_1,_Nos._1-26_(IA_catalogoftitleen11118libr).pdf

Note that the file itself is 500 dpi, so even 300dpi is an undersample. Images on the page:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   9     0 image    1098  1497  rgb     3   8  jpx    no        38  0   167   167 7190B 0.1%
   9     1 image    3292  4490  rgb     3   8  jpx    no        39  0   500   500 53.1K 0.1%
   9     2 mask     3292  4490  -       1   1  jpx    no        39  0   500   500 53.1K 2.9%

300DPI.png (341×901 px, 169 KB)

Very unscientific timings indicate that doubling the DPI (=4× pixels) doesn't affect the time much:

$ time gs -sDEVICE=jpeg -sOutputFile=- -sstdout=%stderr -dFirstPage=9 -dLastPage=9 -dSAFER -r150 -dBATCH -dNOPAUSE -q file.pdf > /dev/null

real	0m0.580s
user	0m0.505s
sys	0m0.073s
$ time gs -sDEVICE=jpeg -sOutputFile=- -sstdout=%stderr -dFirstPage=9 -dLastPage=9 -dSAFER -r300 -dBATCH -dNOPAUSE -q file.pdf > /dev/null

real	0m0.655s
user	0m0.588s
sys	0m0.064s
  • real: 13% increase
  • user: 16% increase
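
For reference, the relative increases can be recomputed from the two timing runs with a few lines of Python. This is just the arithmetic on the figures quoted above; the helper name is made up for illustration.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before * 100.0


# (150 dpi, 300 dpi) timings in seconds, copied from the runs above.
timings = {"real": (0.580, 0.655), "user": (0.505, 0.588)}

for name, (t150, t300) in timings.items():
    print(f"{name}: {percent_increase(t150, t300):.1f}% increase")
```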

Change 757892 had a related patch set uploaded (by Inductiveload; author: Inductiveload):

[operations/mediawiki-config@master] Wikisource: Increase PDF rendering resolution to 300 dpi

https://gerrit.wikimedia.org/r/757892

Paraphrasing/importing a Gerrit comment here for clarity: this is actually done by Thumbor on Wikimedia wikis, not by MediaWiki-extensions-PdfHandler

Two main issues:

Change 758053 had a related patch set uploaded (by Inductiveload; author: Inductiveload):

[operations/debs/python-thumbor-wikimedia@master] Make DPI configurable for Ghostscript

https://gerrit.wikimedia.org/r/758053

Change 757892 abandoned by Inductiveload:

[operations/mediawiki-config@master] Wikisource: Increase PDF rendering resolution to 300 dpi

Reason:

This is a Thumbor thing: start of that is https://gerrit.wikimedia.org/r/758053

https://gerrit.wikimedia.org/r/757892

From a query I made on a Ghostscript-related Discord, I got some advice that MuPDF/mutool could read certain internal tables in a PDF.

These are the MediaBox (from something identified as a "page dictionary") and the 'sample size' for images. Using an appropriate calculation, these could be used to determine an 'effective' resolution for a full-page scan image. (This would of course only work for full-page 'raster' image-based page scans.)

Would it be feasible for Thumbor (or whatever related components) to calculate an "effective" resolution, in order to make an intelligent choice about the 'nominal' or rendering resolution when providing thumbnails or images for Wikisource-related purposes, such as proofreading and OCR?

Further to the above (I asked some more questions at the relevant discord):-

Mutool/MuPDF can produce an XML file representing the layout of the PDF

mutool trace <filename>

(see https://mupdf.com/docs/manual-mutool-trace.html)

This gives the MediaBox for an entire page as a Mediabox attribute of a page tag, the units being points.

Individual images (such as page scans) are typically in a fillimage tag (as a child of a page), which can contain a Transform attribute that records the position on the page (in points). Using the Transform attribute along with the width and height attributes, the 'effective' resolution can be calculated as:

x-res=width/((abs(Transform[3rdvalue]-Transform[1stvalue]))/72)
y-res=height/((abs(Transform[4thvalue]-Transform[2ndvalue]))/72)

In most instances x-res and y-res would be the same, giving an effective resolution.
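
As a rough illustration, the calculation above could be written in Python as follows. This sketch applies the formula exactly as given in this comment and assumes the Transform values have already been parsed into a list of numbers; the function name is hypothetical. The sample values are invented, chosen to be consistent with the 3292×4490 image at 500 ppi listed earlier in the thread.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0


def effective_resolution(
    width_px: int, height_px: int, transform: Sequence[float]
) -> Tuple[float, float]:
    """Apply the x-res / y-res formula from the comment above.

    `transform` holds the numeric values of the Transform attribute;
    only the first four values are used, as in the formula.
    """
    if len(transform) < 4:
        raise ValueError("expected at least four Transform values")
    width_pt = abs(transform[2] - transform[0])
    height_pt = abs(transform[3] - transform[1])
    if width_pt == 0 or height_pt == 0:
        raise ValueError("image has zero extent on the page")
    x_res = width_px / (width_pt / POINTS_PER_INCH)
    y_res = height_px / (height_pt / POINTS_PER_INCH)
    return x_res, y_res


if __name__ == "__main__":
    # Invented sample: a full-page scan placed at 474.048 x 646.56 points.
    sample_transform = [474.048, 0.0, 0.0, 646.56, 0.0, 0.0]
    print(effective_resolution(3292, 4490, sample_transform))
```

Because the formula only uses differences of the first four values, it would not give a meaningful answer for rotated or skewed images, which matches the earlier remark that this only works for plain full-page scans.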

Change 853402 had a related patch set uploaded (by Vlad.shapik; author: Vlad.shapik):

[operations/software/thumbor-plugins@master] Add ability to specify a DPI value for PDF

https://gerrit.wikimedia.org/r/853402

Hello @ShakespeareFan00
I am working on this ticket now. Could you please share with me the location of a 'property' for the File:'s entry in the image links table which you mentioned as a possible way to implement the DPI option in the FE?
And also will you be able to describe the idea of a 'modifier' template for use on sites like Wikimedia Commons and Wikisource?
For now, I just can't understand the connection between these two steps because I do not know how the DPI option should look on these sites. Maybe you could give me some examples of already implemented options on these sites or something like that?
I just need to know whether the front-end team (for now I don't know who will implement it in the FE) can use my BE solution to implement this configurator on the file page, for instance. That's why I am asking.

@Vlad.shapik How are we supposed to use this parameter on the MediaWiki side of things? I can take a look at integrating it with ProofreadPage frontend/backend via the Index: page as suggested in the description, not sure if the File: option is feasible.

The intent of my request was that there would be a 'field' in a relevant table (image links) that could be used to set the higher dpi supplied to an external tool like Ghostscript.

Clearly the value for this dpi needs to be configurable (the intent of your backend change).

The DPI value will need to be stored with other image meta-data somehow, hence the suggestion of storing it as a property or field.

From a UI perspective, there would need to be a 'DPI' dropdown box on file/media pages with an extension of .PDF. This dropdown box would set the relevant field or property in the relevant back-end tables.

I'm not entirely sure where the image/media meta-data is stored, so others here might be better to ask about that specific technical detail.

The team that maintains the software that would be used for the front end is https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia . The Wikidata team at Wikimedia Germany, led by Lydia Pintscher, can also answer your technical questions. In the following text I am going to explain the basics.

The "modifier template" is just a way the author of the bug decided to say that if the structured data says "DPI": "300", then that resolution is used rather than the default 150dpi.

What is being asked for the front end is similar to what has been done in the Wikidata Page Banner Extension. That extension gets a filename from structured data and returns it to a wiki page. Codebase at https://phabricator.wikimedia.org/diffusion/EWDP/browse/master/

The front-end of this task is basically a matter of querying the structured data for the DPI value that should be used. The data is stored in a json file, but there are functions to get the information you need, like the ones the Page Banner extension uses.

There is specific terminology for structured data. First, the page and its json file are called an item; second, the structured data in an item are called statements (a simplification, but good enough for this purpose); and finally, each statement has a pair of a property and a value. In this case, the property is "DPI" and the value is some number, like 300. Note that the Wikidata Page Banner Extension uses wikidata.org items, while the front-end in this bug would use items on commons.wikimedia.org. On Commons the items start with "M" followed by a number; on Wikidata they start with "Q" followed by a number. More at https://doc.wikimedia.org/Wikibase/master/php/docs_topics_json.html

As for whether the underlying stuff is there: the software side is, but the data is not. There is no "DPI" property yet, but users can create that easily enough. I suppose you would need to know how to create that property on beta commons for testing purposes.
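
To make the item / statement / property / value terminology concrete, here is a small Python sketch that reads a DPI value from an item's JSON and falls back to the default of 150 mentioned above. It is only an illustration under assumptions: no "DPI" property exists yet, so the property ID `P99999` is a placeholder, the function name is made up, and the JSON layout follows my reading of the Wikibase JSON documentation linked above (statements keyed by property ID, quantity values with an "amount" string).

```python
from typing import Any, Dict

DEFAULT_DPI = 150          # default rendering resolution mentioned in this thread
DPI_PROPERTY = "P99999"    # placeholder: no real "DPI" property exists yet


def dpi_from_item(
    item: Dict[str, Any], prop: str = DPI_PROPERTY, default: int = DEFAULT_DPI
) -> int:
    """Return the first usable DPI value from an item's statements, else `default`."""
    statements = item.get("statements")
    if not isinstance(statements, dict):
        # Missing statements, or an empty list instead of a mapping.
        return default
    for statement in statements.get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        value = snak.get("datavalue", {}).get("value")
        try:
            # Quantity amounts are strings such as "+300".
            return int(float(value["amount"]))
        except (KeyError, TypeError, ValueError):
            continue
    return default


if __name__ == "__main__":
    sample_item = {
        "type": "mediainfo",
        "id": "M1",  # made-up ID for illustration
        "statements": {
            DPI_PROPERTY: [
                {
                    "mainsnak": {
                        "snaktype": "value",
                        "property": DPI_PROPERTY,
                        "datavalue": {
                            "type": "quantity",
                            "value": {"amount": "+300", "unit": "1"},
                        },
                    }
                }
            ]
        },
    }
    print(dpi_from_item(sample_item))
    print(dpi_from_item({"type": "mediainfo", "id": "M2", "statements": []}))
```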

I suppose that covers the basics.

@Snaevar thank you for your idea.
As I understand from your message, after getting the DPI value from structured data on the front-end side, we will need to pass it to Thumbor via a URL parameter and process the PDF file with the desired DPI value.
What do you think about it?
That seems correct, because it is how Thumbor already handles selecting a specific page of a PDF file, changing the language of the text for SVG files, etc.

The DPI value will need to be stored with other image meta-data somehow, hence the suggestion of storing it as a property or field.

@ShakespeareFan00 thanks. I think it will be useful for the team that will work on this part of the task. I believe we can use structured data as it was mentioned.

The concern I have is that there would need to be a way for Wikisources to read the DPI value. I'm not sure if structured data would do that, as I wasn't sure it was possible for one wiki to read project data from Commons directly.

That is T238798. Wikidata does have that kind of access, so if you want the DPI value to be stored there because of that, then https://www.wikidata.org/wiki/Wikidata:Notability indicates you need to discuss that on wikidata.org.

Why can the DPI value for a given PDF or DjVu not be stored in the database directly, as I indicated in my original suggestion (i.e. in the image links table)?

The front-end at Wikisource can then just query the relevant field in the image links table, to supply to Thumbor. (As a fallback most likely, given that with the OSD changes, there was a suggestion elsewhere to support direct access to scanned images on reputable external sites like Hathi and Internet Archive.)

Oh, that is fine by me, I do not mind either way. The https://www.mediawiki.org/wiki/Manual:Imagelinks_table table is the responsibility of https://phabricator.wikimedia.org/tag/dba/ . Guess we wait for the check on how that can be done on wikisource then (there is a comment above in this ticket on that).

I don't think this should be implemented in this particular way, for the following reasons:

  • Based on my understanding, the imagelinks table is a fairly widely used SQL table across multiple extensions in MediaWiki. Adding a new field is going to heavily impact performance and storage for that table and will require updates across the whole codebase.
  • This is a field that is only useful to a small subset of books on Wikisource. It is not relevant for the rendering of the vast majority of "imagelinks" across all of MediaWiki and thus doesn't really fit the use case of the table.
  • ProofreadPage does not even use the imagelinks table to render images on Page: pages (which is where you are expecting this to be used). ProofreadPage dynamically adds the image while building the content based on the associated Index: page.

@Snaevar @Vlad.shapik I don't think this implementation makes sense. The context of this task is to enable hi-resolution (hi-DPI) rendering on Wikisources. By storing this data in the Structured Data associated with a particular file, this data becomes inaccessible to the Wikisource/ProofreadPage extension, which is actually doing the work of deciding which image and which page (and at what resolution) to fetch.

I personally think this is the way we should implement this task:

  • Implement a mechanism that allows Thumbor to retrieve the dpi value, if present as a URL parameter, and use it to render the image via Ghostscript (for example: /War_and_Peace.djvu/page1-1536px-150dpi-War_and_Peace.djvu.jpg should/could render the image with a size of 1536px and a dpi of 150). (This step is identical to what @Vlad.shapik is implementing.)
  • Write code to teach mediawiki-core's DjvuHandler (and also the PdfHandler extension) that the dpi value exists and that it can be used while fetching images.
  • Have ProofreadPage read the dpi value from somewhere and then pass it to MediaWiki core (my instinct would be to use an Index: page parameter for storage, since that is how ProofreadPage already allows users to set the resolution of the image).

Change 853402 had a related patch set uploaded (by Vlad.shapik; author: Vlad.shapik):

[operations/software/thumbor-plugins@master] Add ability to specify a DPI value for PDF

https://gerrit.wikimedia.org/r/853402

If the DPI value needs to be specified for the PDF format,
the GHOSTSCRIPT_ENGINE_DEFAULT_DPI config parameter makes that quick to do.
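
A minimal sketch of what a configurable DPI could look like when building the Ghostscript call is shown below. The flags are the ones used in the timing test earlier in this thread, and 150 is the default mentioned above; the function name and the way the value is passed in are assumptions for illustration, not the actual thumbor-plugins code.

```python
from typing import List

# Assumed default, mirroring the 150 dpi mentioned in this thread; in the real
# plugin this would presumably come from GHOSTSCRIPT_ENGINE_DEFAULT_DPI.
DEFAULT_DPI = 150


def build_gs_command(pdf_path: str, page: int, dpi: int = DEFAULT_DPI) -> List[str]:
    """Build a Ghostscript argument list that renders one page to JPEG on stdout."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return [
        "gs",
        "-sDEVICE=jpeg",
        "-sOutputFile=-",      # write the image to stdout
        "-sstdout=%stderr",    # keep Ghostscript messages out of the image stream
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        "-dSAFER",
        f"-r{dpi}",            # rendering resolution
        "-dBATCH",
        "-dNOPAUSE",
        "-q",
        pdf_path,
    ]


if __name__ == "__main__":
    print(" ".join(build_gs_command("file.pdf", page=9, dpi=300)))
```

Passing the command as a list (for example to `subprocess.run`) avoids shell quoting problems with file names that contain spaces or parentheses.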

I personally think this is the way we should implement this task:

  • Implement a mechanism that allows Thumbor to retrieve the dpi value, if present as a URL parameter, and use it to render the image via Ghostscript (for example: /War_and_Peace.djvu/page1-1536px-150dpi-War_and_Peace.djvu.jpg should/could render the image with a size of 1536px and a dpi of 150). (This step is identical to what @Vlad.shapik is implementing.)
  • Write code to teach mediawiki-core's DjvuHandler (and also the PdfHandler extension) that the dpi value exists and that it can be used while fetching images.
  • Have ProofreadPage read the dpi value from somewhere and then pass it to MediaWiki core (my instinct would be to use an Index: page parameter for storage, since that is how ProofreadPage already allows users to set the resolution of the image).

Fine by me. I agree that it is harder to get from wikisource to wikidata than having the setting on wikisource itself, and wikisource is the main user of PDFs anyway. Seems like the original poster/creator of the bug finds it ok by not opposing, so, copying this to description.

The ticket was moved to Blocked on the Thumbor Migration board since the ticket requires decisions from the WMF management/engineering staff related to MediaWiki core, Structured Data, etc.

FJoseph-WMF edited subscribers, added: Atieno; removed: Hokwelum.