Investigate alternatives to ghostscript for PDF thumbnailing
Open, Needs Triage, Public

Assigned To

None

Authored By

	hnowlan
	Jun 19 2023, 10:31 AM

Description

We currently use Ghostscript for PDF thumbnailing on Thumbor. It drives a large percentage of our requests and is quite resource-intensive. It has [[ T337649#8944872 | been mentioned ]] that there are other engines with potentially large performance improvements. Most comparisons I can find between different rendering tools are fairly old, so we might need to do our own comparison.

Doing this work also involves developing performance benchmarking tooling for Thumbor more generally, which might need to be its own task.

Related Objects

Mentioned In: T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad
Mentioned Here: T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad

Event Timeline

hnowlan created this task. Jun 19 2023, 10:31 AM

Restricted Application added a subscriber: Aklapper. Jun 19 2023, 10:31 AM

hnowlan mentioned this in T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad. Jun 19 2023, 10:33 AM

RhinosF1 subscribed. Jun 19 2023, 10:35 AM

Xover subscribed. Jun 19 2023, 11:21 AM

Ladsgroup subscribed. Jun 19 2023, 11:46 AM

Some highly informal testing using a bone-stupid harness (time()+system()+time(), essentially) on my 16" MacBook Pro (2021, so M1 Max), running sequentially through all 2052 pages of the 366.3 MB PDF from IA: latindictionaryf00andr. I used the gs options extracted from PdfHandler.php and superficially equivalent ones picked from pdftocairo's man page. The one exception was $wgPdfHandlerDpi: I couldn't figure out where its value gets set, so I just picked 300 DPI more or less at random. I have not really compared the output, except to observe that pdftocairo seems to produce distinctly smaller files at the same resolution (probably due to different JPEG compression settings). Times are in fractional seconds.

Util	Min.	Avg.	Max.
ghostscript	0.469	0.755	1.054
pdftocairo	0.458	0.509	1.641

That's 33% less time with pdftocairo on average. Nowhere near the "10x" claimed.
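For anyone wanting to repeat this, a minimal sketch of that kind of time()+system()+time() harness in Python follows. It is not the harness actually used above; the function names are made up for illustration, and the command to time is passed in by the caller, so no gs or pdftocairo options are assumed here.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Return the wall-clock seconds taken to run argv to completion."""
    start = time.perf_counter()
    # Output is discarded; check=True raises if the command fails.
    subprocess.run(
        argv,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start


def summarize(samples):
    """Return (min, mean, max) for a non-empty list of timings."""
    return min(samples), sum(samples) / len(samples), max(samples)


def percent_less(baseline, candidate):
    """Return how much less time candidate takes than baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0
```

Calling `time_command()` once per page with the renderer's argument list and feeding the results to `summarize()` gives the three columns in the table. With the averages reported above, `percent_less(0.755, 0.509)` comes out at roughly 32.6, which is where the "33%" figure comes from.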

However, this probably isn't very representative of performance in the context of Thumbor + JobRunner + Swift. My guess is that a lot of that time is spent on I/O, where a fast SSD soldered onto a unified memory architecture like Apple Silicon is going to have a ridiculous advantage over a networked filesystem on shared hosting. I would also guess that in the real world, waiting for CPU cycles and shuffling bytes between disk buffers and wherever the CPU can get at them is going to be an orders-of-magnitude bigger factor.

In other words, I think it's entirely possible that pdftocairo could show a significant improvement in that environment even though none is visible in my testing above.

Another factor to consider is that pdftocairo can resize the generated image itself, so rendering and resizing could be done in a single operation, avoiding at least one fork() and the need to stream the data to ImageMagick. Not sure how that would fit architecture-wise, but performance-wise it should have at least measurably better characteristics.
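As a rough illustration of the single-operation idea, the sketch below builds a pdftocairo command line that renders one page straight to a JPEG of a given width. The helper name and its parameters are hypothetical; the flags (`-jpeg`, `-singlefile`, `-f`, `-l`, `-scale-to-x`, `-scale-to-y`) are taken from the pdftocairo man page and should be checked against the installed Poppler version. This says nothing about how Thumbor would actually wire it in.

```python
import subprocess


def pdftocairo_thumbnail_argv(pdf_path, page, width_px, output_root):
    """Build an argv that renders one PDF page as a JPEG scaled to width_px.

    output_root is the output file name without extension; with -singlefile,
    pdftocairo appends the extension for the chosen image format.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if width_px < 1:
        raise ValueError("width_px must be positive")
    return [
        "pdftocairo",
        "-jpeg",
        "-singlefile",
        "-f", str(page),               # first page to render
        "-l", str(page),               # last page to render (same page)
        "-scale-to-x", str(width_px),  # target width in pixels
        "-scale-to-y", "-1",           # -1: derive height from aspect ratio
        pdf_path,
        output_root,
    ]


def render_thumbnail(pdf_path, page, width_px, output_root):
    """Run pdftocairo once; rendering and scaling happen in the same process."""
    subprocess.run(
        pdftocairo_thumbnail_argv(pdf_path, page, width_px, output_root),
        check=True,
    )
```

Whether the scaling quality is acceptable compared to a separate ImageMagick resize step would still need to be checked by eye.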

MatthewVernon subscribed. Aug 7 2023, 8:15 AM

Arcorann subscribed. May 20 2024, 6:19 AM

While we're here, can we also implement something that doesn't have so many image issues when thumbnailing PDFs? I have run into yet another issue when proofreading for Wikisource, in which the text somehow just fails to render on the image (https://commons.wikimedia.org/w/index.php?title=File%3AThe_sayings_of_Confucius%3B_a_new_translation_of_the_greater_part_of_the_Confucian_analects_(IA_sayingsofconfuci00confiala).pdf&page=28). I then went to the original PDF hosted on Commons and could easily read off the text from there. I have had to do the same for a number of other works due to blurring of text, so I really think we can do a lot better than what we have now.
