
Investigate alternatives to ghostscript for PDF thumbnailing
Open, Needs Triage, Public

Description

We currently use Ghostscript for PDF thumbnailing in Thumbor. It drives a large percentage of our requests and is quite resource-intensive. It has [[ T337649#8944872 | been mentioned ]] that there are other engines with potentially large performance improvements. Most comparisons I can find between different rendering tools are fairly old, so we might need to do our own comparison.

Doing this work also involves us developing performance benchmarking tooling for Thumbor more generally, which might need to be its own task.

Event Timeline

Some highly informal testing using a bone-stupid harness (time()+system()+time(), essentially) on my 16" MacBook Pro (2021, so M1 Max), running sequentially through all 2052 pages of the 366.3MB PDF from IA: latindictionaryf00andr. I used gs options extracted from PdfHandler.php and superficially equivalent ones picked from pdftocairo's man page. The exception was $wgPdfHandlerDpi: I couldn't figure out where its value gets set, so I just picked 300 DPI more or less at random. I have not really compared the output, except to observe that pdftocairo seems to produce distinctly smaller file sizes for the same resolution (probably different JPEG compression settings). Times are in fractional seconds per page.

| Utility | Min. (s) | Avg. (s) | Max. (s) |
| --- | --- | --- | --- |
| ghostscript | 0.469 | 0.755 | 1.054 |
| pdftocairo | 0.458 | 0.509 | 1.641 |

That's 33% less time with pdftocairo on average. Nowhere near the "10x" claimed.
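For anyone wanting to repeat this, here is a minimal sketch of the kind of harness described above: wall-clock time around one subprocess per page, then min/avg/max. The function names are made up for illustration, and the gs flags are a guess at a typical single-page JPEG render, not the exact option list from PdfHandler.php.

```python
import subprocess
import time
from statistics import mean


def gs_cmd(pdf, page, out, dpi=300):
    """Build a Ghostscript command that renders one page to a JPEG file.

    Assumption: these flags approximate a typical render; they are not
    copied from PdfHandler.php.
    """
    return ["gs", "-sDEVICE=jpeg", f"-sOutputFile={out}",
            f"-dFirstPage={page}", f"-dLastPage={page}",
            f"-r{dpi}", "-dBATCH", "-dNOPAUSE", "-q", pdf]


def pdftocairo_cmd(pdf, page, out_prefix, dpi=300):
    """Build a roughly equivalent pdftocairo command.

    With -singlefile the output is written to <out_prefix>.jpg.
    """
    return ["pdftocairo", "-jpeg", "-singlefile",
            "-f", str(page), "-l", str(page),
            "-r", str(dpi), pdf, out_prefix]


def time_command(cmd, run=subprocess.run):
    """Return the wall-clock seconds taken to run cmd to completion."""
    start = time.perf_counter()
    run(cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def summarize(timings):
    """Return (min, avg, max) for a list of per-page timings."""
    return min(timings), mean(timings), max(timings)


def relative_saving(baseline_avg, candidate_avg):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline_avg - candidate_avg) / baseline_avg


def benchmark(build_cmd, pdf, pages, out):
    """Time build_cmd(pdf, page, out) sequentially for every page."""
    return summarize([time_command(build_cmd(pdf, p, out)) for p in pages])
```

Something like `benchmark(gs_cmd, "book.pdf", range(1, 2053), "/tmp/page.jpg")` would reproduce the sequential run (the file names here are placeholders). With the averages from the table, `relative_saving(0.755, 0.509)` comes out at roughly 0.33, which is where the 33% figure comes from.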

However, this probably isn't very representative of performance in the context of Thumbor + JobRunner + Swift. My guess is that a lot of that time is spent on I/O, where a fast SSD soldered onto a unified memory architecture like Apple Silicon is going to have a ridiculous advantage over a networked filesystem on shared hosting. I would also guess that in the real world, waiting for CPU cycles and shuffling bytes around between disk buffers and where the CPU can get at them is going to be an orders-of-magnitude bigger factor.

In other words, I think it's entirely possible that it could show a significant improvement in that environment even though such is not visible in my testing above.

Another factor to consider is that pdftocairo can resize the generated image, so it would be possible to render and resize in a single operation, avoiding at least one fork() and the need to stream the data to ImageMagick. Not sure how that would fit architecture-wise, but performance-wise it should have at least measurably better characteristics.
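As a rough illustration of the single-operation idea, the sketch below builds one pdftocairo invocation that renders a page and scales it to a target width in the same process. `thumbnail_cmd` and `render_thumbnail` are hypothetical helper names, and how this would plug into Thumbor is not addressed here; the `-scale-to-x` / `-scale-to-y` options are taken from pdftocairo's man page.

```python
import subprocess


def thumbnail_cmd(pdf, page, width_px, out_prefix):
    """Build a pdftocairo command that renders and resizes in one step.

    -scale-to-x sets the output width in pixels; -scale-to-y -1 lets the
    height follow the page's aspect ratio. No separate ImageMagick
    process is involved.
    """
    return ["pdftocairo", "-jpeg", "-singlefile",
            "-f", str(page), "-l", str(page),
            "-scale-to-x", str(width_px), "-scale-to-y", "-1",
            pdf, out_prefix]


def render_thumbnail(pdf, page, width_px, out_prefix):
    """Run the command and return the path of the JPEG it writes."""
    subprocess.run(thumbnail_cmd(pdf, page, width_px, out_prefix), check=True)
    return out_prefix + ".jpg"
```

Whether the resulting quality matches a full-resolution render followed by a downscale is something that would need checking against real files.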

While we're here, can we also implement something that doesn't have so many image issues when thumbnailing PDFs? I ran into yet another issue when proofreading for Wikisource, in which the text somehow just fails to render on the image (https://commons.wikimedia.org/w/index.php?title=File%3AThe_sayings_of_Confucius%3B_a_new_translation_of_the_greater_part_of_the_Confucian_analects_(IA_sayingsofconfuci00confiala).pdf&page=28). I then went to the original PDF hosted on Commons and could easily read the text off from there. I have had to do the same for a number of other works due to blurring of text. I really think we can do a lot better than what we have now.