Options for browser-based server-side PDF generation
Closed, Resolved, Public

Description

We have been discussing simple browser-based options for PDF generation since the planning phase of the OCG project. This task is intended to collect options & issues.

Why browser-based PDF generation?

Browser-based PDF generation is attractive as an alternative to the LaTeX-based OCG for several reasons:

  • Complete content coverage: Browsers are able to render HTML / CSS / JS content natively. This includes features like maps (T70008) and tables.
  • Render performance: Complex pages can generally be rendered in a few seconds, which avoids the need for custom queueing and caching solutions.
  • Simplicity of a stateless service: DC fail-over and operations are significantly simplified with a stateless service.
  • Single print solution for server-side & client-side: Print stylesheet improvements benefit both client-side printing (Ctrl-P in the browser, or the printable page) and the server-side equivalent.

The main downside compared to OCG is a generally less beautiful print layout, reflecting the still-spotty print CSS support in major browsers.

Why provide a server-side PDF render option?

While all desktop browsers, some mobile browsers (Firefox) and several third-party web services support print-to-PDF functionality, there might still be some value in providing a simple, reliable & ad-free solution linked from the Wikipedia UI itself. While the benefit for desktop browsers is arguably small (File > Print even avoids a re-download of the data), most mobile browsers do not support printing to PDF. Mobile is dominant in developing countries, which often have limited connectivity. Users there can benefit especially from a prominent & easy-to-use PDF download feature as a way to deal with poor connectivity. That said, given the relatively rich list of alternatives, a case can be made that we should limit the resources we invest in providing an alternative PDF render feature.

Option: PhantomJS

PhantomJS 2 has been updated to a fairly recent Blink / Chrome version. It is available in Debian & other distros. PDFs can be easily created using an invocation like this:

phantomjs /usr/share/doc/phantomjs/examples/rasterize.js 'http://en.wikipedia.org/wiki/Barack_Obama' /dev/stdout a4 > /tmp/obama.pdf

Latency is low, at about 2.3 s for the Barack Obama article on a laptop / cable connection. With performance in the single-digit seconds, there is no need for custom caching or state in the service, which greatly simplifies the infrastructure.
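
As a rough illustration of how a stateless wrapper around that invocation might look, here is a minimal Python sketch. The rasterize.js path and arguments are taken from the command above; the function names and the 30-second timeout are hypothetical choices, not part of any existing service.

```python
import subprocess

# Path of the example script as used in the invocation above.
RASTERIZE_JS = "/usr/share/doc/phantomjs/examples/rasterize.js"


def phantomjs_command(url, paper="a4"):
    """Build the argv for the rasterize.js invocation (PDF goes to stdout)."""
    return ["phantomjs", RASTERIZE_JS, url, "/dev/stdout", paper]


def render_pdf(url, out_path, timeout=30):
    """Run PhantomJS and write its stdout to out_path.

    The timeout is an assumed safety limit; check=True raises if
    PhantomJS exits with a non-zero status.
    """
    with open(out_path, "wb") as out:
        subprocess.run(
            phantomjs_command(url), stdout=out, check=True, timeout=timeout
        )
```

Calling render_pdf('http://en.wikipedia.org/wiki/Barack_Obama', '/tmp/obama.pdf') would then be equivalent to the shell redirect, with no state kept between requests.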

Issues with PhantomJS 2

  • Links: Due to a limitation in the Qt5 PDF rendering backend, links are currently not clickable in the generated PDF. A fix has landed in Qt5, so this issue should be resolved soon.
  • Page breaks: Chrome, and by extension PhantomJS 2, has relatively poor support for controlling page breaks using CSS (break-before, break-after, etc.). The most significant manifestation of this is that headings are often placed at the end of a preceding page, while the corresponding content follows on the next page. While ugly, this does not prevent access to the content.
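
For reference, the kind of print rule that the page-break issue refers to can be sketched as follows. This is only an illustration: the helper name is hypothetical, the plain string search for a lowercase </head> tag is a simplification, and given the limited engine support described above the rule may have no visible effect.

```python
# Print-only rule asking the engine not to break a page right after a heading.
PRINT_CSS = """
@media print {
  h1, h2, h3, h4, h5, h6 {
    break-after: avoid;       /* CSS Fragmentation property */
    page-break-after: avoid;  /* older alias of the same request */
  }
}
"""


def inject_print_css(html, css=PRINT_CSS):
    """Insert a <style> block before </head>, or prepend it if there is no head."""
    style = "<style>" + css + "</style>"
    head_end = html.find("</head>")
    if head_end == -1:
        return style + html
    return html[:head_end] + style + html[head_end:]
```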

Option: wkhtmltopdf

The stable version of the wkhtmltopdf tool is based on an older WebKit version (equivalent to PhantomJS 1). Similar to PhantomJS, it has issues with Qt's hyperlink support. Some builds of wkhtmltopdf bundle a patched version of Qt, which allows wkhtmltopdf to generate (mostly) properly hyperlinked documents. Newer versions of wkhtmltopdf are expected to pick up the same Qt5 fixes as discussed for PhantomJS.

Overall, there seem to be few reasons to use wkhtmltopdf over PhantomJS or Electron at this point.

Option: Electron render service

A web service wrapping Chromium 49 (as of April 2016), running under xvfb.

Advantages

Issues

  • Page breaks: Chrome has relatively poor support for controlling page breaks using CSS (break-before, break-after, etc.). The most significant manifestation of this is that headings are often placed at the end of a preceding page, while the corresponding content follows on the next page. While ugly, this does not prevent access to the content.
  • Performance: Slightly worse than PhantomJS, but still under 10 s for most pages. Link resolution might be responsible for part of the difference.

Example output

You can try any URL by changing the URL in https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=<YOUR URL HERE>.
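
When building such a request programmatically, the target URL should be percent-encoded so that its own query string is not confused with the service's parameters. A minimal sketch, using the endpoint and parameter names from the URL above (the function name is hypothetical):

```python
from urllib.parse import urlencode

# Labs test endpoint as given above.
SERVICE = "https://pdf-electron.wmflabs.org/pdf"


def render_url(page_url, access_key="secret", service=SERVICE):
    """Return the service URL that renders page_url as a PDF."""
    query = urlencode({"accessKey": access_key, "url": page_url})
    return service + "?" + query
```

For example, render_url("https://en.wikipedia.org/wiki/Barack_Obama") yields a URL whose url parameter is fully escaped.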

Browser print improvement projects

Event Timeline

Notes on the electron setup in labs:

  • copy .fonts.conf from the electron-render-service repo to the user's home directory for proper sub-pixel font rendering.
  • Fonts:
    • ttf-mscorefonts-installer (from non-free)
    • fonts-dejavu
    • fonts-liberation
    • fonts-lohit-deva (for Hindi)
  • commandline used for testing: RENDERER_ACCESS_KEY=secret nohup xvfb-run ./node_modules/.bin/electron-render-service &
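
The same test launch could be scripted, for instance from Python. This is a sketch under assumptions: the helper names are hypothetical, and start_new_session=True is used as a rough stand-in for nohup, detaching the service from the controlling terminal.

```python
import os
import subprocess


def service_launch(access_key, base_env=None):
    """Build argv and environment for the test command line above."""
    env = dict(os.environ if base_env is None else base_env)
    env["RENDERER_ACCESS_KEY"] = access_key
    argv = ["xvfb-run", "./node_modules/.bin/electron-render-service"]
    return argv, env


def start_service(access_key):
    """Start the render service in the background and return the process handle."""
    argv, env = service_launch(access_key)
    return subprocess.Popen(argv, env=env, start_new_session=True)
```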

The Chrome team is also working on a native headless mode, which might provide yet another alternative with similar functionality in the future. One interesting feature they have is a concept of 'virtual time', which ignores timer delays if nothing else needs to be done. This might speed up some pages, but probably won't matter much for our content.

While this effort will take some time to reach feature parity with electron, it seems clear that there won't be a shortage of options for browser-based PDF generation any time soon.

Comments pasted from an email thread on this topic:

The electron service looks really nice. It seems to be an end run around a number of issues with phantomjs which have previously stood in the way of a better HTML-based render pipeline (and may be relevant to the visual diff testing Subbu has been doing with phantomjs as well).

The output is not quite as nice as our current output in some ways; for example, compare:
https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=Barack+Obama&oldid=719486829&writer=rdf2latex
to the same article rendered with electron.
But using electron here is consistent with the path forward I proposed earlier. The electron output here is basically "printable page" (but powered by Parsoid DOM, not the MediaWiki parser). Let's devote some effort to the CSS here to make this more beautiful, and replace the standard "Printable page" in the sidebar with this new, more beautiful CSS. When we've done that, we can use electron inside the OCG framework to render the printable page HTML (and the OCG framework will handle rendering multiple pages, assembling a book with cover pages, etc). That would be a solid path forward that would improve the reading experience for everyone, and reduce the amount of ad hoc rendering code in OCG.

My proposal is to *eventually* make "Download as PDF" identical to printable view and export (using one of the services mentioned here), assuming a modern browser. But at the moment:

  1. "Printable view" looks terrible. Export to PDF is multicolumn output with full justification, Indic language support, repositioned and high-DPI images, TeX rendering of formulae, proper attribution (as required by our CC license), footnotes and hyperlinks which actually work in the PDF, page numbering, etc.

Compare:
https://en.wikipedia.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=United+States&oldid=717481903&writer=rdf2latex
to:
https://en.wikipedia.org/w/index.php?title=United_States&printable=yes

  2. Printable view is no replacement for Extension:Collection, which is heavily used by wikibooks and wikisource (and all our other projects, for example, articles on medical subjects curated by WikiProject:Medicine on enwiki). This allows users to collect a number of article pages together into a book, with proper title pages, chapter headings, and a table of contents.

In the past, I lobbied to make Extension:Gather into a proper generalized replacement for some of the features of Extension:Collection, but that team decided to go a different way. There is certainly a need for a general "collection of articles" mechanism in core, which could be shared/used for watchlists, mobile favorites, bookmarks, work lists, and a number of other features as well as for the "book" feature of Extension:Collection.

It seems that the reading team is willing to devote some resources to using the latest HTML/CSS features to bring "printable view" closer to feature parity with "export to PDF" (or even surpass it) (T135022). This wouldn't replace the use of Extension:Collection, which would still be needed to create and render collections of articles, but it would simplify the actual rendering step, which would then basically just use a headless browser to render the "printable view" as PDF and then stitch the pages together. For some users this output would still be superior, even for a single page, because we'd be using the latest browser and the latest HTML/CSS features when we do the rendering, whereas they might be using an older or mobile browser which didn't fully support all the fancy page-oriented HTML/CSS features.

This would allow us to eventually remove/reduce the OCG functionality by improving "printable page" and generalizing the "collection" features. But we're not there yet.

I saw the demo at Wikimania Hackathon Showcase 2016. In case this is useful, may I point to http://athenapdf.com/, an Electron-based MIT-licensed service to convert to PDF?

@JeanFred, thanks for the pointer! Functionally, the athena service is very similar to the node-based electron render service. Since both let electron do all the work, output and performance are basically identical.

@GWicke Indeed; although output can be fairly different too ;-) (pdf-electron vs Athena). AFAIK a lot of work went into Athena to support languages. Anyway, my point was that there might be some pages to take from that book :)

(disclaimer: Athena was written by one of my colleagues)

@JeanFred, that looks more like a missing font to me. I installed a couple in the labs instance, but no doubt some more are needed for less common scripts.

Sure. I was under the impression that WMF folks were writing that service, and was very much wondering why you would go through that trouble if there was something already available. I see now you are indeed using an available solution, so it makes more sense now. :)

GWicke claimed this task.

As a result of this evaluation, we have decided on going with electron in the form of https://github.com/msokk/electron-render-service. T142226 is tracking the follow-up productization work.

Closing this task, as the evaluation is done.