We have been discussing simple browser-based options for PDF generation since the planning phase of the OCG project. This task is intended to collect options & issues.
Why browser-based PDF generation?
Browser-based PDF generation is attractive as an alternative to the LaTeX-based OCG for several reasons:
- Complete content coverage: Browsers are able to render HTML / CSS / JS content natively. This includes features like maps (T70008) and tables.
- Render performance: Complex pages can generally be rendered in a few seconds, which avoids the need for custom queueing and caching solutions.
- Simplicity of a stateless service: DC fail-over and operations are significantly simplified with a stateless service.
- Single print solution for server-side & client-side: Print stylesheet improvements benefit both client-side printing (Ctrl-P in the browser, or the printable page) and the server-side equivalent.
The main downside compared to OCG is a generally less beautiful print layout, reflecting the still-spotty print CSS support in major browsers.
Why provide a server-side PDF render option?
While all desktop browsers, some mobile browsers (Firefox) and several third-party web services support print-to-PDF functionality, there might still be some value in providing a simple, reliable & ad-free solution linked from the Wikipedia UI itself. The benefit for desktop browsers is arguably small (File > Print even avoids a re-download of the data), but most mobile browsers do not support printing to PDF. Mobile is dominant in developing countries, where connectivity is often limited. Users there can benefit especially from a prominent & easy-to-use PDF download feature as a way to deal with poor connectivity. That said, given the relatively rich list of alternatives, a case can be made that we should limit the resources we invest in providing an alternative PDF render feature.
Option: PhantomJS 2
PhantomJS 2 has been updated to a fairly recent Blink / Chrome version. It is available in Debian & other distros. PDFs can be easily created using an invocation like this:
phantomjs /usr/share/doc/phantomjs/examples/rasterize.js 'http://en.wikipedia.org/wiki/Barack_Obama' /dev/stdout a4 > /tmp/obama.pdf
Latency is low, at about 2.3s for the Barack Obama article on a laptop with a cable connection. With render times in the single-digit seconds, there is no need for custom caching or state in the service, which greatly simplifies the infrastructure.
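The invocation above can be wrapped in a small stateless function. The following is a minimal Python sketch, not a definitive implementation: the names `build_rasterize_command` and `render_pdf` and the 30-second timeout are assumptions made for illustration, while the script path and arguments mirror the command shown above.

```python
import subprocess

# Path taken from the invocation above; it may differ on other distros.
RASTERIZE_JS = "/usr/share/doc/phantomjs/examples/rasterize.js"


def build_rasterize_command(url, paper="a4"):
    """Return the argv list that renders `url` as a PDF written to stdout."""
    return ["phantomjs", RASTERIZE_JS, url, "/dev/stdout", paper]


def render_pdf(url, paper="a4", timeout_s=30):
    """Render `url` and return the PDF bytes.

    Nothing is cached or stored between calls (hypothetical helper;
    the timeout value is an arbitrary assumption).
    """
    result = subprocess.run(
        build_rasterize_command(url, paper),
        stdout=subprocess.PIPE,   # the PDF arrives on stdout
        stderr=subprocess.PIPE,
        timeout=timeout_s,        # abort renders that hang
        check=True,               # raise if phantomjs exits non-zero
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as `&` or `'`.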
Issues with PhantomJS 2
- Links: Due to a limitation in the Qt5 PDF rendering backend, links are currently not clickable in the generated PDF. A fix has landed in Qt5, so this issue should be resolved soon.
- Page breaks: Chrome, and by extension PhantomJS 2, have relatively poor support for controlling page breaks using CSS (break-before, break-after etc). The most visible consequence is that headings are often placed at the end of a page, while the corresponding content follows on the next page. While ugly, this does not prevent access to the content.
Option: wkhtmltopdf
The stable version of the wkhtmltopdf tool is based on an older WebKit version (equivalent to PhantomJS 1). Like PhantomJS, it has issues with Qt's hyperlink support. Some builds of wkhtmltopdf bundle a patched version of Qt, which allows wkhtmltopdf to generate (mostly) properly hyperlinked documents. Newer versions of wkhtmltopdf are expected to pick up the same Qt5 fixes as discussed for PhantomJS.
Overall, there seem to be few reasons to use wkhtmltopdf over PhantomJS or Electron at this point.
Option: Electron render service
A web service wrapping Chromium 49 (as of April 2016), running under Xvfb.
- Recent Chromium version, and a track record of timely updates.
- Links are working well.
- Comes with a simple / stateless web render service.
Issues with the Electron render service
- Page breaks: Chrome has relatively poor support for controlling page breaks using CSS (break-before, break-after etc). The most visible consequence is that headings are often placed at the end of a page, while the corresponding content follows on the next page. While ugly, this does not prevent access to the content.
- Performance is slightly worse than PhantomJS, but still < 10s for most pages. Link resolution might be responsible for part of the difference.
You can try any URL by changing the URL in https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=<YOUR URL HERE>.
- Barack Obama, Barack_Obama (parsoid)
- San Francisco, San Francisco (parsoid)
- hi:ऑस्ट्रेलिया का इतिहास, hi:ऑस्ट्रेलिया का इतिहास (parsoid)
- ar:الحرب_العالمية_الأولى, ar:الحرب_العالمية_الأولى (parsoid)
- de:Berlin, de:Berlin (parsoid)
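Several of the example titles above contain spaces or non-ASCII characters, so the page URL should be percent-encoded before it is placed in the `url` parameter. A minimal Python sketch follows; the endpoint and parameter names are copied from the example URL above, the helper name `pdf_request_url` is hypothetical, and it is an assumption that the service decodes the query string in the standard way.

```python
from urllib.parse import urlencode

# Endpoint and parameter names are taken from the example URL above.
SERVICE = "https://pdf-electron.wmflabs.org/pdf"


def pdf_request_url(page_url, access_key="secret"):
    """Return a render-service URL with `page_url` percent-encoded.

    Hypothetical helper; assumes standard query-string decoding
    on the service side.
    """
    query = urlencode({"accessKey": access_key, "url": page_url})
    return SERVICE + "?" + query


print(pdf_request_url("https://de.wikipedia.org/wiki/Berlin"))
# prints https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https%3A%2F%2Fde.wikipedia.org%2Fwiki%2FBerlin
```

The same call works for titles such as hi:ऑस्ट्रेलिया का इतिहास, where `urlencode` converts the title to UTF-8 percent escapes so that the resulting URL is plain ASCII.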
Browser print improvement projects
- The Open Source Publishing initiative has done some recent work on html2print CSS / JS tools for better pagination & margin support. They are using CSS regions, which were briefly supported in Chrome but have since been dropped again. A polyfill is available at https://github.com/FremyCompany/css-regions-polyfill.
- The older BookJS project also used CSS regions & JS pagination.