We have been discussing simple browser-based options for PDF generation since the planning phase of the OCG project. This task is intended to collect options & issues.
## Why browser-based PDF generation?
Browser-based PDF generation is attractive as an alternative to the LaTeX-based OCG for several reasons:
- Complete content coverage: Browsers are able to render HTML / CSS / JS content natively. This includes features like maps (T70008) and tables.
- Render performance: Complex pages can generally be rendered in a few seconds, which avoids the need for custom queueing and caching solutions.
- Simplicity of a stateless service: DC fail-over and operations are significantly simplified with a stateless service.
- Single print solution for server-side & client-side: Print stylesheet improvements benefit both client-side printing (Ctrl-P in the browser, or the printable page) and the server-side equivalent.
The main downside compared to OCG is a generally less beautiful print layout, reflecting the still-spotty print CSS support in major browsers.
### Why provide a server-side PDF render option?
While all desktop browsers, some mobile browsers (Firefox) and several third-party [web](http://www.web2pdfconvert.com/) [services](https://www.printfriendly.com/) support print-to-PDF functionality, there might still be some value in providing a simple, reliable & ad-free solution linked from the Wikipedia UI itself. The benefit for desktop browsers is arguably small (File → Print even avoids a re-download of the data), but most mobile browsers do not support printing to PDF. Mobile is dominant in developing countries, where connectivity is often limited. Users there can benefit especially from a prominent & easy-to-use PDF download feature as a way to deal with poor connectivity. That said, given the relatively rich list of alternatives, a case can be made that we should limit the resources we invest in providing an alternative PDF render feature.
## Option: [PhantomJS](http://phantomjs.org/)
PhantomJS 2 has been updated to a fairly recent [Blink / Chrome](http://www.chromium.org/blink) version. It is available in Debian & other distros. PDFs can be easily created using an invocation like this:
`phantomjs /usr/share/doc/phantomjs/examples/rasterize.js 'http://en.wikipedia.org/wiki/Barack_Obama' /dev/stdout a4 > /tmp/obama.pdf`
Latency is low, at about 2.3s for the Barack Obama article on a laptop with a cable connection. With render times in the single-digit seconds, there is no need for custom caching or state in the service, which greatly simplifies the infrastructure.
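To illustrate how little state such a service needs, the invocation above could be wrapped roughly as follows. This is a minimal sketch, not a proposed implementation: the function names `build_command` and `render_pdf`, the URL scheme check, and the 30 second timeout are assumptions made for illustration. Only the `rasterize.js` path and the argument order are taken from the command line shown above.

```python
import subprocess

# Example script shipped with the Debian phantomjs package,
# as used in the command line above.
RASTERIZE_JS = "/usr/share/doc/phantomjs/examples/rasterize.js"


def build_command(url, paper="a4"):
    """Return the argv list for rendering `url` to a PDF on stdout."""
    # Assumed safety check: refuse anything that is not plain http(s).
    if not url.startswith(("http://", "https://")):
        raise ValueError("only http(s) URLs are accepted")
    return ["phantomjs", RASTERIZE_JS, url, "/dev/stdout", paper]


def render_pdf(url, paper="a4", timeout=30):
    """Run PhantomJS once and return the PDF bytes.

    No state is kept between calls; the timeout is an assumed value.
    """
    result = subprocess.run(
        build_command(url, paper),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with article titles, and each request maps to exactly one short-lived process.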
### Issues with PhantomJS 2
- **Links**: Due to a limitation in the Qt5 PDF rendering backend, links are currently not clickable in the generated PDF. A [fix has landed in Qt5](https://github.com/ariya/phantomjs/issues/10196), so this issue should be resolved soon.
- **Page breaks**: Chrome, and by extension PhantomJS 2, has relatively poor support for controlling page breaks using CSS (`break-before`, `break-after` etc). The most significant manifestation of this is that headings are often placed at the end of a page, while the corresponding content follows on the next page. While ugly, this does not prevent access to the content.
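To check whether a given renderer honors these properties, a small test page can be generated and fed through it. The following is only a sketch for reproducing the heading issue: the names `PRINT_CSS` and `make_test_page`, the filler text, and the section counts are all made up for illustration. The rules use the standard `break-after: avoid` together with its legacy alias `page-break-after: avoid`.

```python
# Print rules asking the renderer not to break a page right after a heading.
# `break-after` is the current property; `page-break-after` is its legacy alias.
PRINT_CSS = """
@media print {
  h1, h2, h3 {
    break-after: avoid;
    page-break-after: avoid;
  }
}
"""


def make_test_page(sections=40, paragraphs=3):
    """Build an HTML page with many short sections to provoke page breaks."""
    filler = "<p>" + "Lorem ipsum dolor sit amet. " * 20 + "</p>"
    body = "".join(
        "<h2>Section {}</h2>{}".format(i, filler * paragraphs)
        for i in range(1, sections + 1)
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        "<style>{}</style></head><body>{}</body></html>".format(PRINT_CSS, body)
    )
```

Writing the result of `make_test_page()` to a file and rendering it with the renderer under test shows whether any heading still ends up stranded at the bottom of a page.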
### Example output
- [Barack Obama](https://people.wikimedia.org/~gwicke/obama_phantomjs.pdf)
- [San Francisco](https://people.wikimedia.org/~gwicke/sf_phantomjs.pdf)
- [hi:ऑस्ट्रेलिया का इतिहास](https://people.wikimedia.org/~gwicke/1g0m_phantomjs.pdf)
- [ar:الحرب_العالمية_الأولى](https://people.wikimedia.org/~gwicke/ar_phantomjs.pdf)
- [de:Berlin](https://people.wikimedia.org/~gwicke/berlin_phantomjs.pdf)
## Option: [wkhtmltopdf](http://wkhtmltopdf.org/)
The stable version of the wkhtmltopdf tool is based on an older WebKit version (equivalent to PhantomJS 1). Like PhantomJS, it has issues with Qt's hyperlink support. Some builds of wkhtmltopdf bundle a patched version of Qt, which allows wkhtmltopdf to generate (mostly) properly hyperlinked documents. Newer versions of wkhtmltopdf are expected to pick up the same Qt5 fixes as discussed for PhantomJS.
Overall, there seem to be few reasons to use wkhtmltopdf over PhantomJS at this point.
## Option: [Electron render service](https://github.com/msokk/electron-render-service)
A web service wrapping Electron (which embeds Chromium), running under xvfb. Links work quite well.
### Example output
- [Barack Obama](https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://en.wikipedia.org/wiki/Barack_Obama)
- [San Francisco](https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://en.wikipedia.org/wiki/San_Francisco)
- [hi:ऑस्ट्रेलिया का इतिहास](https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://hi.wikipedia.org/wiki/ऑस्ट्रेलिया_का_इतिहास)
- [ar:الحرب_العالمية_الأولى](https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://ar.wikipedia.org/wiki/الحرب_العالمية_الأولى)
- [de:Berlin](https://pdf-electron.wmflabs.org/pdf?accessKey=secret&url=https://de.wikipedia.org/wiki/Berlin)
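Request URLs like the ones in this list are easy to get wrong for titles containing spaces or non-ASCII characters, so it may help to build them programmatically. The sketch below assumes the `/pdf` endpoint with `accessKey` and `url` query parameters exactly as they appear in the links above; the helper names `article_url` and `render_url` are hypothetical.

```python
from urllib.parse import quote, urlencode

# Endpoint as it appears in the example links above.
SERVICE = "https://pdf-electron.wmflabs.org/pdf"


def article_url(lang, title):
    """Return the article URL for `title` on the `lang` Wikipedia."""
    # MediaWiki treats spaces and underscores in titles as equivalent.
    path = quote(title.replace(" ", "_"), safe=":_()")
    return "https://{}.wikipedia.org/wiki/{}".format(lang, path)


def render_url(lang, title, access_key="secret"):
    """Return the render-service URL that produces a PDF of the article."""
    # urlencode escapes the nested URL so it survives as one query value.
    query = urlencode({"accessKey": access_key, "url": article_url(lang, title)})
    return "{}?{}".format(SERVICE, query)
```

For example, `render_url("en", "San Francisco")` yields a URL whose `url` parameter is the escaped form of `https://en.wikipedia.org/wiki/San_Francisco`.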
## Browser print improvement projects
- The [Open Source Publishing initiative](http://osp.kitchen/) has done some recent work on [html2print CSS / JS tools](https://github.com/osp/osp.tools.html2print) for better pagination & margin support. They are using CSS regions, which were briefly supported in Chrome but have since been dropped again. A polyfill is available at https://github.com/FremyCompany/css-regions-polyfill.
- The older [BookJS project](https://github.com/booktype/BookJS) also used CSS regions & JS pagination.