In T134205 we tested different browser-based PDF rendering options, and identified the electron render service as a clear winner. Wikimedia Germany has been working on improving table support (see T73808), which is a community wishlist item from the German Wikipedia. They have since asked the German community for feedback on the Electron rendering, and so far there is unanimous support.
As a next step, the WMDE-TechWish is looking to offer the Electron-based rendering as part of the "This page as PDF" functionality. To support this, we need:
- a production deploy of the electron rendering service, and
- a production API end point.
Service deployment
The electron render service is a stateless third-party Node.js service based on Electron / Chrome. Resource usage is fairly moderate, with most pages taking 1-2s to render. Based on OCG request rates, we expect about 2 req/s initially. Each render worker (the number of workers is fixed) typically uses ~120-200 MB of RAM, peaking at ~500 MB for really large documents. Resource usage is bounded primarily by a configurable render timeout, which triggered reliably in stress testing and immediately freed resources. Given the limited resource usage and stateless operation, the most obvious deploy target would be the SCB cluster.
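As a rough sanity check of the figures above, the expected number of busy workers and their memory footprint can be estimated with Little's law (average concurrency = arrival rate x service time). This is only a back-of-the-envelope sketch: `estimate_capacity` is a hypothetical helper, the RAM figures are taken as megabytes, and real traffic will be burstier than a steady average.

```python
def estimate_capacity(req_per_s, render_s, ram_mb_per_worker):
    """Return (average concurrent renders, total RAM in MB) via Little's law."""
    concurrent = req_per_s * render_s
    return concurrent, concurrent * ram_mb_per_worker


# Figures quoted above: ~2 req/s, 1-2 s per render, ~120-200 MB per worker.
low = estimate_capacity(2, 1, 120)
high = estimate_capacity(2, 2, 200)
# Pessimistic case: every busy worker hits the ~500 MB peak at once.
peak = estimate_capacity(2, 2, 500)

print("typical:", low, "to", high)
print("pessimistic:", peak)
```

Under these assumptions a handful of workers and a few GB of RAM cover the expected initial load, with headroom still needed for bursts.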
The service's NPM install pulls in a binary Electron build from upstream. While this is not ideal, it is partly reflective of the fast pace of Electron development. Packaging Electron as a deb would likely be non-trivial, as it essentially involves a full build of Chromium & all its dependencies. For now, the easiest option will be to check the binary dependency into the deploy repository, in line with other binary modules.
Security considerations
The underlying rendering engine (Chromium) is a complex piece of software with a large attack surface, but has many layered security measures in place to prevent attacks. In combination with firejail and systemd limits, the risk of local exploits should be fairly low. Setting up firejail with X11 / xvfb support is a bit tricky. Options for doing so as well as other options for locking down Electron further are discussed in T143336.
The service loads an HTML page given by a supplied URL, and then loads any resources linked from that HTML, as a browser would. The service will only be exposed through a "this page as HTML" API, which means that we can & will restrict the loaded pages to sanitized article HTML. While sanitization ensures that this HTML does not contain references to resources on 10.* IPs, it would be good to not rely on this exclusively. An option for restricting access from this service to public IPs would be to set up an iptables rule matching on the service user, and dropping any requests to the private production IPs. Another might be to use a proxy, although this would likely affect performance negatively. This sub-issue is tracked in T148567.
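To illustrate the kind of address classification such a restriction relies on, here is a minimal sketch using Python's standard `ipaddress` module. It is not the iptables rule itself (which would be enforced in the kernel, per service user), and `is_public_address` is a hypothetical helper. Only 10.* is named above; rejecting loopback, link-local and other reserved ranges as well is an added assumption.

```python
import ipaddress


def is_public_address(addr: str) -> bool:
    """Return True only for addresses outside private and special-use ranges."""
    ip = ipaddress.ip_address(addr)
    # Assumption: besides private ranges such as 10.0.0.0/8, also reject
    # loopback, link-local, multicast and reserved addresses.
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
    )


for candidate in ("10.64.0.1", "127.0.0.1", "208.80.154.224"):
    print(candidate, is_public_address(candidate))
```

A check like this in the application would only be a second layer; a network-level rule has the advantage of also covering redirects and requests issued by the rendering engine itself.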
Public API & caching
An obvious place for a PDF render end point is /api/rest_v1/page/pdf/{title}, in line with other formats like html, data-parsoid, mobile sections etc. @Pchelolo has already prototyped a spec for this end point.
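As a sketch of how a client might address this end point, the helper below builds the request URL for a given title. The host name, the `pdf_url` helper, and the encoding rules (spaces as underscores, everything else percent-encoded including `/`) are assumptions for illustration; the prototyped spec is the authority on the actual contract.

```python
from urllib.parse import quote


def pdf_url(title: str, host: str = "en.wikipedia.org") -> str:
    """Build a URL for the proposed /page/pdf/{title} end point."""
    # Assumption: titles use underscores for spaces, and any "/" in a
    # title is percent-encoded so it is not read as a path separator.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/api/rest_v1/page/pdf/{encoded}"


print(pdf_url("Albert Einstein"))
print(pdf_url("AC/DC"))
```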
Given the fairly efficient render backend & low expected request volumes, basic Varnish caching & a relatively low per-IP rate limit should be sufficient to ensure reliable operation. Initially, a relatively low TTL & no purging should be sufficient. If request volume becomes an issue, we can move to active purging & longer TTLs.
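For illustration, a per-IP limit of this kind is often implemented as a token bucket: each client accrues tokens at a fixed rate up to a burst cap, and each request spends one. The sketch below is a generic in-memory version with hypothetical names and made-up limits; it does not describe how or where the production limit would actually be enforced.

```python
import time


class PerIpRateLimiter:
    """In-memory token bucket keyed by client IP (illustrative only)."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s   # tokens added per second
        self.burst = burst       # maximum tokens a client can hold
        self.clock = clock       # injectable clock, useful for testing
        self.buckets = {}        # ip -> (tokens, time of last update)

    def allow(self, ip):
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1
        if allowed:
            tokens -= 1
        self.buckets[ip] = (tokens, now)
        return allowed


# Example values only: 1 request/s sustained, bursts of up to 2.
limiter = PerIpRateLimiter(rate_per_s=1, burst=2)
print([limiter.allow("192.0.2.1") for _ in range(3)])
```

Because cached responses never reach the backend, a limit like this only needs to be strict enough to bound the rate of cache misses per client.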
Ownership
We (Services) will own the backend service & API end point. As this is a generic / stateless third party service under active development, we expect it to require little ongoing maintenance effort. The service also supports rasterizing web content including SVGs, which might come in handy for other internal uses (SVG to PNG, visual diffing) in the future.
Wikimedia Germany's WMDE-TechWish is looking into exposing "This page as PDF" functionality in the UI. The WMF #reading team & the WMDE-TechWish are working on improving print styling in general: T135022, T142207.
Other notes
- fonts module pulling in pretty much all fonts we'll need.