Productize the Electron PDF render service & create a REST API end point
Closed, Resolved, Public
Assigned To

Authored By

	• GWicke
	Aug 5 2016, 5:12 PM

Description

In T134205 we tested different browser-based PDF rendering options, and identified the electron render service as a clear winner. Wikimedia Germany has been working on improving table support (see T73808), which is a community wishlist item from the German Wikipedia. They have since asked the German community for feedback on the Electron rendering, and so far there is unanimous support.

As a next step, the WMDE-TechWish is looking to offer the Electron based rendering as part of the "This page as PDF" functionality. To support this, we need

a production deploy of the electron rendering service, and
a production API end point.

Service deployment

The electron render service is a stateless third-party node service based on Electron / Chrome. Resource usage is fairly moderate, with most pages taking 1-2s to render. Based on OCG request rates, we expect about 2 req/s initially. Each render worker (fixed number) typically uses ~120-200 MB RAM, peaking at ~500 MB for really large documents. Resource usage is bounded primarily with a configurable render timeout, which in stress testing triggered reliably & immediately freed resources. Given limited resource usage and stateless operation, the most obvious deploy target would be the SCB cluster.
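As a rough sanity check on the figures above, here is a back-of-the-envelope sketch using Little's law (average renders in flight = request rate x render time). The function name is hypothetical, and the estimate ignores headroom for bursts, so treat it as an illustration rather than a sizing recommendation.

```python
import math

def estimate_capacity(req_per_s, render_s, typical_mb, peak_mb):
    """Estimate busy workers and memory from the figures quoted in the description."""
    # Little's law: average renders in flight = arrival rate x render time.
    busy_workers = math.ceil(req_per_s * render_s)
    return {
        "workers": busy_workers,
        "typical_mb": busy_workers * typical_mb,
        "peak_mb": busy_workers * peak_mb,
    }

# 2 req/s, worst-case 2 s per render, 200 MB typical and 500 MB peak per worker.
estimate = estimate_capacity(req_per_s=2, render_s=2, typical_mb=200, peak_mb=500)
print(estimate)  # {'workers': 4, 'typical_mb': 800, 'peak_mb': 2000}
```

Even if every busy worker hit the peak at once, this stays around 2 GB, which is consistent with the "fairly moderate" resource usage described above.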

The service's NPM install pulls in a binary Electron build from upstream. While this is not ideal, it is partly reflective of the fast pace of Electron development. Packaging Electron as a deb would likely be non-trivial, as it essentially involves a full build of Chromium & all its dependencies. For now, the easiest option will be to check the binary dependency into the deploy repository, in line with other binary modules.

Security considerations

The underlying rendering engine (Chromium) is a complex piece of software with a large attack surface, but has many layered security measures in place to prevent attacks. In combination with firejail and systemd limits, the risk of local exploits should be fairly low. Setting up firejail with X11 / xvfb support is a bit tricky. Options for doing so as well as other options for locking down Electron further are discussed in T143336.

The service loads an HTML page given by a supplied URL, and then loads any resources linked from that HTML, as a browser would. The service will only be exposed through a "this page as HTML" API, which means that we can & will restrict the loaded pages to sanitized article HTML. While sanitization ensures that this HTML does not contain references to resources on 10.* IPs, it would be good to not rely on this exclusively. An option for restricting access from this service to public IPs would be to set up an iptables rule matching on the service user, and dropping any requests to the private production IPs. Another might be to use a proxy, although this would likely affect performance negatively. This sub-issue is tracked in T148567.
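To illustrate the kind of check involved, here is a minimal application-level sketch that classifies an address as public or not. This is not the iptables rule discussed above, only a stand-in for the same idea; the function name is hypothetical, and a real deployment would enforce the restriction at the network layer.

```python
import ipaddress

def is_public_address(ip_text):
    """Return True only for globally routable addresses."""
    ip = ipaddress.ip_address(ip_text)
    # is_global is False for private (10.*, 192.168.*), loopback,
    # link-local and other non-routable ranges.
    return ip.is_global

for candidate in ("10.64.0.1", "127.0.0.1", "8.8.8.8"):
    verdict = "allow" if is_public_address(candidate) else "drop"
    print(candidate, verdict)
```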

Public API & caching

An obvious place for a PDF render end point is /api/rest_v1/page/pdf/{title}, in line with other formats like html, data-parsoid, mobile sections etc. @Pchelolo has already prototyped a spec for this end point.
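For illustration, a client could build a request URL for the proposed end point roughly as follows. The host and the helper name are assumptions; only the /api/rest_v1/page/pdf/{title} path comes from the proposal above.

```python
from urllib.parse import quote

def pdf_url(title, host="en.wikipedia.org"):
    """Build a URL for the proposed PDF end point (host is an assumption)."""
    # Titles use underscores for spaces; everything else unsafe is
    # percent-encoded, including "/" so the title stays in one path segment.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{host}/api/rest_v1/page/pdf/{encoded}"

print(pdf_url("Barack Obama"))
# https://en.wikipedia.org/api/rest_v1/page/pdf/Barack_Obama
```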

Given the fairly efficient render backend & low expected request volumes, basic Varnish caching & a relatively low per-IP rate limit should be sufficient to ensure reliable operation. Initially, a relatively low TTL & no purging should be sufficient. If request volume becomes an issue, we can move to active purging & longer TTLs.
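The per-IP rate limit could be sketched as a token bucket per client address. This is only a conceptual illustration: the class name and the rate/burst values are made up, and in production the limit would presumably be enforced in the API or caching layer rather than in code like this.

```python
import time

class PerIpRateLimiter:
    """Token bucket per client IP: `rate` requests/s, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}  # ip -> (tokens, last_seen)

    def allow(self, ip):
        now = self.clock()
        tokens, last = self.buckets.get(ip, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[ip] = (tokens - 1, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False

limiter = PerIpRateLimiter(rate=1, burst=2)
print([limiter.allow("198.51.100.7") for _ in range(3)])
```

Requests that are rejected here would typically be answered with an HTTP 429 status.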

Ownership

We (Services) will own the backend service & API end point. As this is a generic / stateless third party service under active development, we expect it to require little ongoing maintenance effort. The service also supports rasterizing web content including SVGs, which might come in handy for other internal uses (SVG to PNG, visual diffing) in the future.

Wikimedia Germany's WMDE-TechWish is looking into exposing "This page as PDF" functionality in the UI. The WMF #reading team & the WMDE-TechWish are working on improving print styling in general: T135022, T142207.

Other notes

fonts module pulling in pretty much all fonts we'll need.

Related Objects

Status	Assigned	Task
Resolved	• Jhernandez	T148358 Prepare for November user research
Resolved	• atgo	T148359 Complete Wikilater prototype for testing
Declined	None	T149627 Update Wikilater to use Electron service in production
Resolved	None	T148364 Complete Quickfact prototype for testing
Declined	None	T149628 Update Flashcard to use Electron service in production
Resolved	• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved	None	T135643 Show tables in pdfs (#9)
Resolved	WMDE-Fisch	T135616 Investigate underlying issues with tables in PDF rendering
Resolved	Addshore	T135613 [GTWL] Include hint about excluded tables when generating a PDF
Invalid	None	T137431 Improve patch to show tables in pdfs
Invalid	None	T137432 Add a table appendix to pdfs
Resolved	Tobi_WMDE_SW	T142201 Create a mediawiki extension for browser-based rendered pdf support
Resolved	Tobi_WMDE_SW	T142202 Only render single articles, and not collections or books
Resolved	gabriel-wmde	T142204 Investigate what is needed to use browser based rendering for books
Resolved	Addshore	T145413 Request repository for browser-based rendered pdf support extension
Resolved	Tobi_WMDE_SW	T146894 Replace "Printable version" with link to ElectronPdfService
Resolved	Tobi_WMDE_SW	T146895 Implement SpecialPage according to mockup
Resolved	Tobi_WMDE_SW	T147842 Update extension documentation on mw.org
Declined	None	T149086 Implement caching for ElectronPdfService extension
Resolved	Tobi_WMDE_SW	T149189 Add and enable basic browsertests
Resolved	• GWicke	T142226 Productize the Electron PDF render service & create a REST API end point
Resolved	• mobrovac	T143129 New service request - PDF Render
Resolved	• Pchelolo	T143132 Expose the PDF rendering service via RESTBase
Resolved	• mobrovac	T143336 Investigate better protection modes for electron render service (xvfb setuid)
Resolved	• dpatrick	T148567 Restrict outgoing network connections from Electron render service
Resolved	• dpatrick	T148576 Security review request: Electron render service
Resolved	Lea_WMDE	T143410 Voices from the Community about current pdf use
Resolved	Addshore	T150326 Track numbers for Electron- vs. OCG-Rendering

Event Timeline

Restricted Application added subscribers: Malyacko, Aklapper. Aug 5 2016, 5:12 PM

JeanFred subscribed. Aug 5 2016, 7:04 PM

Lea_WMDE added a parent task: T135643: Show tables in pdfs (#9). Aug 8 2016, 10:13 AM

Lea_WMDE updated the task description. Aug 8 2016, 1:42 PM

For the deployment part, I guess we'll need to follow our own guidelines (if only in a redux form, since we'll be running a third-party service) :)

Lea_WMDE mentioned this in T135643: Show tables in pdfs (#9). Aug 8 2016, 2:35 PM

• GWicke updated the task description. Aug 8 2016, 4:04 PM

MoritzMuehlenhoff subscribed. Aug 8 2016, 4:26 PM

Chrome/Chromium has a very fast-moving release cycle with updates every few weeks (and sometimes with only a week in between releases). How is Electron keeping up / in sync with Chromium? It sounds as if it will be non-trivial to keep Electron and Chromium in sync/compatible. This year has seen 10 DSAs for Chromium in stable so far.

@MoritzMuehlenhoff, that's one of the reasons why Electron is distributing their own binary build of the exact Chromium version they are supporting. They are following Chrome stable fairly closely. The current build is using Chrome 52.

Edit: Also look for "chrome" in https://github.com/electron/electron/releases.

Is there a timeline for finishing the MediaWiki-integration work and replacing OCG? Should we align such a timeline with the deployment of this service? (i.e. is there any particular reason to deploy this ASAP, rather than wait for the rest of this product's components to be done?)

How are we making sure that we're not going to end up with both this and OCG, both half-baked, in prod at the same time? Is someone owning the transition end-to-end?

(This is -at least- the third incarnation of PDF rendering. I hope you can understand my skepticism. This definitely looks interesting from a technology perspective, though — and thanks for working on that!)

I’m not sure if we are ready for a timeline just yet. At this point, what we know is this:

There is one community (de-wiki) who had a unanimous vote LINK to support the current version of the Electron rendered PDF, even if no further styling is done, as a parallel, complementary option
There are quite some more people upset about the fact that the current OCG version misses certain features such as tables (see e.g. the discussion on Jimbo’s talk page)
This also seems to come up as an issue at least to the German support team multiple times a week

By implementing the solution ASAP, we don’t spend a terrible amount of resources. However, we gain

at least one thankful community (but probably more)
relaxation for the support teams, even before this is solved „once and for all“
more insights into what people actually want when clicking „download this as pdf“
momentum behind improving the printing layout
a better basis to take a decision how to proceed in the future

Another thing to factor in is that often, the discussion is focused on the 'PDF rendering' and less on the fact that OCG and its predecessor were used for printing "collections" of pages. That is the killer feature of that product. Till that feature is replicated, I don't see OCG getting replaced. On the other hand, OCG doesn't look like it will have full table support either (except incremental support).

So, neither option (Electron PDF rendering or OCG as it exists today) is going to be an adequate replacement, but something that brings those together probably will be (for example, if someone decides to replicate the book printing ability around the Electron PDF rendering ability).

Anyway, I think this still needs a product owner (not just a technology owner of one part of the product) to look at this as a full product offering and figure out where to go with it. Maybe WMDE is in a position to take that on?

greg subscribed. Aug 9 2016, 4:20 PM

At our discussion at Wikimania, I made the case that the wikibooks use cases might be better served by a more book-specific tool. This could just be OCG (or something based on the kiwix bundler) run as a CLI (or a service) on a labs instance, and could provide access to the intermediate LaTeX source for full layout control. If you look at https://en.wikibooks.org/wiki/Wikibooks:Featured_books, all those are already linking to a PDF uploaded to commons. A community-maintained tool should fit well into this existing workflow, and would likely spur innovation in LaTeX-based book render tools.

Additionally, basically all books already provide a page transcluding all chapters into a single "printable" HTML page, which can be printed through the browser-based PDF render service. All but one featured wikibook I tried finish browser-based rendering in well under one minute [1]. The exception is https://en.wikipedia.org/wiki/Book:Health_care, which prints to 1900+ pages, and doesn't finish rendering using OCG either. A future iteration of collections or reading lists could follow the "printable page" approach as well. Alternatively, chapters could be converted to PDF individually, and PDF tools could be used to combine them to a single large PDF.

In any case, unless a team clearly makes the case for OCG as a product & takes on full ownership, I don't see how we can keep OCG in production in the medium term.

[1]: Timings from the hackathon notes, using a small labs instance:

Wikibooks LaTeX: 140 pages, ~16s render time in labs
Wikibooks C programming: 198 pages, ~12s render time in labs
Wikibooks Control Systems: 147 pages, 25s render time
Wikibooks Haskell: 440 pages, 36s render time
Enwiki Barack Obama: 47 pages, 5s render time
Book with loads of pages: Health Care (~1900 pages, static test copy): Does not render within timeout, but timeout triggers reliably & frees resources.
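To make these timings easier to compare, the following sketch converts them into pages rendered per second. The figures are copied from the list above (several are approximate), and the Health Care book is left out because it did not finish within the timeout.

```python
# (name, pages, render seconds) taken from the hackathon notes above.
timings = [
    ("Wikibooks LaTeX", 140, 16),
    ("Wikibooks C programming", 198, 12),
    ("Wikibooks Control Systems", 147, 25),
    ("Wikibooks Haskell", 440, 36),
    ("Enwiki Barack Obama", 47, 5),
]

def pages_per_second(pages, seconds):
    return pages / seconds

rates = {name: round(pages_per_second(p, s), 1) for name, p, s in timings}
for name, rate in rates.items():
    print(f"{name}: {rate} pages/s")
```

Throughput varies by roughly a factor of three between documents, so page count alone is only a loose predictor of render time.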

In T142226#2537844, @GWicke wrote:

At our discussion at Wikimania, I made the case that the wikibooks use cases might be better served by a more book-specific tool. This could just be OCG (or something based on the kiwix bundler) run as a CLI (or a service) on a labs instance, and could provide access to the intermediate LaTeX source for full layout control. If you look at https://en.wikibooks.org/wiki/Wikibooks:Featured_books, all those are already linking to a PDF uploaded to commons. A community-maintained tool should fit well into this existing workflow, and would likely spur innovation in LaTeX-based book render tools.
...
...
In any case, unless a team clearly makes the case for OCG as a product & takes on full ownership, I don't see how we can keep OCG in production in the medium term.

All this is fine. I just indicated that there needs to be someone who takes ownership of this books + pdf-rendering product, engages with the existing user base, communicates upcoming changes to the product (if, for example, OCG is being shut down), and adjusts expectations accordingly (one of the many things you outlined above). As the last person stuck with the OCG product, I don't want @cscott at the receiving end and getting embroiled in unnecessary heated debate if / when OCG shuts down. So, there needs to be a clear process and handling of this. Either both products stay alive in whatever limbo state or someone takes ownership of the full product space and charts a forward path. I have expressed clearly in an earlier email thread that OCG will be primarily in maintenance mode (barring minor incremental things at hackathons, etc.) and Scott shouldn't undertake major new development on OCG without there being a product owner for it. That is the extent of what I can do given the current state of affairs.

Just to be clear, I am not opposed to the new service. The output looks good and we've always known that browser-based rendering might be the solution going forward. Scott agrees as well. I was just suggesting that the path to getting to a single product isn't clear and unless someone takes up product ownership for this space, I don't know who will make the call for turning off OCG in favour of this new thing (unless it comes from users themselves).

I don't know who will make the call for turning off OCG in favour of this new thing (unless it comes from users themselves).

One important input for justifying spending donor money is usage data. We should learn fairly quickly whether most users are satisfied with browser-based rendering, and what & how frequent the remaining use cases & pain points are.

While this doesn't magically make the decision by itself, it should at least make it easier to come to a well-informed agreement.

WMDE-leszek subscribed. Aug 10 2016, 9:29 AM

• mobrovac added a project: User-mobrovac. Aug 10 2016, 4:55 PM

• Gilles subscribed. Aug 10 2016, 5:46 PM

To get information on the relative frequency of single-page vs. multi-page collection requests, I created a patch adding a statsd metric in OCG: https://gerrit.wikimedia.org/r/304043
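As an illustration of what such a metric could look like, here is a minimal sketch that emits a StatsD counter over UDP depending on whether a request covers one page or a collection. This is not the code from the linked patch; the metric names, prefix, and function names are hypothetical.

```python
import socket

def counter_packet(metric, value=1):
    # StatsD plain-text counter format: <name>:<value>|c
    return f"{metric}:{value}|c".encode("ascii")

def record_request(page_count, sock, addr, prefix="ocg.pdf"):
    """Count a render request as single-page or collection (names are made up)."""
    kind = "single_page" if page_count == 1 else "collection"
    sock.sendto(counter_packet(f"{prefix}.{kind}"), addr)
    return kind

print(counter_packet("ocg.pdf.single_page"))  # b'ocg.pdf.single_page:1|c'
```

In use, `sock` would be a UDP socket such as `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)` and `addr` the StatsD host and port; the ratio of the two counters then gives the single-page share.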

• RobLa-WMF subscribed. Aug 10 2016, 6:05 PM

I'm with Subbu here, and specifically:

In T142226#2538081, @ssastry wrote:

All this is fine. I just indicated that there needs to be someone who takes ownership of this books + pdf-rendering product, engages with the existing user base, communicates upcoming changes to the product (if, for example, OCG is being shut down), and adjusts expectations accordingly (one of the many things you outlined above). As the last person stuck with the OCG product, I don't want @cscott at the receiving end and getting embroiled in unnecessary heated debate if / when OCG shuts down. So, there needs to be a clear process and handling of this. Either both products stay alive in whatever limbo state or someone takes ownership of the full product space and charts a forward path. I have expressed clearly in an earlier email thread that OCG will be primarily in maintenance mode (barring minor incremental things at hackathons, etc.) and Scott shouldn't undertake major new development on OCG without there being a product owner for it. That is the extent of what I can do given the current state of affairs.

Adding a new service into the mix now feels like working around the problem and offering two half-baked solutions to our users. It makes sense neither from a user/product perspective nor from an infrastructure perspective. It's great that Services is offering to own this, but they shouldn't feel the need to carry this burden alone, and we (incl. SRE) shouldn't operate yet another service without phasing out something else _and_ not offering anything new to users at the same time.

Finally, having more data about collections vs. single-page renderings is certainly very useful input (thanks @GWicke!), but I fear we're going to circle back to product ownership again. Who's going to interpret the data? Who's deciding where we draw the line and which threshold is low enough to kill collections? What if the numbers are too low because the product sucks and no one can use it, but is otherwise a good product to offer? These are all product decisions and we, in tech/infra, shouldn't decide on those matters, IMHO.

• mobrovac moved this task from Backlog to Next on the Services board. Aug 10 2016, 7:27 PM

@ssastry The problem on the WMDE side is that from autumn on I'm going to be on leave for 5 months, so product management capacity is scarce here these days.

However, I'm not sure if the introduction of an Electron service should be seen as a half-baked solution. The current situation is highly unsatisfactory to users. The introduction of the new service will not create a perfect-world situation immediately. But comparing the situation then to the situation now, I feel like people would see the new situation as a big improvement already - at least these were the responses when talking to the German community. Holding back on improving the situation (even if it is not perfect yet), does not seem like a good solution to me.

Coming back to the product ownership: Since there are people from different teams involved, maybe the solution would be some kind of a "triangle" ownership, held by all involved teams together?

@Lea_WMDE, we (the Services team) have already volunteered to take on technical and operational ownership of the service itself. However, what this project (or, rather, endeavour) lacks is product ownership and a clear plan and guidelines to move forward (complete with feature development, community engagement, etc). IMHO, it would not be good to split it amongst multiple teams and actors, as it would get us into the situation we are in right now with OCG (everybody is responsible, but nobody is at the same time).

If you take this on, would somebody from WMDE be able to take the lead during your absence?

Tobi_WMDE_SW subscribed. Aug 15 2016, 11:43 AM

If you take this on, would somebody from WMDE be able to take the lead during your absence?

@mobrovac Since it is vacation time here, I won't be able to answer that question very soon.

But what is other people's take on having a split ownership that works better than it does now for OCG?

I think we should avoid getting too far ahead of ourselves and making guesses about what the community will/will not find acceptable or "an improvement". Let's deploy some subset of the service to some subset of our users and find out what they actually think. In the best case they will identify the missing features they care most about and be motivated to help develop them (either within the OCG framework or the Electron framework).

• mobrovac mentioned this in T143129: New service request - PDF Render. Aug 16 2016, 5:43 PM

• mobrovac added a subtask: T143129: New service request - PDF Render.

• mobrovac added a subtask: T143132: Expose the PDF rendering service via RESTBase. Aug 16 2016, 5:47 PM

• dpatrick edited projects, added deprecated-security-team-reviews; removed acl*security. Aug 16 2016, 8:34 PM

-jem- subscribed. Aug 17 2016, 8:09 AM

Mentioned in SAL [2016-08-17T21:47:24Z] <cscott> updated OCG to version e3e0fd015ad8fdbf9da1838c830fe4b075c59a29 (T133001, T142226)

In T142226#2540859, @GWicke wrote:

To get information on the relative frequency of single-page vs. multi-page collection requests, I created a patch adding a statsd metric in OCG: https://gerrit.wikimedia.org/r/304043

@cscott merged & deployed this today. The metrics are available at https://grafana-admin.wikimedia.org/dashboard/db/collection_use. So far, about 97% of PDF requests are for single pages.

In T142226#2563267, @GWicke wrote:

In T142226#2540859, @GWicke wrote:

To get information on the relative frequency of single-page vs. multi-page collection requests, I created a patch adding a statsd metric in OCG: https://gerrit.wikimedia.org/r/304043

@cscott merged & deployed this today. The metrics are available at https://grafana-admin.wikimedia.org/dashboard/db/collection_use. So far, about 95-96% of PDF requests are for single pages.

Great! As I said on IRC earlier today, it would be good to separate stats by user-agent, i.e. what proportion of these requests are bot / crawler driven?

• mobrovac added a subtask: T143336: Investigate better protection modes for electron render service (xvfb setuid). Aug 18 2016, 4:55 PM

• Elitre subscribed. Aug 18 2016, 5:33 PM

Arlolra subscribed. Sep 16 2016, 12:29 AM

• GWicke edited projects, added Services (next); removed Services. Oct 12 2016, 3:31 PM

• GWicke triaged this task as Medium priority. Oct 12 2016, 6:00 PM

• GWicke edited projects, added Services (blocked); removed Services (next).

• GWicke mentioned this in T134205: Options for browser-based server-side PDF generation. Oct 12 2016, 10:05 PM

• GWicke mentioned this in T78579: SVG to PNG conversion, minimization, sanitization service. Oct 12 2016, 10:43 PM

• GWicke added a project: Electron-PDFs. Oct 17 2016, 2:37 PM

• atgo mentioned this in T148364: Complete Quickfact prototype for testing. Oct 17 2016, 2:49 PM

• atgo mentioned this in T148359: Complete Wikilater prototype for testing.

• atgo subscribed.