Productize the Electron PDF render service & create a REST API end point
Closed, ResolvedPublic

Description

In T134205 we tested different browser-based PDF rendering options, and identified the electron render service as a clear winner. Wikimedia Germany has been working on improving table support (see T73808), which is a community wishlist item from the German Wikipedia. They have since asked the German community for feedback on the Electron rendering, and so far there is unanimous support.

As a next step, the WMDE-TechWish is looking to offer the Electron based rendering as part of the "This page as PDF" functionality. To support this, we need

  • a production deploy of the electron rendering service, and
  • a production API end point.

Service deployment

The electron render service is a stateless third party node service based on Electron / Chrome. Resource usage is fairly moderate, with most pages taking 1-2s to render. Based on OCG request rates, we expect about 2 req/s initially. Each render worker (fixed number) typically uses ~120-200 MB RAM, peaking at ~500 MB for really large documents. Resource usage is bounded primarily with a configurable render timeout, which in stress testing triggered reliably & immediately freed resources. Given limited resource usage and stateless operation, the most obvious deploy target would be the SCB cluster.
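As a rough sanity check on these figures, the number of busy workers can be estimated with Little's law (mean concurrency = arrival rate × mean service time). This is only a back-of-the-envelope sketch: the `estimate_capacity` helper and the `headroom` factor are hypothetical, and the RAM figures are read as megabytes, which is an assumption.

```python
import math


def estimate_capacity(req_per_s, render_s, ram_mb_per_worker, headroom=2.0):
    """Estimate worker count and total RAM for a render pool.

    Little's law gives the mean number of renders in flight; `headroom`
    is an assumed safety factor, not a measured value.
    """
    concurrent = req_per_s * render_s           # mean renders in flight
    workers = math.ceil(concurrent * headroom)  # fixed pool size to provision
    return workers, workers * ram_mb_per_worker


# Figures from the description: ~2 req/s, up to ~2 s per render,
# ~500 MB peak per worker (pessimistic case).
workers, peak_ram_mb = estimate_capacity(2.0, 2.0, 500)
print(workers, peak_ram_mb)
```

Under these assumptions the pessimistic case stays within a few GB of RAM, which is consistent with the "fairly moderate" resource usage described above.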

The service's NPM install pulls in a binary Electron build from upstream. While this is not ideal, it is partly reflective of the fast pace of Electron development. Packaging Electron as a deb would likely be non-trivial, as it essentially involves a full build of Chromium & all its dependencies. For now, the easiest option will be to check the binary dependency into the deploy repository, in line with other binary modules.

Security considerations

The underlying rendering engine (Chromium) is a complex piece of software with a large attack surface, but has many layered security measures in place to prevent attacks. In combination with firejail and systemd limits, the risk of local exploits should be fairly low. Setting up firejail with X11 / xvfb support is a bit tricky. Options for doing so as well as other options for locking down Electron further are discussed in T143336.

The service loads an HTML page given by a supplied URL, and then loads any resources linked from that HTML, as a browser would. The service will only be exposed through a "this page as HTML" API, which means that we can & will restrict the loaded pages to sanitized article HTML. While sanitization ensures that this HTML does not contain references to resources on 10.* IPs, it would be good to not rely on this exclusively. An option for restricting access from this service to public IPs would be to set up an iptables rule matching on the service user, and dropping any requests to the private production IPs. Another might be to use a proxy, although this would likely affect performance negatively. This sub-issue is tracked in T148567.
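To illustrate which destinations such a rule would need to drop, here is a minimal sketch using Python's standard `ipaddress` module. The `is_internal_address` helper is hypothetical and is not part of the service. It classifies already-resolved IP addresses only, which is one reason a network-level rule (iptables) remains the more robust control.

```python
import ipaddress


def is_internal_address(ip: str) -> bool:
    """Return True for addresses the render service should never contact.

    Covers private ranges (including 10.0.0.0/8), loopback and link-local.
    Raises ValueError if `ip` is not a valid IP address literal.
    """
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


# Example: a private production-style address vs. a public one.
for candidate in ("10.64.0.1", "8.8.8.8"):
    print(candidate, is_internal_address(candidate))
```

A check like this could complement, but not replace, the firewall rule, since hostnames can resolve to internal addresses after the URL has been validated.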

Public API & caching

An obvious place for a PDF render end point is /api/rest_v1/page/pdf/{title}, in line with other formats like html, data-parsoid, mobile sections etc. @Pchelolo has already prototyped a spec for this end point.
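For illustration, a client could build the request URL for this end point roughly as follows. The `pdf_url` helper is hypothetical. It assumes the usual MediaWiki convention of replacing spaces with underscores and percent-encodes everything else, including slashes, so that the title stays a single path segment.

```python
from urllib.parse import quote


def pdf_url(domain: str, title: str) -> str:
    """Build the REST URL for the PDF rendering of a page title."""
    # MediaWiki titles use underscores instead of spaces; safe="" also
    # encodes "/" so the title cannot spill into extra path segments.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{domain}/api/rest_v1/page/pdf/{encoded}"


print(pdf_url("en.wikipedia.org", "Barack Obama"))
```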

Given the fairly efficient render backend & low expected request volumes, basic Varnish caching & a relatively low per-IP rate limit should be sufficient to ensure reliable operation. Initially, a relatively low TTL & no purging should be sufficient. If request volume becomes an issue, we can move to active purging & longer TTLs.
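The per-IP rate limit mentioned here would in practice be enforced at the caching / API layer. The following token-bucket sketch only illustrates the idea: the `PerIpRateLimiter` class, its parameters, and the example limits are all assumptions, not the production configuration.

```python
class PerIpRateLimiter:
    """Token bucket per client IP: `rate_per_s` refill, `burst` capacity."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.burst = burst
        self.state = {}  # ip -> (tokens remaining, timestamp of last update)

    def allow(self, ip: str, now: float) -> bool:
        """Return True if a request from `ip` at time `now` is admitted."""
        tokens, last = self.state.get(ip, (float(self.burst), now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.state[ip] = (tokens - 1.0, now)
            return True
        self.state[ip] = (tokens, now)
        return False


# Example: 1 request/s sustained with a burst of 2 (illustrative numbers).
limiter = PerIpRateLimiter(rate_per_s=1.0, burst=2)
print([limiter.allow("192.0.2.1", 0.0) for _ in range(3)])
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real deployment would use a monotonic clock and evict idle entries.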

Ownership

We (Services) will own the backend service & API end point. As this is a generic / stateless third party service under active development, we expect it to require little ongoing maintenance effort. The service also supports rasterizing web content including SVGs, which might come in handy for other internal uses (SVG to PNG, visual diffing) in the future.

Wikimedia Germany's WMDE-TechWish is looking into exposing "This page as PDF" functionality in the UI. The WMF #reading team & the WMDE-TechWish are working on improving print styling in general: T135022, T142207.

Other notes

Related Objects

Status    Assigned
Resolved  Jhernandez
Resolved  atgo
Declined  None
Resolved  None
Declined  None
Resolved  JKatzWMF
Resolved  None
Resolved  WMDE-Fisch
Resolved  Addshore
Invalid   None
Invalid   None
Resolved  Tobi_WMDE_SW
Resolved  Tobi_WMDE_SW
Resolved  gabriel-wmde
Resolved  Addshore
Resolved  Tobi_WMDE_SW
Resolved  Tobi_WMDE_SW
Resolved  Tobi_WMDE_SW
Declined  None
Resolved  Tobi_WMDE_SW
Resolved  GWicke
Resolved  mobrovac
Resolved  Pchelolo
Resolved  mobrovac
Resolved  dpatrick
Resolved  dpatrick
Resolved  Lea_WMDE
Resolved  Addshore

Event Timeline

For the deployment part, I guess we'll need to follow our own guidelines (if only in a redux form, since we'll be running a third-party service) :)

Chrome/Chromium has a very fast-moving release cycle, with updates every few weeks (and sometimes with only a week in between releases). How is Electron keeping up / staying in sync with Chromium? It sounds as if it will be non-trivial to keep Electron and Chromium in sync/compatible. This year has seen 10 DSAs for Chromium in stable so far.

@MoritzMuehlenhoff, that's one of the reasons why Electron is distributing their own binary build of the exact Chromium version they are supporting. They are following Chrome stable fairly closely. The current build is using Chrome 52.

Edit: Also look for "chrome" in https://github.com/electron/electron/releases.

Is there a timeline for finishing the MediaWiki-integration work and replacing OCG? Should we align such a timeline with the deployment of this service? (i.e. is there any particular reason to deploy this ASAP, rather than wait for the rest of this product's components to be done?)

How are we making sure that we're not going to end up with both this and OCG, both half-baked, in prod at the same time? Is someone owning the transition end-to-end?

(This is -at least- the third incarnation of PDF rendering. I hope you can understand my skepticism. This definitely looks interesting from a technology perspective, though — and thanks for working on that!)

I’m not sure if we are ready for a timeline just yet. At this point, what we know is this:

  • There is one community (de-wiki) who had a unanimous vote LINK to support the current version of the Electron rendered PDF, even if no further styling is done, as a parallel, complementary option
  • There are quite some more people upset about the fact that the current OCG version misses certain features such as tables (see e.g. the discussion on Jimbo’s talk page)
  • This also seems to come up as an issue at least to the German support team multiple times a week

By implementing the solution ASAP, we don’t spend a terrible amount of resources. However, we gain

  • at least one thankful community (but probably more)
  • relaxation for the support teams, even before this is solved "once and for all"
  • more insights into what people actually want when clicking "download this as pdf"
  • momentum behind improving the printing layout
  • a better basis to take a decision how to proceed in the future

Another thing to factor in is that often, the discussion is focused on the 'PDF rendering' and less on the fact that OCG and its predecessor were used for printing "collections" of pages. That is the killer feature of that product. Till that feature is replicated, I don't see OCG getting replaced. On the other hand, OCG doesn't look like it will have full table support either (except incremental support).

So, neither option (Electron PDF rendering or OCG as it exists today) is going to be an adequate replacement, but something that brings the two together probably will be (for example, if someone decides to replicate the book printing ability on top of the Electron PDF rendering).

Anyway, I think this still needs a product owner (not just a technology owner of one part of the product) to look at this as a full product offering and figure out where to go with this. Maybe WMDE is in a position to take that on?

At our discussion at Wikimania, I made the case that the wikibooks use cases might be better served by a more book-specific tool. This could just be OCG (or something based on the kiwix bundler) run as a CLI (or a service) on a labs instance, and could provide access to the intermediate LaTeX source for full layout control. If you look at https://en.wikibooks.org/wiki/Wikibooks:Featured_books, all those are already linking to a PDF uploaded to commons. A community-maintained tool should fit well into this existing workflow, and would likely spur innovation in LaTeX-based book render tools.

Additionally, basically all books already provide a page transcluding all chapters into a single "printable" HTML page, which can be printed through the browser-based PDF render service. All but one featured wikibook I tried finish browser-based rendering in well under one minute [1]. The exception is https://en.wikipedia.org/wiki/Book:Health_care, which prints to 1900+ pages, and doesn't finish rendering using OCG either. A future iteration of collections or reading lists could follow the "printable page" approach as well. Alternatively, chapters could be converted to PDF individually, and PDF tools could be used to combine them to a single large PDF.

In any case, unless a team clearly makes the case for OCG as a product & takes on full ownership, I don't see how we can keep OCG in production in the medium term.

[1]: Timings from the hackathon notes, using a small labs instance:

  • Wikibooks LaTeX: 140 pages, ~16s render time in labs
  • Wikibooks C programming: 198 pages, ~12s render time in labs
  • Wikibooks Control Systems: 147 pages, 25s render time
  • Wikibooks Haskell: 440 pages, 36s render time
  • Enwiki Barack Obama: 47 pages, 5s render time
  • Book with loads of pages: Health Care (~1900 pages, static test copy): Does not render within timeout, but timeout triggers reliably & frees resources.

All this is fine. I just indicated that there needs to be someone who takes ownership of this books + pdf-rendering product, engages with the existing user base, communicates upcoming changes to the product (if, for example, OCG is being shut down), and adjusts expectations accordingly (one of the many things you outlined above). As the last person stuck with the OCG product, I don't want @cscott at the receiving end and getting embroiled in unnecessary heated debate if / when OCG shuts down. So, there needs to be a clear process and handling of this. Either both products stay alive in whatever limbo state or someone takes ownership of the full product space and charts a forward path. I have expressed clearly in an earlier email thread that OCG will be primarily in maintenance mode (barring minor incremental things at hackathons, etc.) and Scott shouldn't undertake major new development on OCG without there being a product owner for it. That is the extent of what I can do given the current state of affairs.

Just to be clear, I am not opposed to the new service. The output looks good and we've always known that browser-based rendering might be the solution going forward. Scott agrees as well. I was just suggesting that the path to getting to a single product isn't clear and unless someone takes up product ownership for this space, I don't know who will make the call for turning off OCG in favour of this new thing (unless it comes from users themselves).

I don't know who will make the call for turning off OCG in favour of this new thing (unless it comes from users themselves).

One important input for justifying spending donor money is usage data. We should learn fairly quickly whether most users are satisfied with browser-based rendering, and what & how frequent the remaining use cases & pain points are.

While this doesn't magically make the decision by itself, it should at least make it easier to come to a well-informed agreement.

To get information on the relative frequency of single-page vs. multi-page collection requests, I created a patch adding a statsd metric in OCG: https://gerrit.wikimedia.org/r/304043

I'm with Subbu here, and specifically:

All this is fine. I just indicated that there needs to be someone who takes ownership of this books + pdf-rendering product, engages with the existing user base, communicates upcoming changes to the product (if, for example, OCG is being shut down), and adjusts expectations accordingly (one of the many things you outlined above). As the last person stuck with the OCG product, I don't want @cscott at the receiving end and getting embroiled in unnecessary heated debate if / when OCG shuts down. So, there needs to be a clear process and handling of this. Either both products stay alive in whatever limbo state or someone takes ownership of the full product space and charts a forward path. I have expressed clearly in an earlier email thread that OCG will be primarily in maintenance mode (barring minor incremental things at hackathons, etc.) and Scott shouldn't undertake major new development on OCG without there being a product owner for it. That is the extent of what I can do given the current state of affairs.

Adding a new service into the mix now feels like working around the problem and offering two half-baked solutions to our users. It makes sense neither from a user/product perspective nor from an infrastructure perspective. It's great that Services is offering to own this, but they shouldn't feel the need to carry this burden alone, and we (incl. SRE) shouldn't operate yet another service without phasing out something else _and_ without offering anything new to users at the same time.

Finally, having more data about collections vs. single-page renderings is certainly very useful input (thanks @GWicke!), but I fear we're going to circle back to product ownership again. Who's going to interpret the data? Who's deciding where we draw the line and which threshold is low enough to kill collections? What if the numbers are too low because the product sucks and no one can use it, but is otherwise a good product to offer? These are all product decisions and we, in tech/infra, shouldn't decide on those matters, IMHO.

@ssastry The problem on the WMDE side is that I'm going to be on leave for 5 months starting in autumn, so product management capacity is scarce here these days.

However, I'm not sure if the introduction of an Electron service should be seen as a half-baked solution. The current situation is highly unsatisfactory to users. The introduction of the new service will not create a perfect-world situation immediately. But comparing the situation then to the situation now, I feel like people would see the new situation as a big improvement already - at least these were the responses when talking to the German community. Holding back on improving the situation (even if it is not perfect yet), does not seem like a good solution to me.

Coming back to the product ownership: Since there are people from different teams involved, maybe the solution would be some kind of a "triangle" ownership, held by all involved teams together?

@Lea_WMDE, we (the Services team) have already volunteered to take on technical and operational ownership of the service itself. However, what this project (or, rather, endeavour) lacks is product ownership and a clear plan and guidelines to move forward (complete with feature development, community engagement, etc). IMHO, it would not be good to split it amongst multiple teams and actors, as it would get us into the situation we are in right now with OCG (everybody is responsible, but nobody is at the same time).

If you take this on, would somebody from WMDE be able to take the lead during your absence?

@mobrovac Since it is vacation time here, I won't be able to answer that question very soon.

But what is other people's take on having a split ownership that works better than it does now for OCG?

I think we should avoid getting too far ahead of ourselves and making guesses about what the community will/will not find acceptable or "an improvement". Let's deploy some subset of the service to some subset of our users and find out what they actually think. In the best case they will identify the missing features they care most about and be motivated to help develop them (either within the OCG framework or the Electron framework).

To get information on the relative frequency of single-page vs. multi-page collection requests, I created a patch adding a statsd metric in OCG: https://gerrit.wikimedia.org/r/304043

@cscott merged & deployed this today. The metrics are available at https://grafana-admin.wikimedia.org/dashboard/db/collection_use. So far, about 97% of PDF requests are for single pages.

Great! As I said on IRC earlier today, it would be good to separate stats by user-agent, i.e. what proportion of these requests are bot / crawler driven?

GWicke triaged this task as Medium priority. (Oct 12 2016, 6:00 PM)
GWicke edited projects, added Services (blocked); removed Services (next).

New-Readers is also investigating prototypes using PDFs for offline support, FYI.

EDIT: the way the service is accessed via MediaWiki is different than what I understood discussing it on IRC, and it's not relevant here anymore.

GWicke claimed this task.

The electron render service is deployed & exposed via the REST API: https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_pdf_title

Thanks to everyone who contributed!