
Trial replacing Electron with headless Chromium in the render service
Closed, Resolved · Public · 13 Story Points

Description

We'd like to know whether headless Chromium can handle production traffic better than Electron can, in terms of stability and resource consumption. We (Readers Web) hope to switch out Electron for headless Chromium and monitor its performance in production for two weeks.

An initial complication is that the service isn't a Node.js HTTP server that drives an instance of Electron; it's a command-line Electron app that also runs an HTTP server. Fortunately, the service has a well-defined, small interface.

Approaches

1. Add a headless Chromium driver to the service

As noted above, the service is a command-line Electron app that also runs an HTTP server. A lot of the basic infrastructure required to serve ~60k requests/day is already in place, like a fixed-size renderer pool with an internal queue (sketched below).
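For illustration only, a minimal sketch of the fixed-size-pool-with-queue pattern; RendererPool and createRenderer are hypothetical names, not the existing service's API:

```js
// Hypothetical sketch of a fixed-size renderer pool with an internal queue.
class RendererPool {
    constructor(size, createRenderer) {
        this.idle = [];
        this.waiting = [];
        for (let i = 0; i < size; i++) {
            this.idle.push(createRenderer());
        }
    }

    // Resolves with a renderer immediately if one is idle,
    // otherwise queues the caller until release() frees one.
    acquire() {
        if (this.idle.length > 0) {
            return Promise.resolve(this.idle.pop());
        }
        return new Promise((resolve) => this.waiting.push(resolve));
    }

    release(renderer) {
        const next = this.waiting.shift();
        if (next) {
            next(renderer);
        } else {
            this.idle.push(renderer);
        }
    }
}
```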

Pros
  • It's deployed!
    • i.e. it's already had initial security review by Security and Services. New additions and libraries will need additional review but this is, generally speaking, less work for all parties.
  • The prerequisite infrastructure is already in place (renderer pooling, queuing, timeouts, basic performance monitoring).
  • Esoteric rendering errors have already been handled!
Cons
  • Since the service is specific to Electron, there's little separation of concerns between the Electron-specific and the Node.js HTTP server-specific code – hashtag technical debt
    • Introducing/rolling back changes is a little trickier.
  • The service is very general but is only used in one specific way. Shimming in a single-purpose renderer might be awkward without paring down the service first.

2. Create a new service that drives headless Chromium

Pros
  • A neat switcharoo!
    • We could even forward X% of requests at the RESTBase level so that we can do science to it.
  • Less complexity as it only has to implement the interface.
  • Easier to iterate on.
  • The service can and should be built atop wikimedia/service-template-node.
Cons
  • Will require additional review from Security and Services in order to deploy it.

3. Create a new service that's a slave to the existing service

In T176627#3636569, @pmiazga suggests that we can create the new service in #2 but slave it to the existing service so that it does the same work but its output isn't ever consumed by the user.

Pros
  • No impact on existing UX (good or bad) while we're trialling the service.
  • We can compare the performance of the two services side-by-side.
  • Easier to iterate on.
Cons
  • Will require additional review from Security and Services in order to deploy it.
    • This approach may have less impact but this review still blocks deployment.
  • May require changes to the existing service unless we can proxy requests to both services.

Immediate Outcomes

Notes

  1. The Electron render service source is here: https://github.com/wikimedia/mediawiki-services-electron-render

Related Objects

Event Timeline


During this sprint, we're going to be discussing the approaches (reviewing T172815: Improve stability and maintainability of our browser-based PDF render service and likely reviewing the existing codebase) and breaking down the work into a plan that we can agree on and can present to @ovasileva.

I'd like to point out that option #1 will probably require a security review too. For that reason, we should go with option #2 given its pros and cons. I'll create a proof-of-concept repo for option #2.

phuedx added a comment (edited). Oct 5 2017, 1:33 PM

@bmansurov: I would recommend basing your work atop the wikimedia/service-template-node repo.

@phuedx, thanks. That's what I'm doing.

Here's what I came up with: https://github.com/kodchi/mediawiki-services-headless-chromium

As a next step we should look into deploying this to a staging server.

Does anyone know what version of Node is supported in our infrastructure? puppeteer requires at least version 6.4.0, and the above repo uses async/await, so we'll need at least version 7.6.0. How flexible are we in installing the required Node version?

phuedx added a comment. Oct 6 2017, 9:44 AM

Does anyone know what version of Node is supported in our infrastructure? puppeteer requires at least version 6.4.0, and the above repo uses async/await, so we'll need at least version 7.6.0. How flexible are we in installing the required Node version?

I can't find anything on Wikitech around the production environment, so that's a question for Services (@GWicke is already subscribed to this task). OTOH you can install what you need on a Wikimedia Cloud VPS instance 🙂

Thanks, @phuedx. Our Cloud project has reached its limit of 8 instances and requesting a new project will take some time I guess. Or is there a faster way?

Also, isn't the goal to push this to production so that we can stress test it? Or are we fine with having it on staging server and testing it internally?

phuedx added a comment. Oct 6 2017, 3:26 PM

Thanks, @phuedx. Our Cloud project has reached its limit of 8 instances and requesting a new project will take some time I guess. Or is there a faster way?

Might be time to do an audit. Can you send an email to reading-web-team@ and ask if there's an instance that can be removed?

Also, isn't the goal to push this to production so that we can stress test it? Or are we fine with having it on staging server and testing it internally?

I think stress testing with production traffic is the target. However, we have to acknowledge that'll require security review and help from Services to get the service deployed.

There's a bit of spare capacity in the Mobile project, and a good amount if the perf-testing instance there is no longer needed. I can add folks as members/admins as needed if you want to go that route.

Does anyone know what version of Node is supported in our infrastructure? puppeteer requires at least version 6.4.0, and the above repo uses async/await, so we'll need at least version 7.6.0. How flexible are we in installing the required Node version?

We're currently on 6.11. I believe the plan is to upgrade to node 8 when it hits LTS (scheduled for 2017-10-31) though of course @GWicke is the authority on that.

Thanks, @Mholloway. I've found space in our VPS project for now. Also, thanks for the info on node. I will re-write my code to be compatible with it, so that we're not blocked on upgrading node.

bmansurov added a comment (edited). Oct 6 2017, 10:52 PM

Status update: I've set up http://chromium-pdf.wmflabs.org with Node.js as a reverse proxy to node running as a pm2 process. But I'm unable to render a PDF at http://chromium-pdf.wmflabs.org/en.wikipedia.org/v1/pdf/Berlin/Letter for example. I'll see on Monday if I can find the cause.

Error:

# debian jessie, node 6

{"name":"mediawiki-services-headless-chromium","hostname":"chromium-pdf","pid":17789,"level":60,"err":{"message":"kill ESRCH","name":"Error","stack":"Error: kill ESRCH\n at exports._errnoException (util.js:1018:11)\n at process.kill (internal/process.js:190:13)\n at killChrome (/srv/mediawiki-services-headless-chromium/node_modules/puppeteer/node6/Launcher.js:148:19)\n at Function.<anonymous> (/srv/mediawiki-services-headless-chromium/node_modules/puppeteer/node6/Launcher.js:134:7)\n at throw (native)\n at step (/srv/mediawiki-services-headless-chromium/node_modules/puppeteer/node6/Launcher.js:62:24)\n at Promise.resolve.then.err (/srv/mediawiki-services-headless-chromium/node_modules/puppeteer/node6/Launcher.js:76:15)\n at process._tickCallback (internal/process/next_tick.js:109:7)","code":"ESRCH","errno":"ESRCH","syscall":"kill","levelPath":"fatal/service-runner/unhandled"},"msg":"kill ESRCH","time":"2017-10-09T15:21:37.320Z","v":0}

phuedx added a comment. Oct 9 2017, 4:02 PM

@bmansurov: It looks like the Chromium process has exited before puppeteer tries to kill it.

Thanks, @phuedx. It's all good now. I've disabled sandboxing. We'll have to use firejail for that purpose.
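For reference, "disabled sandboxing" here means launching Chromium with its own sandbox switched off, along these lines (a sketch, not the repo's actual code); confinement would then come from firejail around the whole process instead:

```js
const puppeteer = require('puppeteer');

// Launch Chromium without its built-in sandbox; firejail would then
// confine the whole service process instead.
puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] })
    .then((browser) => browser.close());
```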

phuedx added a comment. Oct 9 2017, 4:47 PM

Just a note that we might want to name the Gerrit repository mediawiki-services-render-chromium so that it mirrors the existing service.

phuedx added a comment (edited). Oct 9 2017, 5:42 PM

@bmansurov asked about next steps in today's standup meeting. I suggested that we might try playing some production traffic against the server. We could fetch a sample of production traffic from the wmf.webrequest table in Hive and then make equivalent requests to the service.

bmansurov added a comment (edited). Oct 9 2017, 7:57 PM

Just a note that we might want to name the Gerrit repository mediawiki-services-render-chromium so that it mirrors the existing service.

Done

Edit: the service should have been named mediawiki/services/chromium-render similar to https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/electron-render. I have requested to rename the project.

@bmansurov asked about next steps in today's standup meeting. I suggested that we might try playing some production traffic against the server. We could fetch a sample of production traffic from the wmf.webrequest table in Hive and then make equivalent requests to the service.

I've had some issues running Hive queries, so I took some articles from the list of long pages. The pages I tried may not be exactly the pages Electron had trouble with, but they were long pages. For example, List_of_compositions_by_Franz_Schubert turned out to be an 82-page PDF. I also tried finding pages that had images.

I also took some pages from Category:Pages_with_too_many_photos and the results were encouraging. For example, History_of_construction and Wells,_Somerset rendered very quickly (around 3 seconds).

Pages with math formulas also render quickly. I tested some pages from List_of_equations with similarly fast results.

bmansurov added a comment (edited). Oct 9 2017, 9:48 PM

Having done #2 from the description, I think we should go with option #1. The problem with option #2 is that we have to write a lot of boilerplate just to get to the place where RESTBase already is. We also have to have repositories created and have them reviewed. With option #1, we can still have clean separation from Electron, as we'll be able to create a separate endpoint for headless Chromium. We'll also only have to have headless Chromium security-reviewed (as we're not creating a new repository). Ops will also have one fewer service to worry about maintaining.

That's why I propose we port the new service to RESTBase. What do you all think?

Edit: Having talked to @phuedx, I think it's best to leave the service as is; I had confused adding it to the existing Electron service with adding it to RESTBase.

IRC scrollback for @bmansurov's edited comment above:

14:36:57 <phuedx|afk> bmansurov: do you mean implementing it as part of mediawiki-services-render-electron or actually in restbase?
14:37:47 <bmansurov> i thought restabase
14:38:28 <phuedx|afk> i don't think services would buy that -- restbase is a caching proxy
14:38:46 <phuedx|afk> i don't think they'd like making it launch chromium :D
14:39:00 <bmansurov> i see, then doing it as part of electron makes little sense
14:39:09 <bmansurov> i think we should keep the existing service then

My concerns around https://phabricator.wikimedia.org/T176627#3636032 still apply but it sounds like we will seek security review regardless so that's good (and we can control the deploy of the service).

That aside, the repo is called mediawiki-services-electron-render, so it may add confusion if we remove all the Electron-related code from the service. I'm not sure how easy it is to rename an existing service later on. A more generic name such as mediawiki-services-pdf-printing would have been a better choice in hindsight.

The problem with option #2 is that we have to write a lot of boilerplate just to get to the place where RESTBase already is

It's quite straightforward - mostly a clone. When building https://github.com/wikimedia/mediawiki-services-trending-edits, Services were quite helpful and did most of the tricky stuff. @bearND may also be able to help. We may also benefit from using the more up-to-date boilerplate by allowing ourselves to use ES6. This may save us hassle in the long term.

Thanks for chiming in, Jon. I think the repo is in a good state for now. I'm moving the task to blocked as we're waiting for security review of the repository and puppeteer.

I'll request concept/design review from Services during tomorrow's Reading API/Services Sync meeting.

Change 383831 had a related patch set uploaded (by Phuedx; owner: Phuedx):
[mediawiki/extensions/Popups@master] gateway/rest: Handle large "small" thumbnails

https://gerrit.wikimedia.org/r/383831

^ Ignore the above…

The timeline for getting this trial service reviewed and deployed on production hardware is lengthy. Quoting wt:Services/FirstDeployment:

note that you should allow a minimum of at least two to three weeks cadence

IIUC the security review is a pre-requisite, so we're likely looking at more than a month to find out how this service runs on production hardware.

Is there a cheaper way to find out? I've jokingly suggested renting a hefty VPS for 30 minutes but is that an option? /cc @GWicke @mobrovac @Pchelolo

Pchelolo added a comment (edited). Oct 12 2017, 2:01 PM

Is there a cheaper way to find out? I've jokingly suggested renting a hefty VPS for 30 minutes but is that an option? /cc @GWicke @mobrovac @Pchelolo

You can always set up a machine in labs and add a public address to it just like we do with http://appservice.wmflabs.org and use that for testing. The load shouldn't be too high with testing right?

I went ahead and created pdfservice.services.eqiad.wmflabs for you exposed to the web under pdfservice.wmflabs.org

@bmansurov: Following on from T176627#3670765, I think we and Services need more detail. We should be providing median and 95/99th percentile timings for the service rendering S/M/L/XL/XL… pages and resource consumption on the server during those test runs. Moreover, these tests should be repeatable (i.e. there's a script that anyone can run [and at different times!]).

We all acknowledge that we'd have to take the results with a grain of salt as we'd be using a VPS-hosted instance but they'll be helpful in the interim while we're trying to get the service into production.
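Something like the following could serve as the repeatable script (a sketch only; the page list, sample size, and endpoint shape are assumptions drawn from this thread, not an agreed test plan):

```js
const http = require('http');

const SERVICE = 'http://chromium-pdf.wmflabs.org';
const PAGES = ['Berlin', 'History_of_construction', 'List_of_compositions_by_Franz_Schubert'];
const RUNS_PER_PAGE = 20;

// Time a single request to the render service.
function timeRequest(path) {
    return new Promise((resolve, reject) => {
        const start = Date.now();
        http.get(SERVICE + path, (res) => {
            res.on('data', () => {});                      // drain the body
            res.on('end', () => resolve(Date.now() - start));
        }).on('error', reject);
    });
}

function percentile(sortedTimings, p) {
    const i = Math.min(sortedTimings.length - 1, Math.floor(sortedTimings.length * p));
    return sortedTimings[i];
}

// Render each page RUNS_PER_PAGE times, sequentially, so runs don't skew each other.
PAGES.reduce((chain, page) => chain.then((results) => {
    let runs = Promise.resolve([]);
    for (let i = 0; i < RUNS_PER_PAGE; i++) {
        runs = runs.then((timings) =>
            timeRequest('/en.wikipedia.org/v1/pdf/' + page + '/letter')
                .then((t) => timings.concat(t)));
    }
    return runs.then((timings) => {
        timings.sort((a, b) => a - b);
        results[page] = {
            median: percentile(timings, 0.5),
            p95: percentile(timings, 0.95),
            p99: percentile(timings, 0.99)
        };
        return results;
    });
}), Promise.resolve({}))
    .then((results) => console.log(JSON.stringify(results, null, 2)))
    .catch((err) => { console.error(err); process.exit(1); });
```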

@Pchelolo, thanks for setting up pdfservice.wmflabs.org. We also have chromium-pdf.wmflabs.org. Should we be using the instance you created? Does it resemble a production environment more closely than a regular Debian Jessie instance?

@bmansurov: Following on from T176627#3670765, I think we and Services need more detail. We should be providing median and 95/99th percentile timings for the service rendering S/M/L/XL/XL… pages and resource consumption on the server during those test runs. Moreover, these tests should be repeatable (i.e. there's a script that anyone can run [and at different times!]).

We all acknowledge that we'd have to take the results with a grain of salt as we'd be using a VPS-hosted instance but they'll be helpful in the interim while we're trying to get the service into production.

I remember us agreeing during standup that it wouldn't make much sense to measure these numbers on a labs instance. What has changed since we last talked?

phuedx added a comment (edited). Oct 12 2017, 6:16 PM

What has changed since we last talked?

We now know that we're looking at ~1 month to get this service deployed and potentially longer to get it undeployed if headless Chromium isn't a viable replacement for Electron. If we have an indicator ahead of time, then we might be able to save folks (including us) a lot of time.

Services have also requested that we do this while we begin working on deploying it.

It might be worth focusing more on robustness than simple-page latency, as that is the more critical issue with Electron. Previously, I tested with a few very large articles (see T142226#2537844). This tested timeout enforcement. Testing with a simulated overload (many concurrent requests for huge pages) could also be useful to ensure that concurrency limits and resource usage limits are thoroughly enforced.
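For the overload part, a rough sketch along those lines (the concurrency level, article, and endpoint are assumptions for illustration):

```js
const http = require('http');

const CONCURRENCY = 50;  // assumed flood size for the test
const URL = 'http://chromium-pdf.wmflabs.org/en.wikipedia.org/v1/pdf/' +
            'List_of_compositions_by_Franz_Schubert/letter';

// Fire one request and report its outcome rather than throwing,
// so a single failure doesn't abort the whole run.
function probe() {
    return new Promise((resolve) => {
        http.get(URL, (res) => {
            res.on('data', () => {});
            res.on('end', () => resolve(res.statusCode));
        }).on('error', (err) => resolve('ERR:' + err.code));
    });
}

const probes = [];
for (let i = 0; i < CONCURRENCY; i++) {
    probes.push(probe());
}

Promise.all(probes).then((outcomes) => {
    // A robust service should answer every request with 200 or a clean
    // error (e.g. 503/timeout) and never hang or exhaust the host.
    console.log(outcomes.join(' '));
});
```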

Left some comments regarding the code: https://github.com/kodchi/mediawiki-services-chromium-render/commit/d02e5a57ec1e986f0992edaf6b8c8169d13b5203

Could you use pull requests to make changes? Even if you self-merge them instantly it's much easier to code-review with pull requests.

Does anyone know what version of Node is supported in our infrastructure? puppeteer requires at least version 6.4.0, and the above repo uses async/await, so we'll need at least version 7.6.0. How flexible are we in installing the required Node version?

This might be a problem in the prod environment. On SCB right now all the services share the same node.js installation and right now we're still on version 6. Although 8 is becoming LTS very soon and we do have plans for upgrading it, the upgrade itself is a pretty involved and complicated project, so don't expect us to do it instantly.

The minimum requirement for puppeteer is now Node v6.4.0. Our production environment runs v6.11.1, so we are good there as long as you use promises instead of async/await.
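For illustration, a render path written with plain promise chaining (no async/await), so it runs on Node 6; names like articleToPdf are illustrative, not the repo's actual functions:

```js
const fs = require('fs');
const puppeteer = require('puppeteer');

// Render a URL to a PDF file using promises only (Node 6 friendly).
function articleToPdf(url, outFile) {
    let browser;
    return puppeteer.launch({ args: ['--no-sandbox'] })
        .then((b) => { browser = b; return browser.newPage(); })
        .then((page) => page.goto(url).then(() => page.pdf({ format: 'A4' })))
        .then((pdf) => {
            fs.writeFileSync(outFile, pdf);
            return browser.close();
        })
        .catch((err) => {
            // Always clean up the Chromium process, even on failure.
            const close = browser ? browser.close() : Promise.resolve();
            return close.then(() => { throw err; });
        });
}

articleToPdf('https://en.wikipedia.org/api/rest_v1/page/html/Berlin', 'Berlin.pdf')
    .catch((err) => console.error(err));
```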

Left some comments regarding the code: https://github.com/kodchi/mediawiki-services-chromium-render/commit/d02e5a57ec1e986f0992edaf6b8c8169d13b5203

Could you use pull requests to make changes? Even if you self-merge them instantly it's much easier to code-review with pull requests.

Does anyone know what version of Node is supported in our infrastructure? puppeteer requires at least version 6.4.0, and the above repo uses async/await, so we'll need at least version 7.6.0. How flexible are we in installing the required Node version?

This might be a problem in the prod environment. On SCB right now all the services share the same node.js installation and right now we're still on version 6. Although 8 is becoming LTS very soon and we do have plans for upgrading it, the upgrade itself is a pretty involved and complicated project, so don't expect us to do it instantly.

Thanks, I'll submit fixes soon. Btw, we've moved to gerrit and the new repo lives at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/chromium-render. I'll ping you once the patch is ready.

Thanks, I'll submit fixes soon. Btw, we've moved to gerrit and the new repo lives at https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/services/chromium-render. I'll ping you once the patch is ready.

Perfecto! Just add me to the reviewers (it's Ppchelko)

Change 384732 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Untangle promises

https://gerrit.wikimedia.org/r/384732

I'm replying to some of the comments from github here:

  1. Re launching Chromium on service start-up: I think that's possible, but won't we have problems similar to the ones the Electron service is having (described at T176627#3651737)?
  2. Re mixing styles: this has been addressed at https://gerrit.wikimedia.org/r/#/c/384073/ and https://gerrit.wikimedia.org/r/384732.

I'm replying to some of the comments from github here:

  1. Re launching Chromium on service start-up: I think that's possible, but won't we have problems similar to the ones the Electron service is having (described at T176627#3651737)?

Yes, indeed. Having a fixed pool of Chromium instances might lead to the same problems. Instead, spawning a new process on every request (~100ms) should be an acceptable price to pay for improved robustness. That said, the following will need to be controlled (a rough sketch follows the list below):

  • resource consumption (memory, CPU): ideally, this should be controlled by firejail-ing the spawned sub-processes, but we can defer this point until later
  • time-boxed execution: there should be strong guarantees that a given Chromium render cycle (i.e. the sub-process' life) will not last longer than X seconds (configurable)
  • number of spawned Chromium instances: the service should track the number of currently-open Chromium instances and not allow more than a configured maximum to be spawned; should there be a request waiting to be fulfilled, it should be postponed until an instance slot becomes available, or rejected if the maximum wait time has elapsed.
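A rough sketch of the latter two controls (time-boxed execution and a concurrency cap); the constants and the renderWithLimits name are hypothetical, and a real implementation would also need to kill the Chromium process when the time box expires rather than merely rejecting the promise:

```js
const puppeteer = require('puppeteer');

const MAX_CONCURRENT_RENDERS = 4;      // configurable maximum
const RENDER_TIMEOUT_MS = 60 * 1000;   // configurable time box ("X seconds")
let activeRenders = 0;

function renderWithLimits(url) {
    if (activeRenders >= MAX_CONCURRENT_RENDERS) {
        // A real service would queue this (and reject only after a maximum
        // wait); rejecting immediately keeps the sketch short.
        return Promise.reject(new Error('Too many concurrent renders'));
    }
    activeRenders += 1;

    let browser;
    const render = puppeteer.launch({ args: ['--no-sandbox'] })
        .then((b) => { browser = b; return browser.newPage(); })
        .then((page) => page.goto(url).then(() => page.pdf({ format: 'A4' })));

    // Whichever settles first wins; on timeout the browser must still be
    // closed so the Chromium process does not outlive the time box.
    const timeout = new Promise((resolve, reject) => {
        setTimeout(() => reject(new Error('Render timed out')), RENDER_TIMEOUT_MS);
    });

    const cleanup = () => {
        activeRenders -= 1;
        if (browser) {
            browser.close();
        }
    };

    return Promise.race([render, timeout]).then(
        (pdf) => { cleanup(); return pdf; },
        (err) => { cleanup(); throw err; }
    );
}
```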
pmiazga added a comment (edited). Nov 1 2017, 6:58 PM

@bmansurov this service doesn't work when you have a / in the page title - example: https://en.wikipedia.org/wiki/CP/M-86.

On a GET request to http://localhost:3030/en.wikipedia.org/v1/pdf/CP/M-86/letter I get a "Cannot GET /en.wikipedia.org/v1/pdf/CP/M-86/letter" error. Is this something we can quickly fix? It would be helpful when we want to test with WMF traffic.

Tests were done on the latest patch, which includes changing Letter and A4 to lowercase.

pmiazga added a comment (edited). Nov 1 2017, 7:50 PM

@Jdlrobson it makes sense, but if I go with the encoded title I get a PDF with an error:

{
  "type":"https://mediawiki.org/wiki/HyperSwitch/errors/bad_request",
  "title":"Invalid parameters",
  "method":"get",
  "detail":"data.params.revision should be an integer",
  "uri":"/en.wikipedia.org/v1/page/html/CP/M-86"
}

Change 387871 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Escape article title before sending it to RESTBase

https://gerrit.wikimedia.org/r/387871

Change 384732 merged by Bmansurov:
[mediawiki/services/chromium-render@master] Untangle promises

https://gerrit.wikimedia.org/r/384732

Change 387871 merged by Bmansurov:
[mediawiki/services/chromium-render@master] Escape article title before sending it to RESTBase

https://gerrit.wikimedia.org/r/387871

Change 387871 merged by Bmansurov:
[mediawiki/services/chromium-render@master] Escape article title before sending it to RESTBase

https://gerrit.wikimedia.org/r/387871

Just encoding the title is not enough, as that may provoke redirects. MW has a specific way of "encoding" titles, e.g. Main Page is not Main%20Page but Main_Page, etc. In order to work with the titles properly, you will need to use the mediawiki-title package. For examples on how to use it, cf. Mobile Content Service's logic.

Mental note: this should probably become part of service-template-node .
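For a sense of what's involved, a deliberately naive sketch of the normalization (not the mediawiki-title API, which also handles namespaces, first-letter casing, and other MediaWiki title rules):

```js
// Naive title normalization; mediawiki-title should be used instead, as it
// also handles namespaces, casing and other MediaWiki rules.
function normalizeTitle(title) {
    // MediaWiki titles use underscores rather than spaces...
    const dbKey = title.trim().replace(/ /g, '_');
    // ...and the "/" in titles such as "CP/M-86" has to be percent-encoded
    // so it isn't treated as a path separator by the router.
    return encodeURIComponent(dbKey);
}

console.log(normalizeTitle('Main Page')); // Main_Page
console.log(normalizeTitle('CP/M-86'));   // CP%2FM-86
```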

Change 387871 merged by Bmansurov:
[mediawiki/services/chromium-render@master] Escape article title before sending it to RESTBase

https://gerrit.wikimedia.org/r/387871

Just encoding the title is not enough, as that may provoke redirects. MW has a specific way of "encoding" titles, e.g. Main Page is not Main%20Page but Main_Page, etc. In order to work with the titles properly, you will need to use the mediawiki-title package. For examples on how to use it, cf. Mobile Content Service's logic.

Mental note: this should probably become part of service-template-node .

Doesn't RESTBase do this kind of transformation itself? For example, when I ping https://en.wikipedia.org/api/rest_v1/page/html/Main%20Page I get redirected to https://en.wikipedia.org/api/rest_v1/page/html/Main_Page.

The service itself is there and works perfectly fine; there are minor issues with it, but we can fix those in follow-up patches. Currently, we have like 2-3 tasks related to the Chromium rendering service that depend on it. I'm calling this task done; let's create separate tickets for anything related to it.

phuedx added a comment. Nov 2 2017, 5:39 PM

🎉🎉🎉

phuedx closed this task as Resolved (edited). Nov 6 2017, 3:56 PM

Being bold.

As @pmiazga says in T176627#3730213, there are a few minor issues that can be tracked in follow-on tasks.