
Chromium-render doesn't handle browser connection abort well
Closed, Resolved · Public · 3 Estimated Story Points

Description

Following on from our work stress testing the service, there are some cases where we do not cleanly abort unneeded render jobs. This consumes memory unnecessarily, and we are worried it might impact the reliability of the service.

While stress testing I found that when Puppeteer closes the browser connection, two things happen:

  • sometimes we log a false-negative error after Puppeteer closes the browser connection. This is because Puppeteer still tries to manage Chromium, but the browser is already closed, so it fails with an error (there are 3 or 4 distinct errors it may throw). We shouldn't log those errors, as they might falsely indicate that something is wrong (see the sketch after this list). Fixed as part of https://gerrit.wikimedia.org/r/#/c/393664/
  • sometimes, after we abort the job, it still renders and after some time logs "job completed", which means the job wasn't actually aborted.
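To illustrate the first point, a minimal sketch of the suppression approach (the flag and logger names are illustrative, not the actual patch):

    let closingIntentionally = false;

    function closeBrowser(browser, logger) {
        closingIntentionally = true;
        return browser.close().catch((err) => {
            if (closingIntentionally) {
                // Expected: Puppeteer may still try to manage the browser
                // we just closed, so don't report this as a real error.
                logger.debug(`Ignoring post-close error: ${err.message}`);
            } else {
                logger.error(err);
            }
        });
    }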

Developer notes

We estimated this at 3s and 5s, as there is a little risk involved here.
We agreed, however, that we should not get too deep into this task. If we discover it is more complex than a 3, we should take a step back as a team and reconsider the task, either creating a spike to do more analysis or re-evaluating its user value.

QA steps

To simplify the QA process I made a small tool, API Link generator. Instead of constructing the URL yourself, you can fill in the fields and the tool will create an API link for you, with a button to get the PDF. The tool also takes into account that the Beta Cluster Proton instance can only print articles from the Beta Cluster wiki.

The easiest way to test the tool is to clone the repository locally, install Chromium, then edit the config.dev.yaml file and change render_concurrency to something bigger than 1 (5 should be fine):

npm install
npm run start

Then use the tool and point it at the local dev instance. With that approach the dev install will be able to print articles from production. The Beta Cluster instance doesn't have all articles/styles/templates/etc.; it can safely be used for smoke tests, but for anything more rigorous we should use the production instance (not present yet) or the dev instance.

Note: if you use Chrome, or if you keep Chromium somewhere else, please edit the config.dev.yaml file and change executablePath to point to your local Chromium binary (/usr/bin/chromium is the default path).
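For reference, the relevant options in config.dev.yaml would look something like this (the two option names come from the notes above; their exact nesting inside the file is an assumption):

    render_concurrency: 5                # allow up to 5 concurrent renders
    executablePath: /usr/bin/chromium    # path to your Chromium/Chrome binary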

Note 2: chromium-render outputs a very useful log: it tells you when the job got into the queue, when it started rendering, and when it finished. Keep an eye on the log; it will be helpful when, for example, you need to abort a render.

Note 3: For testing it will be useful to print very long articles (ones that take ~5-10s to render); I used some articles from the Long pages list.

Things to test

First, create a couple of URLs (for different articles) using the API Link generator. The easiest way to test this task is to use a command-line tool like wget or curl, which makes it much easier to start and stop renders.
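For example (the route shape shown here is an assumption; use whatever URL the API Link generator produces):

    curl -o Earth.pdf 'http://localhost:3030/en.wikipedia.org/v1/pdf/Earth/a4'
    # press Ctrl+C while the job is queued or rendering to abort the request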

  • try to generate PDFs for a couple of random articles, check that the PDFs are rendered correctly, and verify that after each render finishes all resources are freed (memory is released and there are no chromium processes left in the background)
  • do the same for concurrent renders (try to download the URLs in different tabs/terminals at once); all renders should output correct PDF files, and the chromium processes should exit by themselves
  • try to generate a PDF and abort while it's still in the queue (it has not started rendering); the job should leave the queue immediately, and no chromium browser should be spawned
  • try to generate a PDF and abort while it's in the rendering state; the process should stop the chromium browser
  • keep the service running and rendering for some time (an hour or so; for constant load you can use tools like siege or jmeter, as in the example below), and verify that there are no zombie processes and that no chromium processes remain once you stop rendering
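If you want constant load for the soak test, a siege invocation like this works (the URL is a placeholder; point it at a link from the API Link generator):

    siege -c 5 -t 1H 'http://localhost:3030/en.wikipedia.org/v1/pdf/Earth/a4'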

In short, we want to test that after both successful and aborted renders the script cleans up the environment properly: memory is freed and no chromium processes remain running in the background.
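Between test runs, a quick way to verify the cleanup with standard tools:

    pgrep -af chromium || echo 'no chromium processes left'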

Event Timeline


Upstream described the rubric they use to prioritise bugs and clarified that they consider it low priority: https://github.com/GoogleChrome/puppeteer/issues/1505#issuecomment-393470731

If this continues to be an issue, then we could work around it on our side by either:

  • flagging the job as cancelled and dropping any further signals from Puppeteer; or
  • fixing the upstream bug ourselves.
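A minimal sketch of the first option (the job-tracking names are illustrative, not Proton's actual internals):

    const cancelledJobs = new Set();

    function abortJob(jobId) {
        cancelledJobs.add(jobId);
    }

    function onPuppeteerSignal(jobId, handleSignal) {
        if (cancelledJobs.has(jobId)) {
            // The job was cancelled; drop anything Puppeteer still emits
            // for it instead of logging spurious errors.
            return;
        }
        handleSignal();
    }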

There should be progress on this before we enter the full production stage, since keeping Chromium instances running while Proton thinks the resources are free can quickly lead to resource-starvation scenarios at our scale.

@mobrovac We submitted an issue to the Puppeteer library: https://github.com/GoogleChrome/puppeteer/issues/1505 but they said it's a low priority for them. We can try to fix the Puppeteer lib ourselves, but I'm not sure if Chromium itself allows stopping the PDF rendering. The second approach (the cruel one) would be to just kill the given Chromium instance, but I'm not sure whether this approach has side effects.

We will log the number of aborted requests (we keep separate stats for requests still waiting in the queue and requests that have already started rendering). If the percentage of aborted requests that are already in the render state is higher than X% (we have to find X; for now, let's say 10%), then we have to fix this issue.
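As a sketch of that accounting (counter names are illustrative):

    const stats = { started: 0, abortedInQueue: 0, abortedInRender: 0 };

    // Percentage of started renders that were aborted mid-render.
    function inRenderAbortPercentage() {
        return stats.started === 0 ? 0 : 100 * stats.abortedInRender / stats.started;
    }

    if (inRenderAbortPercentage() > 10) {
        // above the X% threshold: we have to fix the abort handling itself
    }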

> @mobrovac We submitted an issue to the Puppeteer library: https://github.com/GoogleChrome/puppeteer/issues/1505 but they said it's a low priority for them. We can try to fix the Puppeteer lib ourselves, but I'm not sure if Chromium itself allows stopping the PDF rendering. The second approach (the cruel one) would be to just kill the given Chromium instance, but I'm not sure whether this approach has side effects.

Right, but it's not a low priority for us, so we need to do something about it. IMHO trying to fix it in Puppeteer first and submitting a PR upstream is the way forward here, especially because upstream hasn't rejected the bug, which means they'd gladly accept fixes for it ;)

> We will log the number of aborted requests (we keep separate stats for requests still waiting in the queue and requests that have already started rendering). If the percentage of aborted requests that are already in the render state is higher than X% (we have to find X; for now, let's say 10%), then we have to fix this issue.

I don't think that's acceptable. At our scale, resource starvation is not to be considered an edge case, especially for heavy workloads such as PDF generation. We have already had cases of services starving resources, which always come unexpectedly and take a lot of time and effort to deal with in real time (the most recent occurrence was T199813, which managed to starve memory on nodes with a simple and lightweight get-message/send-message workflow). Therefore, since here we are aware of the problem, we absolutely have to deal with it before starting to send live traffic to Proton.

@mobrovac as an easy/hacky fix for now, we can use Browser.process() (kudos to @phuedx for finding that) to get the Chromium child process and kill it.
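Something along these lines (a sketch only; the function name and grace period are illustrative, though the 3s matches the patch below):

    async function closeOrKillBrowser(browser, graceMs = 3000) {
        const child = browser.process(); // the Chromium child process, or null
        let exited = false;
        if (child) {
            child.once('exit', () => { exited = true; });
        }
        // Ask Puppeteer to close cleanly; the connection may already be gone.
        await browser.close().catch(() => {});
        if (child) {
            setTimeout(() => {
                if (!exited) {
                    child.kill('SIGKILL'); // the cruel part: force-kill Chromium
                }
            }, graceMs).unref();
        }
    }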

Change 451793 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] To prevent resource starvation, try to kill chromium after 3s

https://gerrit.wikimedia.org/r/451793

Change 451817 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Handle the stopping chromium rendering correctly

https://gerrit.wikimedia.org/r/451817

Regarding T181623#4495059, can anyone in Readers Web review @pmiazga's changes? I'll also ask during today's standup ritual.

Notes from yesterday's standup ritual:

Change 451793 merged by Mobrovac:
[mediawiki/services/chromium-render@master] To prevent resource starvation, try to kill chromium after 3s

https://gerrit.wikimedia.org/r/451793

Blocked; we're waiting for @mobrovac to come back from vacation.

Change 451817 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Handle the stopping chromium rendering correctly

https://gerrit.wikimedia.org/r/451817

Mentioned in SAL (#wikimedia-operations) [2018-09-12T06:56:54Z] <mobrovac@deploy1001> Started deploy [proton/deploy@ecb9a0e]: Update to Puppeteer v1.7.0 and fix browser connection abort handling - T181623

Mentioned in SAL (#wikimedia-operations) [2018-09-12T06:57:52Z] <mobrovac@deploy1001> Finished deploy [proton/deploy@ecb9a0e]: Update to Puppeteer v1.7.0 and fix browser connection abort handling - T181623 (duration: 00m 58s)

Looks like this task requires technical QA - @Jdlrobson, @Niedzielski, @Jdrewniak, @nray - can one of you pick it up?

I've tested this locally with Siege, and it appears to kill the chromium instances correctly; the queue management also seems to be working well.

I ran the following tests:

  • I ran tests with a large queue (100 items) and intermittently cancelled each request; the render tasks were aborted correctly
  • did the same and cancelled all the requests at once; again, no remaining chromium processes were visible
  • ran Siege for about an hour and intermittently killed some chromium workers; the workers were restarted correctly by the renderer, followed by the 'Queue health check'
  • rendering was aborted correctly when it took too long (renderTime=90000)
  • unplugged and replugged my network connection, and killed some chromium processes intermittently

Out of all this I only ever saw one exception, when I manually killed the main chromium instance:

[2018-09-17T11:48:55.573Z] FATAL: proton/6798 on jdrewniakwmf2204.local: Callback was already called. (err.levelPath=fatal/service-runner/unhandled)
    Error: Callback was already called.
        at /Users/jdrewniak/Gerrit/mediawiki/services/chromium-render/node_modules/async/internal/onlyOnce.js:9:32
        at timedRender (/Users/jdrewniak/Gerrit/mediawiki/services/chromium-render/lib/queue.js:206:17)
        at Timeout.timeoutCallback [as _onTimeout] (/Users/jdrewniak/Gerrit/mediawiki/services/chromium-render/node_modules/async/timeout.js:74:13)
        at ontimeout (timers.js:466:11)
        at tryOnTimeout (timers.js:304:5)
        at Timer.listOnTimeout (timers.js:267:5)
[2018-09-17T11:48:55.688Z] ERROR: proton/6796 on jdrewniakwmf2204.local: worker died, restarting (message="worker died, restarting", worker_pid=6798, exit_code=1, levelPath=error/service-runner/master)
[2018-09-17T11:48:56.522Z]  INFO: proton/6975 on jdrewniakwmf2204.local: Queue health check. (queueSize=0, runningTasksCount=0, levelPath=info/queue)
[2018-09-17T11:48:56.541Z]  INFO: proton/6975 on jdrewniakwmf2204.local: Worker 6975 listening on 0.0.0.0:3030 (levelPath=info)

Presumably that's because I forcefully killed the chromium process, and the onlyOnce method was doing its job in preventing the callback from being called multiple times. Afterwards, the chromium worker restarted successfully.

> [...]
> Presumably that's because I forcefully killed the chromium process, and the onlyOnce method was doing its job in preventing the callback from being called multiple times. Afterwards, the chromium worker restarted successfully.

Trying to kill Chromium manually was a good test! While we will not be killing instances manually, the kernel might decide to kill the process due to resourcing problems. As shown in the log entry, when that happens, the Proton worker servicing the request dies as well, which means the client receives no response. It would be good to add some code that checks for the existence of the browser process before issuing the kill() call.
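For instance (a sketch; exitCode and signalCode are standard Node.js ChildProcess properties that stay null until the process has exited):

    const child = browser.process();
    // Only send the signal if the child is still alive; the kernel (or an
    // operator) may have killed it already.
    if (child && child.exitCode === null && child.signalCode === null) {
        child.kill('SIGKILL');
    }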

@mobrovac that's a good point. I was going to add the check anyway, because that error kills the worker, which IMHO is pretty bad. The chromium-render process shouldn't die; all errors have to be handled properly. Also, if that error (callback called twice) is triggered during tests, it's more than likely to happen in production. I'll analyze the flow and try to prevent similar cases.

I'm thinking out loud here: there is only one case where we jump out of Puppeteer's scope. In renderer.closeBrowser() we ask the browser to close and then listen to the native ChildProcess exit event; if the exit doesn't happen, we kill the browser.
The scenario @Jdrewniak describes happens only when something other than our code (something other than Puppeteer and chromium-render) manages the chromium process. If we start listening to the exit event all the time (not only when we ask the browser to close), we could log and prevent such situations in a nicer manner.

Something like (pseudocode):

[...]
let isRenderInProgress = false;
[...]
articleToPdf(url, headers, puppeteerOptions, pdfOptions) {
    isRenderInProgress = true;
    [...]
    puppeteer.then((response) => {
        isRenderInProgress = false;
        [...]
        resolve(response.pdf);
    });
}

childProcess.on('exit', function () {
    if (isRenderInProgress) {
        // something bad happened: log it and gracefully exit the queue
    }
});

I'll work on a patch to prevent the situations @Jdrewniak mentioned, as the error is triggered by our "render timeout" logic. We need to clearTimeout when we return from the timedRender call; this is clearly a bug (see the sketch below).
I'm just wondering/thinking out loud: how often can something like that (something else killing our precious chromium instances) happen in production? Is it important to log/do special handling in such cases? /cc @mobrovac @phuedx
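A minimal sketch of the fix (illustrative; the real code is the timedRender path in lib/queue.js):

    function timedRender(render, timeoutMs, callback) {
        let settled = false;
        const timer = setTimeout(() => {
            if (!settled) {
                settled = true;
                callback(new Error(`Render timed out after ${timeoutMs}ms`));
            }
        }, timeoutMs);
        render((err, pdf) => {
            if (settled) {
                return; // the timeout already fired for this job
            }
            settled = true;
            clearTimeout(timer); // the missing piece: cancel the timeout
            callback(err, pdf);
        });
    }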

Change 462579 had a related patch set uploaded (by Pmiazga; owner: Pmiazga):
[mediawiki/services/chromium-render@master] Html2pdf route should return promise

https://gerrit.wikimedia.org/r/462579

Change 462579 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Html2pdf route should return promise

https://gerrit.wikimedia.org/r/462579

I've run the pdf-renderer again: overwhelmed it with lots of requests, left it on for an hour, intermittently killed the main chromium processes, killed the child processes, disconnected and reconnected the network, and I'm unable to reproduce my previous bug. Good job!