Page MenuHomePhabricator

Investigate 2019-03-01 Proton incident
Closed, ResolvedPublic

Description

Proton has a queue system to limit the number of simultaneous render jobs, but according to https://wikitech.wikimedia.org/wiki/Incident_documentation/20190301-proton it did not work as intended.

The Proton endpoint is a queue manager that does not do work other than checking if there are available workers and assigning render jobs to them. If there are no free workers (more precisely if the render queue is full) it returns a HTTP 503. A few dozen requests per second should not give it any trouble, and the number of simultaneously running renders should not go above queue limits.

The incident documentation says there were several dozen processes, some of them a month old. So maybe there is some situation (somehow triggered by high load?) where a rendering process gets stuck or zombified and it gets evicted from the queue without actually being terminated?

Event Timeline

The incident report mentions a hundred zombie processes stuck since Jan 30 (ie. for over a month). There was a request spike that day, causing a CPU and memory spike (and OOM kills), per T214975: proton experienced a period of high CPU usage, busy queue, lockups. The queue size is 50 per vars.yaml, but the incident report talks about 99 chromium processes, so clearly the queue failed to limit these. (That or some kind of double-counting is going on, 50 vs. 99 seems like an unlikely coincidence...)

The logs from T214975 show a bunch of timeouts (Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded) so the most likely explanation is that chromium instances do not get killed on timeout (either in general or when they get stuck) and high memory and CPU load gets them stuck somehow (and they stay stuck when load is reduced). Presumably we don't use the timeout option of firejail, maybe we should start with that. @MoritzMuehlenhoff, @akosiaris any thoughts?

The incident report mentions a hundred zombie processes stuck

Nope, not zombie proccesses (by zombie I mean https://en.wikipedia.org/wiki/Zombie_process). Their status was set to S (aka sleeping), not Z. Which is consistent with the fact the parent process was still alive.

That or some kind of double-counting is going on, 50 vs. 99 seems like an unlikely coincidence...

Could be, but I doubt it. The output of ps auxww |grep chromim|wc -l might very well count the pipeline itself, but that's just 1 process. Also threads are not displayed in the output of that command so that rules out a wrong counting of threads vs processes.

The logs from T214975 show a bunch of timeouts (Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded) so the most likely explanation is that chromium instances do not get killed on timeout (either in general or when they get stuck) and high memory and CPU load gets them stuck somehow (and they stay stuck when load is reduced). Presumably we don't use the timeout option of firejail, maybe we should start with that. @MoritzMuehlenhoff, @akosiaris any thoughts?

Does proton do any kind of supervision of the chromium instances? If not, maybe it's a path worth pursuing. From what I gather firejails timeout[1] would kill the entire sandbox after the timeout has passed killing requests.

[1] https://firejail.wordpress.com/features-3/man-firejail/

@akosiaris - yes, there is some logic that tries to kill the browser instances if browser.close() promise doesn't succeed[1]. Looks like this code requires some tuning, and most probably it should try to kill the browser instance after some time without waiting for promise resolution/failure.

[1] https://github.com/wikimedia/mediawiki-services-chromium-render/blob/master/lib/renderer.js#L72

@akosiaris - yes, there is some logic that tries to kill the browser instances if browser.close() promise doesn't succeed. Looks like this code requires some tuning, and most probably it should try to kill the browser instance after some time without waiting for promise resolution/failure.

Agreed. We try the two actions in series rather than making them race, i.e.

Renderer#_closeBrowser
const SIGKILL_AFTER = 3000; // 3 seconds

const childProcess = browser.process();
const t = setTimeout(() => {
    this._logger.log(
        'error/render', {
            msg: 'Killing chromium process',
            childId: childProcess.pid
        });

    try {
        childProcess.kill('SIGKILL');
    } catch (e) {}
}, SIGKILL_AFTER);

browser.close().then(() => {
    clearTimeout(t);

    resolve(); // Resolve the promise returned by Renderer#_closeBrowser
});

One alternative would be to launch the child Chromium process using firejail, which we're confident (?) does this kind of management.

@Tgr: When can this work be scheduled?

Better sooner than later I guess, switching over to Proton is a Q4 goal and I'll probably have less time once the MediaWiki REST API project spins up.

The question is, where to put the time limit? Since the Javascript logic already tries to set it, no reason not to fix that - so it could be puppeteer, puppeteer + firejail, or puppeteer + something on the system level (e.g. ulimit).

Firejail seems a nice extra for robustness, but @akosiaris seemed to suggest in T217724#5008302 that that might introduce some overhead. AIUI the sandbox is discarded anyway after every PDF render, to reduce the likelihood of problems caused by state leaking over to the next render, but I might be misunderstanding.

Firejail seems a nice extra for robustness, but @akosiaris seemed to suggest in T217724#5008302 that that might introduce some overhead. AIUI the sandbox is discarded anyway after every PDF render, to reduce the likelihood of problems caused by state leaking over to the next render, but I might be misunderstanding.

Not overhead in the sense of performance overhead, but more like time boxed upper bounded process kills. That is, it will kill all the processes in the firejail sandbox when the timeout is reached. An example below

$ firejail --timeout=00:00:04 bash -c "bash -c 'sleep 5; echo yes'"
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc
Warning: networking feature is disabled in Firejail configuration file

** Note: you can use --noprofile to disable default.profile **

Parent pid 1999, child pid 2000
Warning: cleaning all supplementary groups
Warning: cleaning all supplementary groups
Child process initialized in 67.28 ms

Parent is shutting down, bye...
$ firejail --timeout=00:00:06 bash -c "bash -c 'sleep 5; echo yes'"
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc
Warning: networking feature is disabled in Firejail configuration file

** Note: you can use --noprofile to disable default.profile **

Parent pid 2439, child pid 2440
Warning: cleaning all supplementary groups
Child process initialized in 52.60 ms
yes

Parent is shutting down, bye...

Note how in the first example, yes is never printed as the entire firejail sandbox is killed before sleep ever returns. The entire process (and all children if any) are killed. In the second case the timeout is reached after the sandbox has terminated on its own so yes is printed.

I am not sure how this would be instrumented in the code though. Would puppeteer just execute firejail chromium --blahblah instead of chromium --blahblah ? Cause this can't happen on the systemd level as it would end up needlessly killing and restarting the service.

Note that firejail is already used for proton (as pretty much all other services), albeit without --timeout (which was added in a later version and needs stretch-backports to obtain)

proton   16441  0.0  0.0  18264    12 ?        Ss   Apr08   0:00 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/proton/config.yaml

FWIW, running a firejail in a firejail works but does print out

Warning: an existing sandbox was detected. /bin/bash will run without any additional sandboxing features

Thanks, I didn't realize firejail wraps the whole service, not just the render processes. Let's go with JS-only then.

Change 503159 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/services/chromium-render@master] Make timeout handling more robust

https://gerrit.wikimedia.org/r/503159

Change 503159 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Make timeout handling more robust

https://gerrit.wikimedia.org/r/503159

Mentioned in SAL (#wikimedia-operations) [2019-05-14T17:20:23Z] <mbsantos@deploy1001> Started deploy [proton/deploy@881b22b]: Update chromium-render to 8cc96e7 make timeout handler more robust (T217724)

Mentioned in SAL (#wikimedia-operations) [2019-05-14T17:22:46Z] <mbsantos@deploy1001> Finished deploy [proton/deploy@881b22b]: Update chromium-render to 8cc96e7 make timeout handler more robust (T217724) (duration: 02m 23s)

The planned work has been done and T214975 can be used to track whether the problem reoccurs so no point in keeping this open.