Investigate 2019-03-01 Proton incident
Closed, ResolvedPublic
Actions

Description

Proton has a queue system to limit the number of simultaneous render jobs, but according to https://wikitech.wikimedia.org/wiki/Incident_documentation/20190301-proton it did not work as intended.

The Proton endpoint is a queue manager that does not do work other than checking if there are available workers and assigning render jobs to them. If there are no free workers (more precisely if the render queue is full) it returns a HTTP 503. A few dozen requests per second should not give it any trouble, and the number of simultaneously running renders should not go above queue limits.

The incident documentation says there were several dozen processes, some of them a month old. So maybe there is some situation (somehow triggered by high load?) where a rendering process gets stuck or zombified and it gets evicted from the queue without actually being terminated?

Details

	Subject	Repo	Branch	Lines +/-
	Make timeout handling more robust	mediawiki/services/chromium-render	master	+52 -20

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	ovasileva	T181079 [GOAL] Provide an expanded reading experience by improving the ways that users can download articles of interest for later consumption
Resolved	None	T181084 [EPIC] Deploy the mediawiki-services-chromium-render service (Proton)
Invalid	None	T214975 proton experienced a period of high CPU usage, busy queue, lockups
Resolved	Tgr	T217724 Investigate 2019-03-01 Proton incident

Event Timeline

Tgr created this task.Mar 6 2019, 7:25 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 6 2019, 7:25 AM

• Jhernandez triaged this task as High priority.Mar 6 2019, 4:33 PM

• Jhernandez moved this task from Needs triage to Upcoming on the Product-Infrastructure-Team-Backlog-Deprecated board.

The incident report mentions a hundred zombie processes stuck since Jan 30 (ie. for over a month). There was a request spike that day, causing a CPU and memory spike (and OOM kills), per T214975: proton experienced a period of high CPU usage, busy queue, lockups. The queue size is 50 per vars.yaml, but the incident report talks about 99 chromium processes, so clearly the queue failed to limit these. (That or some kind of double-counting is going on, 50 vs. 99 seems like an unlikely coincidence...)

The logs from T214975 show a bunch of timeouts (Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded) so the most likely explanation is that chromium instances do not get killed on timeout (either in general or when they get stuck) and high memory and CPU load gets them stuck somehow (and they stay stuck when load is reduced). Presumably we don't use the timeout option of firejail, maybe we should start with that. @MoritzMuehlenhoff, @akosiaris any thoughts?

In T217724#5007456, @Tgr wrote:

The incident report mentions a hundred zombie processes stuck

Nope, not zombie proccesses (by zombie I mean https://en.wikipedia.org/wiki/Zombie_process). Their status was set to S (aka sleeping), not Z. Which is consistent with the fact the parent process was still alive.

That or some kind of double-counting is going on, 50 vs. 99 seems like an unlikely coincidence...

Could be, but I doubt it. The output of ps auxww |grep chromim|wc -l might very well count the pipeline itself, but that's just 1 process. Also threads are not displayed in the output of that command so that rules out a wrong counting of threads vs processes.

The logs from T214975 show a bunch of timeouts (Unexpected error: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded) so the most likely explanation is that chromium instances do not get killed on timeout (either in general or when they get stuck) and high memory and CPU load gets them stuck somehow (and they stay stuck when load is reduced). Presumably we don't use the timeout option of firejail, maybe we should start with that. @MoritzMuehlenhoff, @akosiaris any thoughts?

Does proton do any kind of supervision of the chromium instances? If not, maybe it's a path worth pursuing. From what I gather firejails timeout[1] would kill the entire sandbox after the timeout has passed killing requests.

[1] https://firejail.wordpress.com/features-3/man-firejail/

• kchapman added a project: Platform Engineering.Mar 7 2019, 5:20 PM

pmiazga subscribed.Mar 7 2019, 5:38 PM

• kchapman moved this task from Inbox to Triage Meeting Inbox on the Platform Engineering board.Mar 12 2019, 2:14 PM

• kchapman moved this task from Triage Meeting Inbox to Needs Cleaning - Security, stability, performance, and scalability (TEC1) on the Platform Engineering board.

• kchapman edited projects, added Platform Engineering (Needs Cleaning - Security, stability, performance, and scalability (TEC1)); removed Platform Engineering.

@akosiaris - yes, there is some logic that tries to kill the browser instances if browser.close() promise doesn't succeed[1]. Looks like this code requires some tuning, and most probably it should try to kill the browser instance after some time without waiting for promise resolution/failure.

[1] https://github.com/wikimedia/mediawiki-services-chromium-render/blob/master/lib/renderer.js#L72

• Jhernandez added a parent task: T181084: [EPIC] Deploy the mediawiki-services-chromium-render service (Proton).Mar 25 2019, 9:43 AM

In T217724#5044953, @pmiazga wrote:

@akosiaris - yes, there is some logic that tries to kill the browser instances if browser.close() promise doesn't succeed. Looks like this code requires some tuning, and most probably it should try to kill the browser instance after some time without waiting for promise resolution/failure.

Agreed. We try the two actions in series rather than making them race, i.e.

Renderer#_closeBrowser

const SIGKILL_AFTER = 3000; // 3 seconds

const childProcess = browser.process();
const t = setTimeout(() => {
    this._logger.log(
        'error/render', {
            msg: 'Killing chromium process',
            childId: childProcess.pid
        });

    try {
        childProcess.kill('SIGKILL');
    } catch (e) {}
}, SIGKILL_AFTER);

browser.close().then(() => {
    clearTimeout(t);

    resolve(); // Resolve the promise returned by Renderer#_closeBrowser
});

One alternative would be to launch the child Chromium process using firejail, which we're confident (?) does this kind of management.

@Tgr: When can this work be scheduled?

MJL subscribed.Apr 7 2019, 9:33 PM

Better sooner than later I guess, switching over to Proton is a Q4 goal and I'll probably have less time once the MediaWiki REST API project spins up.

The question is, where to put the time limit? Since the Javascript logic already tries to set it, no reason not to fix that - so it could be puppeteer, puppeteer + firejail, or puppeteer + something on the system level (e.g. ulimit).

Firejail seems a nice extra for robustness, but @akosiaris seemed to suggest in T217724#5008302 that that might introduce some overhead. AIUI the sandbox is discarded anyway after every PDF render, to reduce the likelihood of problems caused by state leaking over to the next render, but I might be misunderstanding.

Firejail seems a nice extra for robustness, but @akosiaris seemed to suggest in T217724#5008302 that that might introduce some overhead. AIUI the sandbox is discarded anyway after every PDF render, to reduce the likelihood of problems caused by state leaking over to the next render, but I might be misunderstanding.

Not overhead in the sense of performance overhead, but more like time boxed upper bounded process kills. That is, it will kill all the processes in the firejail sandbox when the timeout is reached. An example below

$ firejail --timeout=00:00:04 bash -c "bash -c 'sleep 5; echo yes'"
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc
Warning: networking feature is disabled in Firejail configuration file

** Note: you can use --noprofile to disable default.profile **

Parent pid 1999, child pid 2000
Warning: cleaning all supplementary groups
Warning: cleaning all supplementary groups
Child process initialized in 67.28 ms

Parent is shutting down, bye...
$ firejail --timeout=00:00:06 bash -c "bash -c 'sleep 5; echo yes'"
Reading profile /etc/firejail/default.profile
Reading profile /etc/firejail/disable-common.inc
Reading profile /etc/firejail/disable-passwdmgr.inc
Reading profile /etc/firejail/disable-programs.inc
Warning: networking feature is disabled in Firejail configuration file

** Note: you can use --noprofile to disable default.profile **

Parent pid 2439, child pid 2440
Warning: cleaning all supplementary groups
Child process initialized in 52.60 ms
yes

Parent is shutting down, bye...

Note how in the first example, yes is never printed as the entire firejail sandbox is killed before sleep ever returns. The entire process (and all children if any) are killed. In the second case the timeout is reached after the sandbox has terminated on its own so yes is printed.

I am not sure how this would be instrumented in the code though. Would puppeteer just execute firejail chromium --blahblah instead of chromium --blahblah ? Cause this can't happen on the systemd level as it would end up needlessly killing and restarting the service.

Note that firejail is already used for proton (as pretty much all other services), albeit without --timeout (which was added in a later version and needs stretch-backports to obtain)

proton   16441  0.0  0.0  18264    12 ?        Ss   Apr08   0:00 /usr/bin/firejail --blacklist=/root --blacklist=/home --caps --seccomp /usr/bin/nodejs src/server.js -c /etc/proton/config.yaml

FWIW, running a firejail in a firejail works but does print out

Warning: an existing sandbox was detected. /bin/bash will run without any additional sandboxing features

Thanks, I didn't realize firejail wraps the whole service, not just the render processes. Let's go with JS-only then.

Change 503159 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/services/chromium-render@master] Make timeout handling more robust

https://gerrit.wikimedia.org/r/503159

gerritbot added a project: Patch-For-Review.Apr 12 2019, 3:11 AM

Tgr mentioned this in rMSCR08657232ce5c: Make timeout handling more robust.Apr 12 2019, 3:14 AM

Tgr mentioned this in rMSCR455159698c1a: Make timeout handling more robust.Apr 12 2019, 3:28 AM

Tgr moved this task from To Do to Code Review on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Apr 12 2019, 3:48 AM

Tgr mentioned this in rMSCR8cc96e7eaf33: Make timeout handling more robust.Apr 12 2019, 6:55 PM

Change 503159 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Make timeout handling more robust

https://gerrit.wikimedia.org/r/503159

Tgr moved this task from Code Review to To Deploy on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Apr 30 2019, 3:34 PM

Tgr added a parent task: T214975: proton experienced a period of high CPU usage, busy queue, lockups.May 9 2019, 8:59 PM

Tgr mentioned this in T214975: proton experienced a period of high CPU usage, busy queue, lockups.May 10 2019, 2:43 AM

Mentioned in SAL (#wikimedia-operations) [2019-05-14T17:20:23Z] <mbsantos@deploy1001> Started deploy [proton/deploy@881b22b]: Update chromium-render to 8cc96e7 make timeout handler more robust (T217724)

Mentioned in SAL (#wikimedia-operations) [2019-05-14T17:22:46Z] <mbsantos@deploy1001> Finished deploy [proton/deploy@881b22b]: Update chromium-render to 8cc96e7 make timeout handler more robust (T217724) (duration: 02m 23s)

MSantos moved this task from To Deploy to Sign off on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.May 14 2019, 5:36 PM

The planned work has been done and T214975 can be used to track whether the problem reoccurs so no point in keeping this open.

Investigate 2019-03-01 Proton incidentClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Investigate 2019-03-01 Proton incident
Closed, ResolvedPublic
Actions

Related Objects
Search...