Limit resources used by Chromium in order to make the chromium-render service ready for production
Closed, ResolvedPublic8 Estimated Story Points
Actions

Assigned To

Authored By

	• bmansurov
	Oct 18 2017, 1:57 PM

Description

We'd like to make the chromium-render service ready for production and stress testing by making the service robust.

Taken from @mobrovac's comments at T176627#3691956.

A/C

Every request should spawn a new headless Chromium instance;
time-boxed execution: there should be strong guarantees that a given Chromium render cycle (i.e. the sub-process' life) will not last longer than X seconds (configurable)
number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Sign off notes are here: T178501#3792086

Details

Subject	Repo	Branch	Lines +/-
Terminate browser process after certain time	mediawiki/services/chromium-render	master	+162 -57
Add task count limit to requests in the queue	mediawiki/services/chromium-render	master	+148 -64
Lower render queue timeout to 60 seconds	mediawiki/services/chromium-render	master	+5 -5
Add timeout to requests in the queue	mediawiki/services/chromium-render	master	+214 -48
Limit number of concurrent renders	mediawiki/services/chromium-render	master	+201 -3 K

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Resolved	• mobrovac	T159922 pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003
Invalid	None	T172815 Improve stability and maintainability of our browser-based PDF render service
Resolved	ovasileva	T181079 [GOAL] Provide an expanded reading experience by improving the ways that users can download articles of interest for later consumption
Resolved	None	T181084 [EPIC] Deploy the mediawiki-services-chromium-render service (Proton)
Resolved	phuedx	T181118 [EPIC] Build a Chromium-based PDF renderer service
Resolved	phuedx	T178501 Limit resources used by Chromium in order to make the chromium-render service ready for production

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 18 2017, 1:57 PM

• bmansurov mentioned this in T176627: Trial replacing Electron with headless Chromium in the render service.Oct 18 2017, 1:57 PM

ovasileva moved this task from Incoming to Upcoming on the Web-Team-Backlog board.Oct 18 2017, 3:47 PM

@bmansurov: Is it possible to do this (spawn X Chromium processes and connect to them) with puppeteer?

In T178501#3694700, @phuedx wrote:

@bmansurov: Is it possible to do this (spawn X Chromium processes and connect to them) with puppeteer?

I suppose so; haven't verified it though.

In T178501#3694700, @phuedx wrote:

@bmansurov: Is it possible to do this (spawn X Chromium processes and connect to them) with puppeteer?

This is what Electron does at start-up, and the problem with that is that, just like your normal browser, the instances lock up memory over time. Moreover, if a request locks the process up, you are left with dysfunctional instances and observe the hanging problems.

Sorry @mobrovac, I phrased my question poorly. I'll try again:

Is it possible to spawn a new instance of Chromium for every request and connect to it with puppeteer?

In T178501#3695865, @phuedx wrote:

Is it possible to spawn a new instance of Chromium for every request and connect to it with puppeteer?

Yes! In

const puppeteer = require( 'puppeteer' );
const browser = puppeteer.launch();

[[ https://github.com/GoogleChrome/puppeteer/blob/4f64dfd9935e87c2752a4db04867b4a9928c2cd3/lib/Puppeteer.js#L20-L26 | Pupetteer#launch ]] invokes [[ https://github.com/GoogleChrome/puppeteer/blob/4f64dfd9935e87c2752a4db04867b4a9928c2cd3/lib/Launcher.js#L60-L166 | Launcher#launch ]], which, amongst other things, launches a new instance of Chromium and connects to it.

Bringing the task into the sprint as it's a subtask of another task that's in the sprint already.

• bmansurov renamed this task from Limit resources used by Chromium to [subtask] Limit resources used by Chromium.Oct 19 2017, 7:26 PM

• bmansurov claimed this task.

• bmansurov moved this task from To Do to Doing on the Readers-Web-Kanbanana-Board-Old board.

@mobrovac, regarding:

number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Gabriel advises against this in T176627#3647181 with a follow up comment in T176627#3651737. How do you think this requirement be implemented? Were you thinking of having the limits in the node service or outside it?

In T178501#3697906, @bmansurov wrote:

@mobrovac, regarding:

number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Gabriel advises against this in T176627#3647181 with a follow up comment in T176627#3651737. How do you think this requirement be implemented? Were you thinking of having the limits in the node service or outside it?

Gabriel's comments refer to the way Electron works, whereby X processes are spawned when the service starts up and are kept around until its death, which causes the hanging issues. My suggestion above is different: there should be no processes spawned at the beginning, but instead a new Chromium instance should be spawned on each request. However, to limit the amount of resources during high load, the number of active Chromium processes should not surpass a preconfigured value. As an example, let the maximum number of processes be 5. For each of the first five requests, a new instance is spawned. However, for request #6, a Chomium instance cannot be created until any of the previous requests are complete and the Chromium process associated with it is killed.

@mobrovac thanks for clarifying. Since T167906 hasn't been implemented yet, I suppose you're suggesting we do this kind of process limiting inside the service, right?

In T178501#3699432, @bmansurov wrote:

@mobrovac thanks for clarifying. Since T167906 hasn't been implemented yet, I suppose you're suggesting we do this kind of process limiting inside the service, right?

This is orthogonal to the problems described in that ticket - the service must take care of its own resources regardless of the overall API limits and policies. So, yes, the aforementioned mechanism needs to be implemented in the service itself.

Jdlrobson moved this task from Upcoming to 2017-18 Q2 on the Web-Team-Backlog board.Oct 20 2017, 8:14 PM

• bmansurov mentioned this in T178278: Performance test the service.Oct 30 2017, 6:41 PM

Change 387369 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Limit number of concurrent renders

https://gerrit.wikimedia.org/r/387369

gerritbot added a project: Patch-For-Review.Oct 30 2017, 8:58 PM

• bmansurov mentioned this in rMSCR159edf3efebd: Limit number of concurrent renders.Oct 30 2017, 9:18 PM

Change 387642 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] WIP: add timeout to requests in the queue

https://gerrit.wikimedia.org/r/387642

• bmansurov mentioned this in rMSCR214edaf50a9d: WIP: add timeout to requests in the queue.Oct 31 2017, 7:23 PM

• bmansurov mentioned this in rMSCRf239e68fd7bb: Limit number of concurrent renders.Nov 1 2017, 6:10 PM

• bmansurov mentioned this in rMSCRb1fa24a1974e: Limit number of concurrent renders.Nov 1 2017, 10:19 PM

• bmansurov mentioned this in rMSCRe341a5cff4c9: Limit number of concurrent renders.Nov 2 2017, 3:58 PM

• bmansurov mentioned this in rMSCR58639292eb2b: WIP: add timeout to requests in the queue.Nov 2 2017, 10:00 PM

• bmansurov mentioned this in rMSCRb2e5fbcbb33a: WIP: Add timeout to requests in the queue.Nov 3 2017, 8:43 PM

Change 387369 merged by Mobrovac:
[mediawiki/services/chromium-render@master] Limit number of concurrent renders

https://gerrit.wikimedia.org/r/387369

• bmansurov mentioned this in rMSCRceefe48be143: Add timeout to requests in the queue.Nov 6 2017, 10:34 PM

phuedx mentioned this in T178189: [spike] Temporarily allow pushing large objects.Nov 7 2017, 12:44 PM

• bmansurov mentioned this in rMSCR0228898b3aa9: Add timeout to requests in the queue.Nov 7 2017, 4:29 PM

Change 389797 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Add timeout to requests in the queue

https://gerrit.wikimedia.org/r/389797

• bmansurov mentioned this in rMSCR0ed87943fa8c: Add timeout to requests in the queue.Nov 7 2017, 8:51 PM

• bmansurov mentioned this in rMSCR2b0fa7127d01: Add task count limit to requests in the queue.Nov 7 2017, 8:54 PM

• bmansurov mentioned this in rMSCRa1d9dc83adf0: Add timeout to requests in the queue.

Change 390144 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Terminate browser process after certain time

https://gerrit.wikimedia.org/r/390144

• bmansurov mentioned this in rMSCR6e4e3f68661e: Add timeout to requests in the queue.Nov 8 2017, 9:49 PM

• bmansurov mentioned this in rMSCRdebe0d4c4443: Terminate browser process after certain time.

• bmansurov mentioned this in rMSCR1e4dc7a4a19f: Add timeout to requests in the queue.Nov 9 2017, 11:16 PM

• bmansurov mentioned this in T180037: [Spike] Can the new render service run on Debian Stretch?.Nov 9 2017, 11:26 PM

Change 387642 merged by Pmiazga:
[mediawiki/services/chromium-render@master] Add timeout to requests in the queue

https://gerrit.wikimedia.org/r/387642

Change 391072 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Lower render queue timeout to 60 seconds

https://gerrit.wikimedia.org/r/391072

• bmansurov mentioned this in rMSCRd284ad0e66b9: Lower render queue timeout to 60 seconds.Nov 13 2017, 8:03 PM

• bmansurov mentioned this in rMSCRd54c03038b50: Add task count limit to requests in the queue.Nov 13 2017, 9:36 PM

• bmansurov mentioned this in rMSCR4dc17d03cff0: Lower render queue timeout to 60 seconds.Nov 14 2017, 4:20 PM

• bmansurov mentioned this in rMSCR83a1a7127348: Lower render queue timeout to 60 seconds.Nov 14 2017, 8:46 PM

Change 391072 merged by Pmiazga:
[mediawiki/services/chromium-render@master] Lower render queue timeout to 60 seconds

https://gerrit.wikimedia.org/r/391072

• bmansurov mentioned this in rMSCR137064be94a7: Add task count limit to requests in the queue.Nov 14 2017, 11:24 PM

Change 389797 merged by Pmiazga:
[mediawiki/services/chromium-render@master] Add task count limit to requests in the queue

https://gerrit.wikimedia.org/r/389797

phuedx mentioned this in T180601: Add logging to the mediawiki-services-chromium-render service.Nov 15 2017, 2:14 PM

MBinder_WMF renamed this task from [subtask] Limit resources used by Chromium to Limit resources used by Chromium.Nov 15 2017, 6:07 PM

Jdlrobson set the point value for this task to 8.Nov 15 2017, 6:13 PM

• bmansurov renamed this task from Limit resources used by Chromium to Limit resources used by Chromium in order to make the chromium-render service ready for production.Nov 15 2017, 7:18 PM

• bmansurov updated the task description. (Show Details)

I've extracted a sub-A/C into T180626: [Spike 8hr] How should we limit resources used by chromium render service?.

• bmansurov mentioned this in rMSCR373ea694b73a: WIP: Terminate browser process after certain time.Nov 15 2017, 11:25 PM

• bmansurov moved this task from Doing to Needs Code Review on the Readers-Web-Kanbanana-Board-Old board.Nov 16 2017, 1:22 PM

• bmansurov mentioned this in rMSCRbf01b2ff58d7: Terminate browser process after certain time.Nov 16 2017, 1:24 PM

My findings after stress testing

when there is no memory (100 concurrent requests eats like 14GB of RAM) chromium instances start to fail, puppeteer handles that pretty well for up to 30 seconds and then whole node process goes down, sometimes taking down system applications (for example once it killed my phpstorm and two xterminal instances). For this one we have to do a proper firejail, otherwise we need a watchdog verifies service health (node process is there and running)
service works pretty well under 100% usage for very long time (max tested is 30 mins)
- on a soft load (0-2 concurrent requests) service works pretty stable
queue keeps running even we close the browser connection - T180604
we need to add proper logging so it's easier to debug whats happening with the queue - T180601
when running ~4 concurrent requests and generating PDFs after some time chromium starts to render incomplete books. For example, after 10 minutes of processing 3-4 books at once, when I start asking to render the same book 20 times, some of PDF's will not have all pages. I'm affraid it may also happen for normal load, but first It would be nice if someone else can verify this behaviour on their instance. As an example,

24 concurrent requests for the same article after 15 minutes from starting the service and handling regular load. (Concurrency set to 5)

output/letter.3.pdf,      Size: 1607318,     Pages: 103
output/letter.3.1.pdf,    Size: 1663010,     Pages: 106
output/letter.3.2.pdf,    Size: 1844745,     Pages: 117
output/letter.3.3.pdf,    Size: 1925602,     Pages: 122
output/letter.3.4.pdf,    Size: 2022864,     Pages: 128
output/letter.3.5.pdf,    Size: 2041099,     Pages: 129
output/letter.3.6.pdf,    Size: 1989220,     Pages: 126
output/letter.3.7.pdf,    Size: 2073476,     Pages: 131
output/letter.3.8.pdf,    Size: 1976912,     Pages: 125
output/letter.3.9.pdf,    Size: 1993072,     Pages: 126
output/letter.3.10.pdf,   Size: 2098728,     Pages: 133
output/letter.3.11.pdf,   Size: 2254464,     Pages: 143
output/letter.3.12.pdf,   Size: 2439026,     Pages: 150
output/letter.3.13.pdf,   Size: 2666404,     Pages: 155
output/letter.3.14.pdf,   Size: 2516522,     Pages: 152
output/letter.3.15.pdf,   Size: 2685310,     Pages: 155
output/letter.3.16.pdf,   Size: 2800245,     Pages: 158
output/letter.3.17.pdf,   Size: 2861330,     Pages: 159
output/letter.3.18.pdf,   Size: 2746960,     Pages: 157
output/letter.3.19.pdf,   Size: 2861330,     Pages: 159
output/letter.3.20.pdf,   Size: 2861330,     Pages: 159
output/letter.3.21.pdf,   Size: 2861330,     Pages: 159
output/letter.3.22.pdf,   Size: 2792641,     Pages: 158
output/letter.3.23.pdf,   Size: 2746380,     Pages: 157

For 24 generated files, only 4 have correct size (2861330 bytes, 159 pages).
@bmansurov can you verify that this happens only on my machine?

• bmansurov mentioned this in rMSCReefe2b88309f: Terminate browser process after certain time.Nov 17 2017, 5:36 PM

• bmansurov mentioned this in rMSCR204ab2a7d08d: Terminate browser process after certain time.Nov 21 2017, 9:02 PM

• bmansurov mentioned this in rMSCR0cd0b21ceee6: Terminate browser process after certain time.Nov 21 2017, 10:38 PM

phuedx removed a parent task: T176627: Trial replacing Electron with headless Chromium in the render service.Nov 22 2017, 8:24 AM

phuedx added a parent task: T181118: [EPIC] Build a Chromium-based PDF renderer service.

• bmansurov mentioned this in rMSCRfe32cd08171f: Terminate browser process after certain time.Nov 22 2017, 3:58 PM

• bmansurov mentioned this in rMSCR0451e29c860b: Terminate browser process after certain time.Nov 23 2017, 7:06 PM

Change 390144 merged by Mobrovac:
[mediawiki/services/chromium-render@master] Terminate browser process after certain time

https://gerrit.wikimedia.org/r/390144

Now that rMSCR0451e29c860b: Terminate browser process after certain time is merged, what's the status of this task? Have the AC been met?

Also, @pmiazga: should T178501#3767355 be moved to T178278: Performance test the service?

Yes, I think the A/C has been met.

@phuedx we can move T178501#3767355 to performance test the service. During light load the service handles traffic well, but during stress tests with 2-3 concurrent requests, puppeteer tends to close some browser instances too early (that's my assumption as some rendered PDFs are rendered partially). I'm not sure if this is only my testing environment hiccups and I asked @bmansurov to try to reproduce this issue (wrong pdf size) locally on his machine.

I'll add "pdf verification step" to the T178278 and move this task to "Ready for sign off", testing is not required as we're going to do the performance testing when everything is ready.

pmiazga moved this task from Needs Code Review to Ready for Signoff on the Readers-Web-Kanbanana-Board-Old board.Nov 27 2017, 1:34 PM

Thanks, both.

👍

phuedx removed • bmansurov as the assignee of this task.Nov 27 2017, 1:48 PM

phuedx claimed this task.Nov 27 2017, 6:04 PM

phuedx updated the task description. (Show Details)Nov 28 2017, 11:40 AM

Sign off notes:

Every request should spawn a new headless Chromium instance;

This is done by virtue of [[ https://github.com/wikimedia/mediawiki-services-chromium-render/blob/0451e29c860b3148801d7376ace193533c29337f/lib/renderer.js#L33 | using Puppeteer#launch when rendering ]], which launches a new instance of Chromium and connects to it, as I noted in T178501#3695924 above.

time-boxed execution: there should be strong guarantees that a given Chromium render cycle (i.e. the sub-process' life) will not last longer than X seconds (configurable)

Done in rMSCR0451e29c860b: Terminate browser process after certain time.

number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Done in rMSCR137064be94a7: Add task count limit to requests in the queue.

All the AC are met. This LGTM.

🎉🎉🎉

Limit resources used by Chromium in order to make the chromium-render service ready for productionClosed, ResolvedPublic8 Estimated Story PointsActions

Description

A/C

Details

Related ObjectsSearch...

Event Timeline

Limit resources used by Chromium in order to make the chromium-render service ready for production
Closed, ResolvedPublic8 Estimated Story Points
Actions

Related Objects
Search...