Page MenuHomePhabricator

Limit resources used by Chromium in order to make the chromium-render service ready for production
Closed, ResolvedPublic8 Estimated Story Points

Description

We'd like to make the chromium-render service ready for production and stress testing by making the service robust.

Taken from @mobrovac's comments at T176627#3691956.

A/C

  • Every request should spawn a new headless Chromium instance;
  • time-boxed execution: there should be strong guarantees that a given Chromium render cycle (i.e. the sub-process' life) will not last longer than X seconds (configurable)
  • number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Sign off notes are here: T178501#3792086

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@bmansurov: Is it possible to do this (spawn X Chromium processes and connect to them) with puppeteer?

@bmansurov: Is it possible to do this (spawn X Chromium processes and connect to them) with puppeteer?

I suppose so; haven't verified it though.

@bmansurov: Is it possible to do this (spawn X Chromium processes and connect to them) with puppeteer?

This is what Electron does at start-up, and the problem with that is that, just like your normal browser, the instances lock up memory over time. Moreover, if a request locks the process up, you are left with dysfunctional instances and observe the hanging problems.

Sorry @mobrovac, I phrased my question poorly. I'll try again:

Is it possible to spawn a new instance of Chromium for every request and connect to it with puppeteer?

Is it possible to spawn a new instance of Chromium for every request and connect to it with puppeteer?

Yes! In

const puppeteer = require( 'puppeteer' );
const browser = puppeteer.launch();

[[ https://github.com/GoogleChrome/puppeteer/blob/4f64dfd9935e87c2752a4db04867b4a9928c2cd3/lib/Puppeteer.js#L20-L26 | Pupetteer#launch ]] invokes [[ https://github.com/GoogleChrome/puppeteer/blob/4f64dfd9935e87c2752a4db04867b4a9928c2cd3/lib/Launcher.js#L60-L166 | Launcher#launch ]], which, amongst other things, launches a new instance of Chromium and connects to it.

Bringing the task into the sprint as it's a subtask of another task that's in the sprint already.

bmansurov renamed this task from Limit resources used by Chromium to [subtask] Limit resources used by Chromium.Oct 19 2017, 7:26 PM
bmansurov claimed this task.
bmansurov moved this task from To Do to Doing on the Readers-Web-Kanbanana-Board-Old board.

@mobrovac, regarding:

number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Gabriel advises against this in T176627#3647181 with a follow up comment in T176627#3651737. How do you think this requirement be implemented? Were you thinking of having the limits in the node service or outside it?

@mobrovac, regarding:

number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Gabriel advises against this in T176627#3647181 with a follow up comment in T176627#3651737. How do you think this requirement be implemented? Were you thinking of having the limits in the node service or outside it?

Gabriel's comments refer to the way Electron works, whereby X processes are spawned when the service starts up and are kept around until its death, which causes the hanging issues. My suggestion above is different: there should be no processes spawned at the beginning, but instead a new Chromium instance should be spawned on each request. However, to limit the amount of resources during high load, the number of active Chromium processes should not surpass a preconfigured value. As an example, let the maximum number of processes be 5. For each of the first five requests, a new instance is spawned. However, for request #6, a Chomium instance cannot be created until any of the previous requests are complete and the Chromium process associated with it is killed.

@mobrovac thanks for clarifying. Since T167906 hasn't been implemented yet, I suppose you're suggesting we do this kind of process limiting inside the service, right?

@mobrovac thanks for clarifying. Since T167906 hasn't been implemented yet, I suppose you're suggesting we do this kind of process limiting inside the service, right?

This is orthogonal to the problems described in that ticket - the service must take care of its own resources regardless of the overall API limits and policies. So, yes, the aforementioned mechanism needs to be implemented in the service itself.

Change 387369 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Limit number of concurrent renders

https://gerrit.wikimedia.org/r/387369

Change 387642 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] WIP: add timeout to requests in the queue

https://gerrit.wikimedia.org/r/387642

Change 387369 merged by Mobrovac:
[mediawiki/services/chromium-render@master] Limit number of concurrent renders

https://gerrit.wikimedia.org/r/387369

Change 389797 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Add timeout to requests in the queue

https://gerrit.wikimedia.org/r/389797

Change 390144 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Terminate browser process after certain time

https://gerrit.wikimedia.org/r/390144

Change 387642 merged by Pmiazga:
[mediawiki/services/chromium-render@master] Add timeout to requests in the queue

https://gerrit.wikimedia.org/r/387642

Change 391072 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/services/chromium-render@master] Lower render queue timeout to 60 seconds

https://gerrit.wikimedia.org/r/391072

Change 391072 merged by Pmiazga:
[mediawiki/services/chromium-render@master] Lower render queue timeout to 60 seconds

https://gerrit.wikimedia.org/r/391072

Change 389797 merged by Pmiazga:
[mediawiki/services/chromium-render@master] Add task count limit to requests in the queue

https://gerrit.wikimedia.org/r/389797

MBinder_WMF renamed this task from [subtask] Limit resources used by Chromium to Limit resources used by Chromium.Nov 15 2017, 6:07 PM
Jdlrobson set the point value for this task to 8.Nov 15 2017, 6:13 PM
bmansurov renamed this task from Limit resources used by Chromium to Limit resources used by Chromium in order to make the chromium-render service ready for production.Nov 15 2017, 7:18 PM
bmansurov updated the task description. (Show Details)

My findings after stress testing

  • when there is no memory (100 concurrent requests eats like 14GB of RAM) chromium instances start to fail, puppeteer handles that pretty well for up to 30 seconds and then whole node process goes down, sometimes taking down system applications (for example once it killed my phpstorm and two xterminal instances). For this one we have to do a proper firejail, otherwise we need a watchdog verifies service health (node process is there and running)
  • service works pretty well under 100% usage for very long time (max tested is 30 mins)
    • on a soft load (0-2 concurrent requests) service works pretty stable
  • queue keeps running even we close the browser connection - T180604
  • we need to add proper logging so it's easier to debug whats happening with the queue - T180601
  • when running ~4 concurrent requests and generating PDFs after some time chromium starts to render incomplete books. For example, after 10 minutes of processing 3-4 books at once, when I start asking to render the same book 20 times, some of PDF's will not have all pages. I'm affraid it may also happen for normal load, but first It would be nice if someone else can verify this behaviour on their instance. As an example,

24 concurrent requests for the same article after 15 minutes from starting the service and handling regular load. (Concurrency set to 5)

output/letter.3.pdf,      Size: 1607318,     Pages: 103
output/letter.3.1.pdf,    Size: 1663010,     Pages: 106
output/letter.3.2.pdf,    Size: 1844745,     Pages: 117
output/letter.3.3.pdf,    Size: 1925602,     Pages: 122
output/letter.3.4.pdf,    Size: 2022864,     Pages: 128
output/letter.3.5.pdf,    Size: 2041099,     Pages: 129
output/letter.3.6.pdf,    Size: 1989220,     Pages: 126
output/letter.3.7.pdf,    Size: 2073476,     Pages: 131
output/letter.3.8.pdf,    Size: 1976912,     Pages: 125
output/letter.3.9.pdf,    Size: 1993072,     Pages: 126
output/letter.3.10.pdf,   Size: 2098728,     Pages: 133
output/letter.3.11.pdf,   Size: 2254464,     Pages: 143
output/letter.3.12.pdf,   Size: 2439026,     Pages: 150
output/letter.3.13.pdf,   Size: 2666404,     Pages: 155
output/letter.3.14.pdf,   Size: 2516522,     Pages: 152
output/letter.3.15.pdf,   Size: 2685310,     Pages: 155
output/letter.3.16.pdf,   Size: 2800245,     Pages: 158
output/letter.3.17.pdf,   Size: 2861330,     Pages: 159
output/letter.3.18.pdf,   Size: 2746960,     Pages: 157
output/letter.3.19.pdf,   Size: 2861330,     Pages: 159
output/letter.3.20.pdf,   Size: 2861330,     Pages: 159
output/letter.3.21.pdf,   Size: 2861330,     Pages: 159
output/letter.3.22.pdf,   Size: 2792641,     Pages: 158
output/letter.3.23.pdf,   Size: 2746380,     Pages: 157

For 24 generated files, only 4 have correct size (2861330 bytes, 159 pages).
@bmansurov can you verify that this happens only on my machine?

Change 390144 merged by Mobrovac:
[mediawiki/services/chromium-render@master] Terminate browser process after certain time

https://gerrit.wikimedia.org/r/390144

Yes, I think the A/C has been met.

@phuedx we can move T178501#3767355 to performance test the service. During light load the service handles traffic well, but during stress tests with 2-3 concurrent requests, puppeteer tends to close some browser instances too early (that's my assumption as some rendered PDFs are rendered partially). I'm not sure if this is only my testing environment hiccups and I asked @bmansurov to try to reproduce this issue (wrong pdf size) locally on his machine.

I'll add "pdf verification step" to the T178278 and move this task to "Ready for sign off", testing is not required as we're going to do the performance testing when everything is ready.

Sign off notes:

Every request should spawn a new headless Chromium instance;

This is done by virtue of [[ https://github.com/wikimedia/mediawiki-services-chromium-render/blob/0451e29c860b3148801d7376ace193533c29337f/lib/renderer.js#L33 | using Puppeteer#launch when rendering ]], which launches a new instance of Chromium and connects to it, as I noted in T178501#3695924 above.

time-boxed execution: there should be strong guarantees that a given Chromium render cycle (i.e. the sub-process' life) will not last longer than X seconds (configurable)

Done in rMSCR0451e29c860b: Terminate browser process after certain time.

number of spawned Chromium instances: the service should track the number of currently-opened Chromium instances and not allow a configured maximum of them to be spawned; should there be a request waiting to be fulfilled, it should be postponed until such time as an instance slot becomes available or rejected if the maximum amount of wait time has elapsed.

Done in rMSCR137064be94a7: Add task count limit to requests in the queue.

phuedx updated the task description. (Show Details)

All the AC are met. This LGTM.

🎉🎉🎉