
[Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books
Closed, Resolved · Public

Description

We want to see if headless Chromium is a viable solution for rendering books of various sizes.

Generate PDFs of books with the following number of pages (approximately):

  • 10
  • 25
  • 100
  • 1,000
  • 2,500
  • 5,000
  • 10,000

And measure the following for each book:

  • CPU usage;
  • Memory usage;
  • Time spent.

Make sure to render each type of book multiple times and take the average of measurements.

A/C

  • Settle on a Python wrapper for headless Chromium and document the decision;
  • Create a simple Python service that takes the URL of a page and generates a PDF from the contents of that URL; share the source code in this task!
  • Upload the measurements from above; don't forget to include your environment setup, too.
  • Upload the resulting PDFs here.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sep 13 2017, 6:37 PM

I assume you mean Chromium? Google Chrome isn't free software.

Yep, will fix it.

bmansurov renamed this task from Investigate the ability of Python wrapped headless Chrome to render large books to Investigate the ability of Python wrapped headless Chromium to render large books. · Sep 13 2017, 6:39 PM
bmansurov updated the task description. (Show Details)
bmansurov added subscribers: ovasileva, phuedx.

@bmansurov: Is the 2500 number grounded in anything? If not, I suggest increasing the scope of the spike to include profiling CPU utilisation, memory consumption, and wall time as a function of book size.

No; as I remember, the number came up during our meeting earlier. Your suggestion sounds good.

bmansurov updated the task description. (Show Details) · Sep 13 2017, 6:47 PM
bmansurov updated the task description. (Show Details) · Sep 13 2017, 6:53 PM
ovasileva renamed this task from Investigate the ability of Python wrapped headless Chromium to render large books to [Spike] Investigate the ability of Python wrapped headless Chromium to render large books. · Sep 13 2017, 6:54 PM
ovasileva triaged this task as High priority.
ovasileva added projects: Spike, Proton.
ovasileva moved this task from Triage to Current Sprint on the Proton board. · Sep 13 2017, 7:08 PM
ovasileva renamed this task from [Spike] Investigate the ability of Python wrapped headless Chromium to render large books to [Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books. · Sep 14 2017, 5:15 PM
bmansurov updated the task description. (Show Details) · Sep 14 2017, 6:00 PM

There is an official Node API for headless Chromium [1] called puppeteer, and
there is an unofficial Python port [2] called pyppeteer. The Python port
seems to be actively developed. However, its printing capability looks
unfinished [3] and doesn't work (it times out with RESTBase URLs).

The printing options such as disabling the automatic header and footer are not
exposed to the command line either [4].

Our only option seems to be to use puppeteer (i.e. the Node version), which means
that in addition to creating a service for headless Chromium, we'll have to
create a new service for PDF post-processing in Python. @phuedx, @ovasileva,
what do you think? Should we change this task to use Node.js?

[1] https://github.com/GoogleChrome/puppeteer
[2] https://github.com/miyakogi/pyppeteer
[3] https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.pdf
[4] https://bugs.chromium.org/p/chromium/issues/detail?id=603559

bmansurov added a comment. · Edited · Sep 18 2017, 9:12 PM

Setup

System

$ uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15) x86_64 GNU/Linux

CPU

$ cat /proc/cpuinfo | grep "MHz\|cores"
cpu MHz		: 2807.948
cpu cores	: 1

Memory

$ free -m
             total       used       free     shared    buffers     cached
Mem:         12043        246      11797          8          9        152
-/+ buffers/cache:         84      11959
Swap:          382          0        382

HTML

  • Script that downloads articles and creates a combined HTML file: P6015. Modifications of this script were used to create books of various sizes.

Measurement

  • Puppeteer downloads a version of Chromium for internal use. Initially I had a hard time getting puppeteer working locally as it was complaining about sandboxing issues, so I used the version of Chromium that it ships with to render PDFs manually. After the measurements, I gave puppeteer another try and got it working by passing a few extra flags. Here is the script I used:
const puppeteer = require('puppeteer');

(async () => {
	const browser = await puppeteer.launch({args: [ '--headless', '--disable-gpu', '--no-sandbox', '--disable-setuid-sandbox']});
	const page = await browser.newPage();
	await page.goto('http://mediawiki_local/w/capitals.html', {waitUntil: 'networkidle'});
	await page.pdf({path: 'capitals.pdf', format: 'A4'});

	browser.close();
})();

I also ran a couple of smaller tests, and the \time results were comparable to what I got from running headless Chromium directly.

  • The /usr/bin/time command has been used to measure time spent, CPU usage, and memory usage. The command looks something like this:
\time -f "Elapsed real time: %E\nCPU percentage used: %P\nMaximum resident set size (RSS in Kbytes): %M" node_modules/puppeteer/.local-chromium/linux-497674/chrome-linux/chrome --headless --disable-gpu --no-sandbox --print-to-pdf http://mediawiki.local/w/book.html
  • Each test has been carried out 3 times.
  • /usr/bin/time version:
$ \time --version
GNU time 1.7

Although the 'CPU percentage used' reported by \time is a low number, a subprocess spawned by the main process always occupies more than 90% of CPU time, and in most cases even 100%. This has been verified with the top command. One consequence is that if multiple renders occur at the same time, the times reported below will increase; they were measured while only one render process was running.
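Each test was run three times and the averages below were taken over those runs. A small helper like the following (hypothetical, not part of the original scripts) shows the arithmetic, including parsing \time's %E elapsed format:

```javascript
// Parse \time's %E output ("[h:]m:ss.ss") into seconds.
function elapsedToSeconds(elapsed) {
	// e.g. "0:02.02" -> 2.02, "1:36:37" -> 5797
	const parts = elapsed.split(':').map(Number);
	return parts.reduce((total, part) => total * 60 + part, 0);
}

// Average a list of %E strings, in seconds.
function averageSeconds(runs) {
	return runs.map(elapsedToSeconds).reduce((a, b) => a + b, 0) / runs.length;
}

// e.g. averageSeconds(['0:04.73', '0:04.63', '0:04.50']) -> 4.62
```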

Results

(size in parentheses)

**Average**
Elapsed real time: 0:02
CPU percentage used: 17%
Maximum resident set size (in Mbytes): 88

**Detailed**
Elapsed real time: 0:02.02
CPU percentage used: 17%
Maximum resident set size (in Kbytes): 88628

Elapsed real time: 0:02.00
CPU percentage used: 17%
Maximum resident set size (in Kbytes): 88912

Elapsed real time: 0:02.02
CPU percentage used: 17%
Maximum resident set size (in Kbytes): 87456
**Average (rounded)**
Elapsed real time: 0:05
CPU percentage used: 12%
Maximum resident set size (in Mbytes): 111

**Detailed**
Elapsed real time: 0:04.73
CPU percentage used: 13%
Maximum resident set size (in Kbytes): 110968

Elapsed real time: 0:04.63
CPU percentage used: 12%
Maximum resident set size (in Kbytes): 110536

Elapsed real time: 0:04.50
CPU percentage used: 12%
Maximum resident set size (in Kbytes): 111320
  • 157 page PDF (21M; couldn't upload as Phabricator has a 10M limit)
**Average**
Elapsed real time: 0:33
CPU percentage used: 8%
Maximum resident set size (in Mbytes): 277

**Detailed**
Elapsed real time: 0:31.30
CPU percentage used: 8%
Maximum resident set size (in Kbytes): 276964

Elapsed real time: 0:30.51
CPU percentage used: 7%
Maximum resident set size (in Kbytes): 277468

Elapsed real time: 0:33.85
CPU percentage used: 8%
Maximum resident set size (in Kbytes): 277844

**Unrelated**
I also wanted to see how running 3 jobs at the same time would affect the running time. Here are the results. In the cases above, executing each job sequentially would take about 1 min 35 secs; here, all 3 jobs running at the same time finished slightly later.
Elapsed real time: 1:44.29
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 275132

Elapsed real time: 1:43.80
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 275320

Elapsed real time: 1:42.82
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 274848
  • 504 page PDF (67M)
**Average**
Elapsed real time: 1:51
CPU percentage used: 6%
Maximum resident set size (in Mbytes): 732

**Detailed**
Elapsed real time: 1:50.02
CPU percentage used: 6%
Maximum resident set size (in Kbytes): 732332

Elapsed real time: 1:51.88
CPU percentage used: 6%
Maximum resident set size (in Kbytes): 730944

Elapsed real time: 1:50.62
CPU percentage used: 6%
Maximum resident set size (in Kbytes): 731472
  • 995 page PDF (128M)
**Average**
Elapsed real time: 5:10
CPU percentage used: 4%
Maximum resident set size (in Mbytes): 1310

**Detailed**
Elapsed real time: 5:02.61
CPU percentage used: 4%
Maximum resident set size (in Kbytes): 1310288

Elapsed real time: 5:09.63
CPU percentage used: 4%
Maximum resident set size (in Kbytes): 1310420

Elapsed real time: 5:19.08
CPU percentage used: 4%
Maximum resident set size (in Kbytes): 1309904
  • 2419 page PDF (192M)
**Average**
Elapsed real time: 24:45
CPU percentage used: 2%
Maximum resident set size (in Mbytes): 1902

**Detailed**
Elapsed real time: 30:53.04
CPU percentage used: 2%
Maximum resident set size (in Kbytes): 1902136

Elapsed real time: 22:07.00
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 1902464

Elapsed real time: 21:17.78
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 1902696
  • 4837 page PDF (269M)
**Average**
Elapsed real time: 1:36:37
CPU percentage used: 1%
Maximum resident set size (in Mbytes): 2613

**Detailed**
Elapsed real time: 1:49:50
CPU percentage used: 1%
Maximum resident set size (in Kbytes): 2612624

Elapsed real time: 1:17:13
CPU percentage used: 1%
Maximum resident set size (in Kbytes): 2613088

Elapsed real time: 1:44:53
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 2612392
  • 10,000 page PDF

Haven't attempted rendering a PDF this big because (a) it takes a lot of time; (b) it seems unlikely to fail, given how much memory the ~5,000 page PDF used; and (c) it's unlikely someone will render a book this big, because that would mean adding roughly 1,000 articles to it.

Conclusions

Headless Chromium is able to render books with thousands of pages; it just takes a long time. We'll need machines with good CPU power if we want to speed up the render time. We'll also have to work with a designer to change the UI (for example, by limiting the number of articles one can add to a book) and redo the back-end so that PDFs are rendered as queued jobs and the user is notified via email or an Echo notification when their PDF is ready.
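As a sketch of the queued-jobs idea, a minimal serial queue might look like this (all names here are illustrative, not an existing API; the notify callback stands in for an email or Echo notification):

```javascript
// Minimal in-process FIFO render queue: one render at a time, with a
// per-job notification callback fired when the PDF is ready.
class RenderQueue {
	constructor(renderFn) {
		this.renderFn = renderFn; // async (url) => path of the finished PDF
		this.queue = [];
		this.busy = false;
	}

	enqueue(url, notify) {
		this.queue.push({url, notify});
		this.drain(); // fire-and-forget; drain() is a no-op if already running
	}

	async drain() {
		if (this.busy) return;
		this.busy = true;
		while (this.queue.length > 0) {
			const {url, notify} = this.queue.shift();
			const path = await this.renderFn(url); // e.g. a puppeteer render
			notify(path); // e.g. send an email / Echo notification
		}
		this.busy = false;
	}
}
```

A production version would also need error handling and persistence, but the point is that renders never overlap, avoiding the slowdown seen in the concurrent test above.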

Measurement

  • Puppeteer downloads a version of Chromium for internal use. I had a hard time getting puppeteer working locally as it was complaining about sandboxing issues, so I used the version of Chromium that it ships with to render PDFs manually.

Do you have any notes around this? If we're planning on switching out Electron for headless Chromium, then the more foreknowledge we have, the better.

Updated the initial comment.

bmansurov removed bmansurov as the assignee of this task. · Sep 19 2017, 8:16 PM

Conclusions

Headless Chromium is able to render books with thousands of pages; it just takes a long time. We'll need machines with good CPU power if we want to speed up the render time. We'll also have to work with a designer to change the UI (for example, by limiting the number of articles one can add to a book) and redo the back-end so that PDFs are rendered as queued jobs and the user is notified via email or an Echo notification when their PDF is ready.

Not too bad in terms of results. I would say our next step would be to meet and discuss all the options ^

Closing this for now. Next steps for post-processing and headless Chromium to be tracked in T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies.

Not too bad in terms of results. I would say our next step would be to meet and discuss all the options ^

Could you add a brief summary of this meeting for clarity/completeness?

Next steps on headless Chromium:

ovasileva closed this task as Resolved. · Sep 22 2017, 10:20 AM
ovasileva claimed this task.

Cool benchmarks!

I'm just passing by, but wanted to drop a couple of notes (sorry if they are obvious):

  • I believe we should assume that the browser is already launched and there is only one browser per process; it is pages that we create → render → destroy per request.
    • Maybe we should then measure from before browser.newPage() until after await page.pdf() using something like console.time. Not sure if it will make any difference, but it may be interesting to try.
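A sketch of that suggestion might look like this (hypothetical; `renderPdf` and the usage names are illustrative, and it assumes a single long-lived puppeteer browser per process):

```javascript
// Reuse one launched browser; per request, create -> render -> destroy a
// page, timing only the per-page span (console.time would work similarly).
async function renderPdf(browser, url, path) {
	const start = Date.now();
	const page = await browser.newPage();
	await page.goto(url, {waitUntil: 'networkidle'});
	await page.pdf({path, format: 'A4'});
	await page.close(); // destroy the page; the browser stays up
	return Date.now() - start; // per-render wall time in ms
}

// Usage sketch, with the browser launched once at service startup:
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(
//       {args: ['--no-sandbox', '--disable-setuid-sandbox']});
//   const ms = await renderPdf(browser,
//       'http://mediawiki_local/w/capitals.html', 'capitals.pdf');
```

This would separate the one-time browser startup cost from the per-render cost that the measurements above lump together.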

Looking forward to more tests/benchmarks, I think this is a very interesting topic.