
[Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books
Closed, Resolved · Public

Description

We want to see if headless Chromium is a viable solution for rendering books of various sizes.

Generate PDFs of books with the following number of pages (approximately):

  • 10
  • 25
  • 100
  • 1,000
  • 2,500
  • 5,000
  • 10,000

And measure the following for each book:

  • CPU usage;
  • Memory usage;
  • Time spent.

Make sure to render each type of book multiple times and take the average of measurements.

A/C

  • Settle on a Python wrapper for headless Chromium and document the decision;
  • Create a simple Python service that takes the URL of a page and generates a PDF from the contents of that URL; share the source code in this task!
  • Upload the measurements from above; don't forget to include your environment setup, too.
  • Upload the resulting PDFs here.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Sep 13 2017, 6:37 PM

I assume you mean Chromium? Google Chrome isn't free software.

Yep, will fix it.

bmansurov renamed this task from Investigate the ability of Python wrapped headless Chrome to render large books to Investigate the ability of Python wrapped headless Chromium to render large books. · Sep 13 2017, 6:39 PM
bmansurov updated the task description. (Show Details)
bmansurov added subscribers: ovasileva, phuedx.

@bmansurov: Is the 2500 number grounded in anything? If not, I suggest increasing the scope of the spike to include profiling CPU utilisation, memory consumption, and wall time as a function of book size.

No; as I remember, the number came up during our meeting earlier. Your suggestion sounds good.

bmansurov updated the task description. (Show Details) · Sep 13 2017, 6:47 PM
bmansurov updated the task description. (Show Details) · Sep 13 2017, 6:53 PM
ovasileva renamed this task from Investigate the ability of Python wrapped headless Chromium to render large books to [Spike] Investigate the ability of Python wrapped headless Chromium to render large books. · Sep 13 2017, 6:54 PM
ovasileva triaged this task as High priority.
ovasileva added projects: Spike, Proton.
ovasileva moved this task from Triage to Current Sprint on the Proton board. · Sep 13 2017, 7:08 PM
ovasileva renamed this task from [Spike] Investigate the ability of Python wrapped headless Chromium to render large books to [Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books. · Sep 14 2017, 5:15 PM
bmansurov updated the task description. (Show Details) · Sep 14 2017, 6:00 PM

There is an official Node API for headless Chromium [1] called puppeteer, and
there is an unofficial Python port [2] called pyppeteer. The Python port
seems to be actively developed. However, its printing capability looks
unfinished [3] and doesn't work (it times out with RESTBase URLs).

The printing options such as disabling the automatic header and footer are not
exposed to the command line either [4].

Our only option seems to be to use puppeteer (i.e. the Node version), which means
that in addition to creating a service for headless Chromium, we'll have to
create a new service for PDF post-processing in Python. @phuedx, @ovasileva,
what do you think? Should we change this task to use Node.js?

[1] https://github.com/GoogleChrome/puppeteer
[2] https://github.com/miyakogi/pyppeteer
[3] https://miyakogi.github.io/pyppeteer/_modules/pyppeteer/page.html#Page.pdf
[4] https://bugs.chromium.org/p/chromium/issues/detail?id=603559

bmansurov added a comment. · Edited · Sep 18 2017, 9:12 PM

Setup

System

$ uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u3 (2017-08-15) x86_64 GNU/Linux

CPU

$ cat /proc/cpuinfo | grep "MHz\|cores"
cpu MHz		: 2807.948
cpu cores	: 1

Memory

$ free -m
             total       used       free     shared    buffers     cached
Mem:         12043        246      11797          8          9        152
-/+ buffers/cache:         84      11959
Swap:          382          0        382

HTML

  • Script that downloads articles and creates a combined HTML file: P6015. Modifications of this script were used to create books of various sizes.

Measurement

  • Puppeteer downloads a version of Chromium for internal use. Initially I had a hard time getting puppeteer working locally as it was complaining about sandboxing issues, so I used the version of Chromium that it ships with to render PDFs manually. After the measurements, I gave puppeteer another try and got it working by passing a few extra flags. Here is the script I used:
const puppeteer = require('puppeteer');

(async () => {
	const browser = await puppeteer.launch({args: [ '--headless', '--disable-gpu', '--no-sandbox', '--disable-setuid-sandbox']});
	const page = await browser.newPage();
	await page.goto('http://mediawiki_local/w/capitals.html', {waitUntil: 'networkidle'});
	await page.pdf({path: 'capitals.pdf', format: 'A4'});

	browser.close();
})();

I also ran a couple of smaller tests, and the \time results were comparable to what I got from running headless Chromium directly.

  • The /usr/bin/time command has been used to measure time spent, CPU usage, and memory usage. The command looks something like this:
\time -f "Elapsed real time: %E\nCPU percentage used: %P\nMaximum resident set size (RSS in Kbytes): %M" node_modules/puppeteer/.local-chromium/linux-497674/chrome-linux/chrome --headless --disable-gpu --no-sandbox --print-to-pdf http://mediawiki.local/w/book.html
  • Each test has been carried out 3 times.
  • /usr/bin/time version:
$ \time --version
GNU time 1.7

Although the 'CPU percentage used' reported by \time is a low number, a subprocess spawned by the main process always occupies more than 90% of CPU time, and in most cases even 100%. This has been verified with the top command. One consequence is that if multiple renders occur at the same time, the times reported below will increase; they were measured while only one render process was running.
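Each test was run three times and the averages below were taken over those runs. A small helper like the following (hypothetical, not part of the original scripts) shows the arithmetic, including parsing \time's %E elapsed format:

```javascript
// Parse \time's %E output ("[h:]m:ss.ss") into seconds.
function elapsedToSeconds(elapsed) {
	// e.g. "0:02.02" -> 2.02, "1:36:37" -> 5797
	const parts = elapsed.split(':').map(Number);
	return parts.reduce((total, part) => total * 60 + part, 0);
}

// Average a list of %E strings, in seconds.
function averageSeconds(runs) {
	return runs.map(elapsedToSeconds).reduce((a, b) => a + b, 0) / runs.length;
}

// e.g. averageSeconds(['0:04.73', '0:04.63', '0:04.50']) -> 4.62
```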

Results

(size in parentheses)

**Average**
Elapsed real time: 0:02
CPU percentage used: 17%
Maximum resident set size (in Mbytes): 88

**Detailed**
Elapsed real time: 0:02.02
CPU percentage used: 17%
Maximum resident set size (in Kbytes): 88628

Elapsed real time: 0:02.00
CPU percentage used: 17%
Maximum resident set size (in Kbytes): 88912

Elapsed real time: 0:02.02
CPU percentage used: 17%
Maximum resident set size (in Kbytes): 87456
**Average (rounded)**
Elapsed real time: 0:05
CPU percentage used: 12%
Maximum resident set size (in Mbytes): 111

**Detailed**
Elapsed real time: 0:04.73
CPU percentage used: 13%
Maximum resident set size (in Kbytes): 110968

Elapsed real time: 0:04.63
CPU percentage used: 12%
Maximum resident set size (in Kbytes): 110536

Elapsed real time: 0:04.50
CPU percentage used: 12%
Maximum resident set size (in Kbytes): 111320
  • 157 page PDF (21M; couldn't upload as Phabricator has a 10M limit)
**Average**
Elapsed real time: 0:33
CPU percentage used: 8%
Maximum resident set size (in Mbytes): 277

**Detailed**
Elapsed real time: 0:31.30
CPU percentage used: 8%
Maximum resident set size (in Kbytes): 276964

Elapsed real time: 0:30.51
CPU percentage used: 7%
Maximum resident set size (in Kbytes): 277468

Elapsed real time: 0:33.85
CPU percentage used: 8%
Maximum resident set size (in Kbytes): 277844

**Unrelated**
I also wanted to see how running 3 jobs at the same time would affect the running time. Here are the results. In the cases above, executing each job sequentially would take about 1 min 35 secs; here, all 3 jobs running at the same time finished slightly later.
Elapsed real time: 1:44.29
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 275132

Elapsed real time: 1:43.80
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 275320

Elapsed real time: 1:42.82
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 274848
  • 504 page PDF (67M)
**Average**
Elapsed real time: 1:51
CPU percentage used: 6%
Maximum resident set size (in Mbytes): 732

**Detailed**
Elapsed real time: 1:50.02
CPU percentage used: 6%
Maximum resident set size (in Kbytes): 732332

Elapsed real time: 1:51.88
CPU percentage used: 6%
Maximum resident set size (in Kbytes): 730944

Elapsed real time: 1:50.62
CPU percentage used: 6%
Maximum resident set size (in Kbytes): 731472
  • 995 page PDF (128M)
**Average**
Elapsed real time: 5:10
CPU percentage used: 4%
Maximum resident set size (in Mbytes): 1310

**Detailed**
Elapsed real time: 5:02.61
CPU percentage used: 4%
Maximum resident set size (in Kbytes): 1310288

Elapsed real time: 5:09.63
CPU percentage used: 4%
Maximum resident set size (in Kbytes): 1310420

Elapsed real time: 5:19.08
CPU percentage used: 4%
Maximum resident set size (in Kbytes): 1309904
  • 2419 page PDF (192M)
**Average**
Elapsed real time: 24:45
CPU percentage used: 2%
Maximum resident set size (in Mbytes): 1902

**Detailed**
Elapsed real time: 30:53.04
CPU percentage used: 2%
Maximum resident set size (in Kbytes): 1902136

Elapsed real time: 22:07.00
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 1902464

Elapsed real time: 21:17.78
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 1902696
  • 4837 page PDF (269M)
**Average**
Elapsed real time: 1:36:37
CPU percentage used: 1%
Maximum resident set size (in Mbytes): 2613

**Detailed**
Elapsed real time: 1:49:50
CPU percentage used: 1%
Maximum resident set size (in Kbytes): 2612624

Elapsed real time: 1:17:13
CPU percentage used: 1%
Maximum resident set size (in Kbytes): 2613088

Elapsed real time: 1:44:53
CPU percentage used: 2%
Maximum resident set size (RSS in Kbytes): 2612392
  • 10,000 page PDF

Haven't attempted rendering a PDF this big because (a) it takes a lot of time; (b) it seems unlikely to fail, given how much memory the ~5,000 page PDF used; and (c) it's unlikely someone will render a book this big, because that would mean adding roughly 1,000 articles to it.

Conclusions

Headless Chromium is able to render books with thousands of pages; it just takes a long time. We'll need machines with good CPU power if we want to speed up the render time. We'll also have to work with a designer to change the UI (for example, by limiting the number of articles one can add to a book) and redo the back-end so that PDFs are rendered as queued jobs and the user is notified via email or an Echo notification when their PDF is ready.
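As a sketch of the queued-jobs idea, a minimal serial queue might look like this (all names here are illustrative, not an existing API; the notify callback stands in for an email or Echo notification):

```javascript
// Minimal in-process FIFO render queue: one render at a time, with a
// per-job notification callback fired when the PDF is ready.
class RenderQueue {
	constructor(renderFn) {
		this.renderFn = renderFn; // async (url) => path of the finished PDF
		this.queue = [];
		this.busy = false;
	}

	enqueue(url, notify) {
		this.queue.push({url, notify});
		this.drain(); // fire-and-forget; drain() is a no-op if already running
	}

	async drain() {
		if (this.busy) return;
		this.busy = true;
		while (this.queue.length > 0) {
			const {url, notify} = this.queue.shift();
			const path = await this.renderFn(url); // e.g. a puppeteer render
			notify(path); // e.g. send an email / Echo notification
		}
		this.busy = false;
	}
}
```

A production version would also need error handling and persistence, but the point is that renders never overlap, avoiding the slowdown seen in the concurrent test above.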

Measurement

  • Puppeteer downloads a version of Chromium for internal use. I had a hard time getting puppeteer working locally as it was complaining about sandboxing issues, so I used the version of Chromium that it ships with to render PDFs manually.

Do you have any notes around this? If we're planning on switching out Electron for headless Chromium, then the more foreknowledge we have, the better.

Updated the initial comment.

bmansurov removed bmansurov as the assignee of this task. · Sep 19 2017, 8:16 PM

Conclusions

Headless Chromium is able to render books with thousands of pages; it just takes a long time. We'll need machines with good CPU power if we want to speed up the render time. We'll also have to work with a designer to change the UI (for example, by limiting the number of articles one can add to a book) and redo the back-end so that PDFs are rendered as queued jobs and the user is notified via email or an Echo notification when their PDF is ready.

Not too bad in terms of results. I would say our next step would be to meet and discuss all the options ^

Closing this for now. Next steps for post-processing and headless Chromium to be tracked in T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies.

Not too bad in terms of results. I would say our next step would be to meet and discuss all the options ^

Could you add a brief summary of this meeting for clarity/completeness?

Next steps on headless Chromium:

ovasileva closed this task as Resolved. · Sep 22 2017, 10:20 AM
ovasileva claimed this task.

Cool benchmarks!

I'm just passing by, but wanted to drop a couple of notes (sorry if they are obvious):

  • I believe we should assume that the browser is already launched and there is only one browser per process; it is pages that we create → render → destroy per request.
    • Maybe we should then measure from before browser.newPage() until after await page.pdf() using something like console.time. Not sure if it will make any difference, but it may be interesting to try.
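A sketch of that suggestion might look like this (hypothetical; `renderPdf` and the usage names are illustrative, and it assumes a single long-lived puppeteer browser per process):

```javascript
// Reuse one launched browser; per request, create -> render -> destroy a
// page, timing only the per-page span (console.time would work similarly).
async function renderPdf(browser, url, path) {
	const start = Date.now();
	const page = await browser.newPage();
	await page.goto(url, {waitUntil: 'networkidle'});
	await page.pdf({path, format: 'A4'});
	await page.close(); // destroy the page; the browser stays up
	return Date.now() - start; // per-render wall time in ms
}

// Usage sketch, with the browser launched once at service startup:
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(
//       {args: ['--no-sandbox', '--disable-setuid-sandbox']});
//   const ms = await renderPdf(browser,
//       'http://mediawiki_local/w/capitals.html', 'capitals.pdf');
```

This would separate the one-time browser startup cost from the per-render cost that the measurements above lump together.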

Looking forward to more tests/benchmarks, I think this is a very interesting topic.