
How should we get Chromium for use in puppeteer?
Closed, Resolved. Public

Description

The chromium-render service depends on puppeteer, which in turn depends on having a Chromium binary available for use. Unless told otherwise, puppeteer downloads the Chromium version it needs when it is installed. We can tell puppeteer not to download Chromium and to use some other binary instead, perhaps the one that comes with the distribution. Some concerns about using the bundled Chromium in the service were raised in T178189#3692824, so we looked into using a version of Chromium that comes with the distribution.

However, the puppeteer documentation warns against using versions of Chromium that don't come with puppeteer:

NOTE Puppeteer works best with the version of Chromium it is bundled with. There is no guarantee it will work with any other version. Use executablePath option with extreme caution. If Google Chrome (rather than Chromium) is preferred, a Chrome Canary or Dev Channel build is suggested.

Source: https://github.com/GoogleChrome/puppeteer/blob/v0.11.0/docs/api.md#puppeteerlaunchoptions
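For illustration, here is a minimal sketch of how the executablePath option from the note above might be passed to puppeteer.launch(). The path /usr/bin/chromium and the helper names buildLaunchOptions and launchBrowser are assumptions made for this example; they are not taken from the service's code.

```javascript
// Assumed location of the distribution's Chromium binary (hypothetical;
// adjust to wherever the package actually installs it).
const DISTRO_CHROMIUM = '/usr/bin/chromium';

// Build the options object handed to puppeteer.launch().
// When no path is given, puppeteer falls back to its bundled Chromium.
function buildLaunchOptions(executablePath) {
  const options = { args: [] };
  if (executablePath) {
    // executablePath is the option the puppeteer docs say to use with caution.
    options.executablePath = executablePath;
  }
  return options;
}

// Launch a browser with the given binary. puppeteer is required lazily so
// that buildLaunchOptions() can be exercised without the library installed.
async function launchBrowser(executablePath) {
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions(executablePath));
}
```

Calling launchBrowser(DISTRO_CHROMIUM) would then try the packaged binary, subject to the compatibility caveat quoted above.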

I wonder whether this is a good reason to not use the Debian version of Chromium.

Also, the latest Debian Jessie has Chromium version 57.0.2987.98-1~deb8u1, and headless Chromium first appeared in version 59. Does that mean we should compile our own version of Chromium if we want to avoid puppeteer's version? Wouldn't that defeat the purpose of getting free security fixes from the Debian package maintainers?
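The version comparison above can be made mechanical. The following is only a sketch: it assumes the package version string starts with the upstream Chromium version (optionally preceded by a Debian epoch), and the names chromiumMajor and supportsHeadless are hypothetical.

```javascript
// Per the description above, headless mode first appeared in Chromium 59.
const HEADLESS_MIN_MAJOR = 59;

// Extract the Chromium major version from a Debian package version string
// such as "57.0.2987.98-1~deb8u1".
function chromiumMajor(debianVersion) {
  // Drop an optional epoch prefix like "1:" before reading the major number.
  const upstream = debianVersion.replace(/^\d+:/, '');
  const match = /^(\d+)\./.exec(upstream);
  if (!match) {
    throw new Error(`Unrecognised version string: ${debianVersion}`);
  }
  return Number(match[1]);
}

// True when the packaged Chromium is new enough to offer headless mode.
function supportsHeadless(debianVersion) {
  return chromiumMajor(debianVersion) >= HEADLESS_MIN_MAJOR;
}
```

With the Jessie version quoted above, supportsHeadless returns false, which is the gap the question is about.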

Also, I created a proof of concept patch that uses the distribution's Chromium, except the patch doesn't work and puppeteer warns against using non-bundled Chromium: https://gerrit.wikimedia.org/r/385044.

Given the above, would it make sense to stick to the version of Chromium provided by puppeteer?

If you're interested in the full context, please read T178189: [spike] Temporarily allow pushing large objects.

Event Timeline

Restricted Application added a subscriber: Aklapper. Oct 19 2017, 12:46 PM
bmansurov renamed this task from How should we get Chromium for use in puppeteer? to [subtask] How should we get Chromium for use in puppeteer?. Oct 19 2017, 7:00 PM

I think there are a few things at play here:

  • How do we distribute chromium to the servers in the cluster efficiently?
  • How do we ensure security upgrades happen in a timely manner for this component? The team maintaining it will need to set up a process for this (following puppeteer releases, and upgrade in a timely manner)
  • How do we download chromium in the first place in a verifiable way? I've checked puppeteer and it doesn't do any form of checksum of the downloaded zip file or anything, and I can't find checksums of the zip files on the chromium releases website

The reason why I first suggested to use the deb package is that debian has already solved all of the problems above for us. @MoritzMuehlenhoff what do you think? @bmansurov (or whoever else in the team): what are your plans regarding my second question?

Slight race condition here :-) I had just followed up on https://phabricator.wikimedia.org/T178189#3698691 for this

I think there are a few things at play here:

  • How do we distribute chromium to the servers in the cluster efficiently?

Ideally via a deb, then we have all the usual benefits of Secure apt and we can use our usual tooling for updates. Copying my comment from https://phabricator.wikimedia.org/T178189#3698691:

There are currently no Chromium packages for jessie. The maintainer has asked for a volunteer to build/test the stretch packages for jessie, and while there was a volunteer, this apparently hasn't happened so far:
https://lists.debian.org/debian-security/2017/08/msg00010.html

It's a fairly complex endeavor to follow Chromium over more than the usual two years of lifetime of a Debian stable release (before the next one is released), since they update build dependencies pretty quickly (e.g. on wheezy Chromium needed to be end-of-lifed in advance as well, since they started to use C++ features which were not yet supported by the GCC C++ compiler in Debian wheezy).

If necessary we could chime in for the jessie builds, though. (But the proper fix would be to run this service on stretch).

  • How do we ensure security upgrades happen in a timely manner for this component? The team maintaining it will need to set up a process for this (following puppeteer releases, and upgrade in a timely manner)

I think the canonical announcement resource for this is https://chromereleases.googleblog.com/

phuedx added a comment. Edited Oct 20 2017, 12:32 PM

Shower thought:

It seems like the use of the puppeteer library is introducing a lot of complexity around how we can deploy this service securely and efficiently - two things that we must not trade off against.

The library:

OTOH there's nothing to stop us from launching a Chromium process ourselves and using command line switches to make it save the page as a PDF: https://peter.sh/experiments/chromium-command-line-switches/#print-to-pdf (this list is linked to from https://www.chromium.org/developers/how-tos/run-chromium-with-flags).
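As a rough sketch of that alternative, the snippet below shells out to a Chromium binary with the --headless and --print-to-pdf switches. The binary path passed by the caller and the function names printToPdfArgs and printToPdf are assumptions for illustration only.

```javascript
const { execFile } = require('child_process');

// Build the command-line arguments for a one-shot "render this URL to PDF" run.
function printToPdfArgs(url, outputPath) {
  return ['--headless', `--print-to-pdf=${outputPath}`, url];
}

// Run Chromium once and resolve with the output path when the process exits
// cleanly. execFile avoids a shell, so the URL is never shell-interpreted.
function printToPdf(binary, url, outputPath) {
  return new Promise((resolve, reject) => {
    execFile(binary, printToPdfArgs(url, outputPath), (error) => {
      if (error) {
        reject(error);
      } else {
        resolve(outputPath);
      }
    });
  });
}
```

Each call starts and tears down a whole browser process, which is one of the trade-offs against driving a long-lived browser through a library.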

Before we go any further investigating how we can best support using the puppeteer library, we should first revalidate whether we should use it in light of all of this recent (both productive and enlightening!) discussion.

How do we ensure security upgrades happen in a timely manner for this component? The team maintaining it will need to set up a process for this (following puppeteer releases, and upgrade in a timely manner)

I think in the beginning Readers Web can commit to creating the necessary documentation and following it to update puppeteer in a timely manner. I guess we'll subscribe to their mailing list or something to know about new versions. Later, when Readers Infrastructure takes over the project, I hope they will continue maintaining upgrades.

OTOH there's nothing to stop us from launching a Chromium process ourselves and using command line switches to make it save the page as a PDF: https://peter.sh/experiments/chromium-command-line-switches/#print-to-pdf (this list is linked to from https://www.chromium.org/developers/how-tos/run-chromium-with-flags).

There's a warning at the link you shared:

It is important to note that using these switches is not supported or recommended. They should only be used for temporary cases and may break in the future.

Also, I think not all flags are exposed. For example, when I was playing with it, I didn't find an option to not print page headers (like timestamp and the URL) and footers.

OTOH there's nothing to stop us from launching a Chromium process ourselves and using command line switches to make it save the page as a PDF: https://peter.sh/experiments/chromium-command-line-switches/#print-to-pdf (this list is linked to from https://www.chromium.org/developers/how-tos/run-chromium-with-flags).

Before we go any further investigating how we can best support using the puppeteer library, we should first revalidate whether we should use it in light of all of this recent (both productive and enlightening!) discussion.

If that's feature-equivalent to puppeteer, that sounds like the best solution to me. Also in terms of updates, that's far more light-weight; Debian would release the updates and all we'd need to do after new releases is to ensure that no regressions (or intentional changes in the headless mode) happened.

Joe added a comment. Oct 20 2017, 2:15 PM

OTOH there's nothing to stop us from launching a Chromium process ourselves and using command line switches to make it save the page as a PDF: https://peter.sh/experiments/chromium-command-line-switches/#print-to-pdf (this list is linked to from https://www.chromium.org/developers/how-tos/run-chromium-with-flags).

Before we go any further investigating how we can best support using the puppeteer library, we should first revalidate whether we should use it in light of all of this recent (both productive and enlightening!) discussion.

If that's feature-equivalent to puppeteer, that sounds like the best solution to me. Also in terms of updates, that's far more light-weight; Debian would release the updates and all we'd need to do after new releases is to ensure that no regressions (or intentional changes in the headless mode) happened.

I don't think it is, to be honest. Unless puppeteer works as a glorified shellout, which I don't think is the case from the code I read, it is valuable as it can manage better failures, preforking/managing the state of the browser, etc.

phuedx added a comment. Edited Nov 3 2017, 2:35 PM

  • How do we download chromium in the first place in a verifiable way? I've checked puppeteer and it doesn't do any form of checksum of the downloaded zip file or anything, and I can't find checksums of the zip files on the chromium releases website

I started out today with the intention of filing an upstream bug with the GoogleChrome/puppeteer library to ask about getting a checksum uploaded with each Chromium rev added to the backing Google Cloud Storage bucket. However, after some investigation, I don't think that this is required and only a minor change to the downloader component is required.

The store has a public API for retrieving base64-encoded MD5 and CRC32C checksums for any/all files. If you make a HEAD request for the file, then one or more X-Goog-Hash headers are returned, e.g.

curl --head https://storage.googleapis.com/chromium-browser-snapshots/Linux_x64/513435/chrome-linux.zip

<snip />

x-goog-hash: crc32c=zKTyig==
x-goog-hash: md5=L9qX74VF9LfFx6+W+NsSjw==

Indeed, the downloader component already makes this request but focusses solely on the HTTP status code of the response.
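A minimal sketch of what such a check could look like, assuming the X-Goog-Hash header values from the HEAD response have already been collected into an array of strings. The function names are hypothetical and this is not puppeteer's downloader code.

```javascript
const crypto = require('crypto');

// Parse X-Goog-Hash header values such as "crc32c=zKTyig==" into an object
// keyed by algorithm name. A single header value may also carry several
// comma-separated pairs.
function parseGoogHash(headerValues) {
  const hashes = {};
  for (const value of headerValues) {
    for (const part of value.split(',')) {
      const trimmed = part.trim();
      // Split on the first "=" only; base64 padding also uses "=".
      const idx = trimmed.indexOf('=');
      if (idx > 0) {
        hashes[trimmed.slice(0, idx)] = trimmed.slice(idx + 1);
      }
    }
  }
  return hashes;
}

// Base64-encoded MD5 digest, the encoding used by the md5 entry of the header.
function md5Base64(buffer) {
  return crypto.createHash('md5').update(buffer).digest('base64');
}

// Compare the downloaded bytes against the md5 entry reported by the server.
function matchesGoogMd5(buffer, headerValues) {
  const expected = parseGoogHash(headerValues).md5;
  return expected !== undefined && md5Base64(buffer) === expected;
}
```

Note that this only confirms the download matches what the same server reports, so it detects corruption rather than tampering.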

@Joe: Firstly, would verifying the file (downloaded via HTTPS) against the CRC32C/MD5 checksum fetched from the same domain be enough here? If so, then we can write a shell script to do it ourselves and/or submit a PR upstream.

Edit

Here's a link to the Google Storage docs that cover the checksums that they provide: https://cloud.google.com/storage/docs/hashes-etags

Joe added a comment. Nov 3 2017, 2:54 PM

@phuedx no, getting the headers while you are downloading would not be enough, you would need to supply your script with a non-tamperable checksum (so I'd say at least sha256) of the file, at the very least.
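To illustrate the kind of check meant here, the sketch below compares a downloaded archive against a SHA-256 digest that is pinned ahead of time (for example, committed alongside the deploy scripts) rather than fetched at download time. The function names and the idea of storing the digest as a hex string are assumptions.

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-256 digest of the given bytes.
function sha256Hex(buffer) {
  return crypto.createHash('sha256').update(buffer).digest('hex');
}

// Return true only if the bytes hash to the pinned digest.
// expectedHex must come from a trusted, out-of-band source.
function verifyPinnedSha256(buffer, expectedHex) {
  const actual = Buffer.from(sha256Hex(buffer), 'hex');
  const expected = Buffer.from(expectedHex, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return actual.length === expected.length &&
    crypto.timingSafeEqual(actual, expected);
}
```

A caller would read the archive with fs.readFileSync and refuse to unpack it when verifyPinnedSha256 returns false.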

This would be easy to do, but I'm not keen on having to download chromium every time one does a deploy on tin, even if the download is controlled and all.

So we need to find a solution to this issue: how do we distribute large binary blobs across our production cluster? I don't think anyone has solved this problem up to now.

Before someone else proposes it, git-fat via archiva is *not* a feasible option; although git-fat itself could be.

I would ask the Release-Engineering-Team for advice on how to distribute a large binary blob.

Joe added a comment. Nov 3 2017, 3:07 PM

Please note that my biggest concern here is the security one. Citing myself:

  • How do we ensure security upgrades happen in a timely manner for this component? The team maintaining it will need to set up a process for this (following puppeteer releases, and upgrade in a timely manner)

This is not a small undertaking, and it also requires puppeteer to be updated regularly as chromium vulnerabilities come to light.

Is that happening? Are we committed to constantly following their releases? What will happen to the service if the team maintaining it gets disbanded or abandons the project for some reason?

These are all problems that would be solved by using a version of chromium coming from debian itself. Are you absolutely sure that's not possible even if it means using a slightly outdated version of puppeteer? I would prefer that option over having to maintain the security of chromium ourselves.

mark added a subscriber: mark. Nov 3 2017, 3:12 PM
greg added a subscriber: greg. Nov 3 2017, 5:22 PM

I would ask the Release-Engineering-Team for advice on how to distribute a large binary blob.

That's T171758: Support git-lfs files in gerrit

This comment was removed by Krinkle.
bmansurov removed bmansurov as the assignee of this task. Nov 15 2017, 6:17 PM
bmansurov renamed this task from [subtask] How should we get Chromium for use in puppeteer? to How should we get Chromium for use in puppeteer?. Nov 15 2017, 6:19 PM
bmansurov added a project: Spike.
bmansurov removed a subscriber: bmansurov.
Krinkle removed a subscriber: Krinkle. Nov 15 2017, 8:40 PM
phuedx claimed this task. Nov 20 2017, 6:09 PM
phuedx closed this task as Resolved. Nov 21 2017, 8:49 AM

Being bold.

Later, I'll create a higher-level "Deploy the service" task that summarises the outcome of this task and T178189: [spike] Temporarily allow pushing large objects.

^ For context:

The conversation around this problem forked between this task and T178189: [spike] Temporarily allow pushing large objects. In the latter, we (Readers Web) and Ops investigated and concluded that puppeteer can drive a packaged version of Chromium on Debian Stretch.

The full investigation and notes are captured in T180037: [Spike] Can the new render service run on Debian Stretch?.