
Add code snippets with integrity hashes to CDNJS interface
Open, Medium, Public

Description

(Prompted by discussion at the blog post "Toolforge provides proxied mirrors of cdnjs and now fontcdn, for your usage and user-privacy".)

Upstream CDNJS supports (e.g.) copying and pasting full <link /> and <script /> elements for the hosted libraries, including integrity attributes.

The interface to the Toolforge CDNJS service should also provide these snippets.

The hashes can be generated with e.g.:

cat bootstrap.min.css | openssl dgst -sha384 -binary | openssl base64 -A

And displayed like (linebreaks added here for easier reading):

<link rel="stylesheet"
      href="https://tools-static.wmflabs.org/cdnjs/ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css"
      integrity="sha384-1q8mTJOASx8j1Au+a5WDVnPi2lkFfwwEAa8hDDdjZlpLegxhjVME1fgjWPGmkzs7"
      crossorigin="anonymous" />

Event Timeline

In J65#654, @bd808 wrote:

@Samwilson this sounds like something worth opening a task for. We don't run the same frontend or build script as the upstream, so what needs to be done here is to update our cdnjs-packages-gen script to compute the hashes when we update the repo mirror and then update rTCJS labs/tools/cdnjs-index to display the hashes somehow. We don't have a specific Phab project for cdnjs, so just tag it with Tools.

The current frontend (rTCJS) is pretty bare bones. It's a single static page that is generated from the combined json manifest that cdnjs-packages-gen creates following each git pull of the upstream repo. Generating the sha384 hashes of each artifact will take some time, but shouldn't be too horrible to add into the cdnjs-packages-gen script so that they are available in some file for use from the UI. Upstream does this by making a parallel directory tree that follows the layout of the libs directory with one json file per version for each library.
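As a rough illustration, hash generation during the mirror update could look something like the shell sketch below. The mirror root, the file selection, and the parallel-tree output layout are assumptions for illustration only, not the actual cdnjs-packages-gen behaviour:

# Walk the mirrored libs tree and write one SRI hash file per artifact
# into a parallel "sri/" tree that mirrors the upstream layout.
cd /srv/cdnjs  # assumed location of the repo mirror
find ajax/libs -type f \( -name '*.js' -o -name '*.css' \) | while read -r f; do
    hash=$(openssl dgst -sha384 -binary "$f" | openssl base64 -A)
    mkdir -p "sri/$(dirname "$f")"
    printf 'sha384-%s\n' "$hash" > "sri/${f}.sri"
done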

The real trick will be figuring out how we can nicely display those hashes for people to use them. I have personally been using the upstream UI to get the proper hash. That works, but is obviously not very end-user friendly.

Could we just scrape the upstream page and extract the hashes or whole snippets from there? Or is that just a bit too hacky? :-)

Could we just scrape the upstream page and extract the hashes or whole snippets from there? Or is that just a bit too hacky? :-)

The hard part isn't generating the hashes, it's creating a reasonable UI to display them. The current UI is really just a pretty splash page in front of nginx-generated directory listings. This is going to need to be rethought to include data that is not present in the upstream repo. Basically someone is going to have to adopt this project and rethink the entire browsing interface. That might mean replacing the current UI with custom code or figuring out how to fork the upstream UI and make the appropriate branding and content changes.

Is there a reason we have a full clone/mirror of cdnjs? Did we consider using a proxy like we do for the fonts?

Dunno why, but the relevant ticket is T96799

I think (might be wrong) yuvi said something about serving from local disk being super-fast compared to /static serving from NFS

Is there a reason we have a full clone/mirror of cdnjs? Did we consider using a proxy like we do for the fonts?

Hosting it ourselves seems nicer than leeching off of the upstream. I would have suggested local hosting for the fonts too if there was a readily available git repo or other means to mirror the content.

Since we now actually do proxy the source instead of hosting the repo, should we close this @bd808 and @Samwilson?

Since we now actually do proxy the source instead of hosting the repo, should we close this @bd808 and @Samwilson?

We are still generating our own splash page. Unless we replace that, I think the basic issue that @Samwilson is raising here remains. It's not that you can't go to the upstream to find the hashes, but if you just visit https://tools.wmflabs.org/cdnjs/ we do not provide them. As I recall the hashes are actually not part of the https://cdnjs.com/api output, so I'm not sure we actually can include them with the current setup. I think to get them we would actually have to do something like scrape the actual cdnjs UI or set up a copy of https://github.com/mozilla/srihash.org to generate them on the fly for folks.
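Mechanically, on-the-fly generation would just be the openssl pipeline from the task description applied to a fetched file; a minimal sketch, using the Bootstrap URL from the description as the example:

# Fetch an artifact and emit the integrity attribute for it; the
# "sha384-" prefix is required by the integrity attribute syntax.
url='https://tools-static.wmflabs.org/cdnjs/ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css'
printf 'integrity="sha384-%s"\n' \
    "$(curl -fsSL "$url" | openssl dgst -sha384 -binary | openssl base64 -A)"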

I'm actually not sure that the latter option of local hash generation would really ensure any integrity. As a corollary, I'm not sure that the hashes provided by cdnjs directly provide much integrity either, unless you actually code review the script that you download after verifying that it matches the hash. The hash does ensure that the file is repeatedly the same as the version that was hashed, but it doesn't magically ensure that nobody has tampered with a trusted original until you actually establish trust of the original. This is kind of the chicken-and-egg problem of distributed trust.

I'm actually not sure that the latter option of local hash generation would really ensure any integrity. […]

Integrity hashes are supposed to prevent a CDN compromise[1] from pwning your website. I think they're still useful even if you don't fully audit the source code (mostly impossible with minification), because you can audit the functionality (check external network requests, etc.). And since all the URLs are versioned, the hashes should never ever change.

Maybe there's another cdnjs mirror that we can verify hashes against to identify compromise of the main cdnjs instance that we proxy?
[1] https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity#How_Subresource_Integrity_helps
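As a sketch of what such cross-checking could look like, here is a comparison of our proxy against the upstream CDN (assuming cdnjs.cloudflare.com as the upstream host; any independent mirror would do in its place):

# Compare the sha384 digest of the same artifact from the upstream CDN
# and from our proxy; a mismatch would signal tampering somewhere.
path='ajax/libs/twitter-bootstrap/3.3.6/css/bootstrap.min.css'
a=$(curl -fsSL "https://cdnjs.cloudflare.com/$path" | openssl dgst -sha384 -binary | openssl base64 -A)
b=$(curl -fsSL "https://tools-static.wmflabs.org/cdnjs/$path" | openssl dgst -sha384 -binary | openssl base64 -A)
[ "$a" = "$b" ] && echo "digests match" || echo "MISMATCH: $a vs $b"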

This is a humbug answer, but ... I understand how the hashes work technically on the browser side. But since the published hashes at cdnjs are generated by them from the files that they mirror by submodule inclusion from a wide variety of upstream git repos, I don't see how they establish end-to-end trust between the original code maintainer and the eventual downstream consumer. If the hashes were calculated by the original maintainer and published in some way that the integrity of that published list could be verified, then there would be an end-to-end contract.

The current state of the cdnjs-provided hashes is really questionable for end-to-end security, except, as you say, to validate that the file fetched from cdnjs from day to day has not changed. This guarantee is weak, as it provides no validation of the content of the source file at the time that the hash was computed by a third party (cdnjs) and then pasted into the HTML source of an embedding page. If the file was pwned prior to the day you copied the hash from their UI, you are just verifying that your visitors continue to download and execute the pwned file.

If cdnjs themselves published the integrity hashes in a verifiable way, you could gain a small bit more trust that the files you are downloading met whatever process they use for ensuring integrity of their initial mirrors. This still, however, would only verify that additional tampering did not occur after the point at which they computed and published that hash.

Bstorm triaged this task as Medium priority. Feb 11 2020, 5:24 PM

This is a humbug answer, but ... I understand how the hashes work technically on the browser side. […]

Personally, “the file fetched from cdnjs from day to day has not changed” is all I would expect from such a feature. I doubt there’s much more that we can realistically verify; for “whatever process they use for ensuring integrity of their initial mirrors”, I expect there’s not much more than “we connect to npmjs.com via HTTPS” for most libraries. (Some libraries are configured to auto-update from git://github.com/ URLs instead of npm packages, which probably means their auto-update has been broken since GitHub disabled unauthenticated git://.)

But the SRI hashes are part of the cdnjs/logs repository (example), and because each version gets a new directory, the commits in that repository should only ever add new files, never modify them. So I think we could maintain a clone of that repository, check that any pull is a fast-forward (no upstream history rewrites), and check that there are no file changes other than added files (git log --diff-filter=a should return no commits). That ought to establish that cdnjs aren’t changing the SRI hashes, for relatively little disk space compared to copying all the files themselves (a local clone of the logs repo weighs in at 7.0 GB / 6.5 GiB at the moment).
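A minimal sketch of that check, assuming a local clone of cdnjs/logs (the clone path and the branch name are placeholders):

cd /srv/cdnjs-logs          # assumed clone location
git fetch origin
# --ff-only aborts instead of merging if upstream rewrote history
git merge --ff-only origin/master
# a lowercase letter in --diff-filter EXCLUDES that change type, so this
# lists only commits that did something other than add files; it should
# print nothing if the repository really is append-only
git log --diff-filter=a --oneline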

So I think we could maintain a clone of that repository

We used to work from a clone of the repo, but that was removed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/413197 as we were having a lot of issues keeping the clone working (T182604: tools-static is throwing space warnings due to cdnjs git repo size).

The SRI hashes are exposed on the upstream UI and can be used via our transparent proxy. The UI built by https://gerrit.wikimedia.org/r/plugins/gitiles/labs/tools/cdnjs-index/ could probably be updated to show hashes retrieved from the upstream API if someone can figure out a reasonable way to display them. We are already fetching that data during the build (https://api.cdnjs.com/libraries/codex?fields=assets); we just are not using it at the moment.
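For reference, pulling the hashes out of that API response might look roughly like this; the exact shape of the assets payload (in particular the per-version sri map assumed here) should be checked against a real response first:

# List version, file name, and SRI hash for one library from the
# upstream API; the ".sri" field per asset is an assumption.
curl -fsSL 'https://api.cdnjs.com/libraries/codex?fields=assets' |
    jq -r '.assets[] | .version as $v | (.sri // {}) | to_entries[]
           | "\($v)\t\(.key)\t\(.value)"'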

But that was a clone of cdnjs/cdnjs, right? cdnjs/logs is much smaller.

But that was a clone of cdnjs/cdnjs, right? cdnjs/logs is much smaller.

Yes, the repo that grew too big for us to clone easily was cdnjs/cdnjs. Sorry for the confusion I injected there.

No problem :) I was under the impression the discussion here had stalled over the “quality of the hashes” point, but maybe that’s not the case in the first place. (I might look into how to implement this later, but for now I’m looking at T342519 first.)