implement Public Key Pinning (HPKP) for Wikimedia domains
Closed, Declined (Public)

Description

Send the Public Key Pinning (HPKP) header for Wikimedia domains (Firefox and Chrome support it now). I suppose we would want to pin the CAs we use rather than the leaf keys, and not include subdomains at first. Implementing a URI to receive failure reports would be nice. Start with a domain that is not widely used.

The important question that needs to be answered first: are we able to know, in advance, which CAs we will be using in the next 6 months? (Or, if we pin our leaf keys, can we manage to create private keys 6 months in advance?)

https://en.wikipedia.org/wiki/HTTP_Public_Key_Pinning
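For illustration, here is a minimal sketch of how the header value described above could be assembled. The directive names come from RFC 7469; the function name, the pin strings and the report URI are placeholders invented for this example, not real Wikimedia values.

```python
def build_hpkp_header(pins, max_age, report_uri=None,
                      include_subdomains=False, report_only=False):
    """Return (header_name, header_value) for an HPKP policy.

    `pins` are base64-encoded SHA-256 hashes of SubjectPublicKeyInfo
    structures. RFC 7469 expects at least one backup pin, so fewer
    than two pins is rejected here.
    """
    if len(pins) < 2:
        raise ValueError("need the active pin plus at least one backup pin")
    if max_age < 0:
        raise ValueError("max-age must be a non-negative number of seconds")

    directives = ['pin-sha256="%s"' % pin for pin in pins]
    directives.append("max-age=%d" % max_age)
    if include_subdomains:
        directives.append("includeSubDomains")
    if report_uri:
        directives.append('report-uri="%s"' % report_uri)

    name = "Public-Key-Pins-Report-Only" if report_only else "Public-Key-Pins"
    return name, "; ".join(directives)


if __name__ == "__main__":
    # Placeholder pins and URI, purely illustrative.
    print(build_hpkp_header(["AAAA=", "BBBB="], 300,
                            report_uri="https://example.org/hpkp-report"))
```

Starting without `includeSubDomains` and with a small `max_age`, as proposed in the description, corresponds to the defaults used here.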

Event Timeline

JanZerebecki set the priority of this task to Low.
JanZerebecki updated the task description.
JanZerebecki added subscribers: Aklapper, fgiunchedi, drdee and 13 others.

Why not pin the leaf keys? We can create one (or many) backup keys and keep them offline. If, for whatever reason, we need to revoke the currently used keys, we can use the backups to keep providing services. Also, for the max-age, the RFC document recommends 60 days [1]. It is likely browsers have an upper limit for the max-age as well. If so, we can only use up to that maximum max-age.

[1] Public Key Pinning Extension for HTTP Section 4.1

I personally would prefer to pin leaf keys, but the question is what the people maintaining the certificates can offer.

If it's helpful, we got a pretty good perspective from the authors of the HPKP spec on how to think about pinning, on this GitHub thread:

https://github.com/SSLMate/sslmate/issues/10

I think the ideal situation we should aim for in the long term is that we'd sign 3x leaf keys roughly as discussed in various recommendations:

  1. The one currently in use, signed by our current CA and deployed on all our servers
  2. A backup key which is already signed by our current CA and ready to go, but whose private key is not deployed on our servers, and is instead stored offline from production securely.
  3. Another backup key which hasn't even been signed: has never seen the light of day outside of our secure offline storage.

This gives us the ability to recover from a compromised key without any external delays for new signing processes, by switching to key 2 after addressing the compromising attack itself (we don't have this capability now regardless of HPKP). The 3rd key can be promoted to slot 2 and signed in the meantime, and also be ready for use within the HPKP max-age from the original compromise. At that point we'd generate a new third key, which we can't really use until max-age time has passed. From this sort of thinking, we can set some reasonable bounds on risks vs max-age. The same basic process would apply if we had a non-compromise reason to replace our private keys (e.g. to upgrade to a longer key length).
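The three-slot rotation can be sketched as a small state model. This is only an illustration under simplifying assumptions: keys are represented by made-up labels, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeySet:
    active: str           # deployed on the servers, signed by the current CA
    signed_backup: str    # signed, but the private key is kept offline
    unsigned_backup: str  # never signed, never left offline storage

    def pinned(self):
        """All three keys are advertised in the HPKP header."""
        return {self.active, self.signed_backup, self.unsigned_backup}


def rotate_after_compromise(keys, fresh_key):
    """Drop the compromised active key and shift every backup up one slot."""
    return KeySet(active=keys.signed_backup,
                  signed_backup=keys.unsigned_backup,
                  unsigned_backup=fresh_key)


before = KeySet("key1", "key2", "key3")
after = rotate_after_compromise(before, "key4")

# Clients that cached the old pin set still accept the new active key...
assert after.active in before.pinned()
# ...but the freshly generated key is unknown to them until the old
# policy has aged out (max-age), so it cannot be used as active yet.
assert "key4" not in before.pinned()
```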

In a scenario where we want to change CAs while already operating under this scheme: if our private keys are not compromised, we can have the new CA sign the old keys without affecting HPKP at all (it hashes the actual public key, not the signed cert).
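To make the "hashes the public key, not the cert" point concrete: per RFC 7469 a pin is the base64 of the SHA-256 digest of the DER-encoded SubjectPublicKeyInfo. The sketch below uses placeholder bytes instead of real DER data, and the toy "certificates" are just byte strings wrapping the same key, so it only demonstrates the principle.

```python
import base64
import hashlib


def pin_sha256(spki_der: bytes) -> str:
    """base64(SHA-256(SubjectPublicKeyInfo)), the value used in pin-sha256."""
    return base64.b64encode(hashlib.sha256(spki_der).digest()).decode("ascii")


def cert_fingerprint(cert_der: bytes) -> str:
    """Hex SHA-256 over the whole certificate, which changes when re-signed."""
    return hashlib.sha256(cert_der).hexdigest()


# Placeholder standing in for a real DER-encoded SubjectPublicKeyInfo.
spki = b"placeholder-spki-bytes"

# Toy stand-ins for the same key signed by two different CAs.
cert_from_old_ca = b"issuer=OldCA;" + spki + b";signature=1"
cert_from_new_ca = b"issuer=NewCA;" + spki + b";signature=2"

# The certificates differ, but the pin is derived from the key alone.
assert cert_fingerprint(cert_from_old_ca) != cert_fingerprint(cert_from_new_ca)
print(pin_sha256(spki))
```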

The costs/tasks to us are (a) doubling our usual recurring costs on cert signing, unless our CA is gracious enough to single-charge us for duplicate key-signing over the same domain names and timeframes in support of HPKP, and (b) implementing a storage mechanism for the inactive private keys that's both secure and reasonable for operations to access relatively quickly. There are lots of tricky ideas to work through there.


However, in the interim, I think pinning a few roots might be an option as well, as discussed in the SSLMate ticket linked earlier. We could do a relatively-short-term (say, 2 weeks) pin on our current CA's root as well as 1-3 others that we think are viable alternative issuers (meet all of our basic policy and technical requirements - this is non-trivial and excludes many!). It doesn't cost us a lot of operational risk if the selection criteria for the alternates are sound, and it still thwarts a lot of potential attacks via other compromised and/or untrustworthy CAs.

I don't think we should pin specific certificates. It sounds way too risky to me, even with backup keys, and the benefit compared to pinning roots is small: in the incredibly unlikely event that GlobalSign gets compromised, we can be very sure that it will be noticed immediately and browsers will revoke it from their stores quickly.

I think pinning roots could be interesting though, for the reasons you already mentioned. We could pin GlobalSign and DigiCert, with both of which we have a contractual relationship and which we know can issue the "unified" certificate for us, plus a couple more as backups (Symantec/Verisign? RapidSSL? Comodo?).

I still think pinning a set of certs is doable and even safer, but perhaps we can re-debate that later on, when we have some better infrastructure for storing and distributing keys (which we'll likely be building regardless), and perhaps modulo future changes to CA-related things...

For now we agree on the roots thing at least, and that brings us a fair increase in TLS strength, so let's make this ticket about that. We can do a followup at a later point in time about whether we want to upgrade to some sort of more-hardcore strategy when/if it makes sense. I know we've used RapidSSL in the past, so I guess they must be contractually-ok as well, although I don't know if they meet our current tech requirements on the SNI/unified/wildcard front. We don't want to bloat responses with too many pinned roots, either. Maybe just GlobalSign + DigiCert + RapidSSL? I'd think 3x CAs with a max-age of 2 weeks is still pretty conservative in terms of breakage/inflexibility risks.

The tricky part is finding CAs that are able to issue our unified certificate. I don't think RapidSSL can do that but we'd have to double-check.

We've chatted about this off and on via IRC. My current thinking (which isn't all that different from the above) is:

  • Definitely - we should turn on HPKP ASAP with 2-3 root-level CAs. We'd start with a very small max-age just to ensure no unexpected issues, and then gradually up the max-age as appropriate (but probably never more than a month or so, I think). One of the CAs must be our current CA: GlobalSign. Initial implementation would definitely not have includeSubDomains; neither does our HSTS for several good reasons that we can't immediately fix, sadly.
  • Fairly Sure - the second one should be DigiCert, since we've used them in the past, and still have some certs with them for other things right now, so we do have a relationship and past experience there to work with.
  • Somewhat Sure - that our third choice isn't going to matter a lot in practice and we should just pick Verisign on the grounds that they're widely used and trusted. All that's really important here is that it's someone we think is trustworthy, and that we can potentially work with in an extremely unlikely emergency situation. The only scenario in which we'd need to use the 3rd option is one in which we've decided that both of the first two are now bad actors and/or have had their root keys compromised, and we need to act on that with a cert switch in less than max-age time (implying both events happened to these unrelated CAs in rather rapid succession). Another option here is just not to use a 3rd at all, and trust that we'll never see both GlobalSign and DigiCert get compromised or go rogue within max-age, which seems like a fairly reasonable assumption if we clamp max-age down even shorter.

TL;DR:

If nobody disagrees with the above logic in general, let's choose an option here just so we can move forward:
a) [GlobalSign, DigiCert, Verisign] w/ eventual max-age = 1 month
b) [GlobalSign, DigiCert] w/ eventual max-age = 2 weeks

Either way, we can still switch course later as we continue to evaluate all of this. The limitation is that we can't effectively switch course for all clients in less than max-age from the implementation of any new policy.
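As a worked example of that limitation: a client that noted the old policy just before a change keeps it for up to the old max-age, so the last cached copy expires at roughly change time plus old max-age. The sketch below assumes a 30-day month and uses an arbitrary example date; the names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

TWO_WEEKS = 14 * 24 * 3600   # option (b), in seconds
ONE_MONTH = 30 * 24 * 3600   # option (a), assuming a 30-day month


def last_old_policy_expiry(change_time, old_max_age_seconds):
    """Latest moment at which any client may still hold the old pin set."""
    return change_time + timedelta(seconds=old_max_age_seconds)


# Arbitrary example date for the policy change.
change = datetime(2015, 6, 1, tzinfo=timezone.utc)
print("option (a):", last_old_policy_expiry(change, ONE_MONTH).date())
print("option (b):", last_old_policy_expiry(change, TWO_WEEKS).date())
```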

Are you considering doing a phase where you do the Public-Key-Pins-Report-Only header first, to see what the likely issues would be?

It might be a good idea for the initial deploy for a day or two, yeah.

OK, @faidon and I had a pages-long chat about this. I think we're going to step back and re-evaluate our options and best practices here instead of trying to boldly move forward with some ill-conceived trivial root-pinning HPKP idea. The gist of that discussion was this:

  • Choosing a small set of roots to pin via HPKP turns out to be really hard when you start trying to really do it. There are a lot of considerations about things like which roots are in old browsers' CA stores, which roots a new emergency cert would be signed by, which roots cross-sign others (keeping in mind that the pinned key must be used by the browser in validation, not merely be an alternate indirect signer), etc.
    • The "safe" way to do it would be to HPKP-pin at the CA-organizational-level, meaning we'd pin every available root cert from 2-3 CAs. That ends up being a really long and bloaty list to be including in an HPKP header; it's not really acceptable in terms of response bloat.
    • The unsafe ways are seriously problematic in multiple realistic scenarios we might have to respond to re: compromised CA private keys or orgs within any decent max-age timespan.
  • In light of all of that, it makes more sense for us to go back to the "pin 3x leaf keys" option as a target for HPKP. Leaf key pins are independent of any CA issue, as we can always have our leaf keys signed or re-signed by any CA. The only reason we have offline backups in that list at all is to deal with compromise of our own private keys.
  • However, we do eventually perhaps want browsers doing preload-PKP for us as well. Current preloads ( https://src.chromium.org/viewvc/chrome/trunk/src/net/http/transport_security_state_static.json ) seem to take the "safe" option above of pinning many roots, which doesn't have bloat problems when it's not in a header. Preloading is so potentially-long-term that I don't think we'd want to do this with leaf keys.
  • So in an ideal world, I think we'd want to preload a long CA roots list in browsers, and HPKP-pin just our 3x leaf keys on a relatively-short (say 1 month at most?) max-age timeline. The preload roots helps prevent a non-pinned rogue CA from breaking fresh clients, while the HPKP-pin provides deeper protection for regular visitors. The RFC leaves it up to UA implementors how to resolve the two sources ( https://tools.ietf.org/html/rfc7469#section-2.7 ), but I think the intention is that a valid HPKP (which implies preload-PKP success on initial hit to us) overrides the preload-PKP. Assuming that's a semi-sane end-goal, getting there looks something like this:
    1. @faidon's going to ask around a bit about best-practices elsewhere and see if that does sound sane or not, which may lead to varying this proto-plan a bit further.
    2. We'll ask @MoritzMuehlenhoff to help us set up a secure storage mechanism for backup private leaf keys (details of this probably won't be in this public ticket).
    3. For testing purposes, we'd probably trial our preloaded long list of CA roots via HPKP first and bite the bloat bullet temporarily (perhaps report-only first, and perhaps only to a statistical sample of clients), before trying to submit it to upstream browsers.
    4. Separately, we'd trial our leaf-key pinning similarly before deploying that.
    5. Any leaf-key pinning is probably best blocked until resolution of T86654, because that will likely reduce our total key count considerably and make it easier to manage: we'll likely drop from many SNI certs/keys to just two, unified RSA and unified ECDSA. We wouldn't bother with SNI, since ECDSA certs are so much smaller to begin with and cert size was our primary motive for the SNI complexity, and very few browsers fall into the window where response size would matter (those supporting SNI but not ECDSA).

The more I think about it, the more I think my initial assessment was wrong and a solution where we pin 3-5 keys of our own certificates (one primary RSA, one primary ECDSA, one backup but online where the primary lies, 1-2 offline and/or gfshare'd) makes the most sense. If one gets paranoid, there are still cryptographic threats to this scenario, e.g. what if we generated the keys on a compromised server or one with a broken RNG or a broken OpenSSL (think Debian circa 2007), which we could think of ways to mitigate.

The other consideration that we should have (which is a bit orthogonal to HPKP) is the time it would take us to issue a certificate if a) we get compromised or b) GlobalSign gets compromised, especially for our unified certificate. The case for (b) is especially scary, considering that e.g. DigiCert needed human action to issue our special certificate. Since that would now mean a total site outage, it might be worth it to just bite the bullet and actually pay to issue a certificate from another CA for one of our backup keys.

Just thinking out loud: while we're working through all the more-complex issues about doing this "right", should we consider an interim solution that provides some small benefit and gets us some operational exposure? For instance, we could do the header only (not preload), pin to GlobalSign + a single backup CA, and ramp max-age in from initial tiny values to a maximum of, say, 24h. It would still give visitors who hit us daily some protection against a sudden MITM with compromised certs, and at 24h it doesn't hurt our CA/key agility much either. We can always turn it back down/off if we think we're making changes in the coming days/weeks, too.
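A ramp like the one suggested here could be generated as follows. The doubling factor and the 60-second starting value are assumptions made for the sketch; only the 24h cap comes from the comment above.

```python
def max_age_ramp(start_seconds, cap_seconds, factor=2):
    """Return increasing max-age values from start up to (and ending at) cap."""
    if start_seconds <= 0 or cap_seconds < start_seconds or factor <= 1:
        raise ValueError("need 0 < start <= cap and factor > 1")
    values = []
    current = start_seconds
    while current < cap_seconds:
        values.append(current)
        current *= factor
    values.append(cap_seconds)  # always finish exactly at the cap
    return values


if __name__ == "__main__":
    # Start at one minute, never exceed 24 hours.
    for step, max_age in enumerate(max_age_ramp(60, 24 * 3600), start=1):
        print("step %2d: max-age=%d" % (step, max_age))
```

Each step would be deployed only after the previous one has run without unexpected issues, and the value can be turned back down ahead of planned CA or key changes.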

How about doing "report-only" first with a longer max-age, like 7 days?

We still haven't had time to deal with HPKP properly yet, so this task has been kinda stalled-out for quite a while. While the actual HPKP pinsets (and associated cert management issues) are complex, we could more easily go ahead and submit a browser preload pinset (as mentioned in the 4th bullet point of T92002#1388392 ) with a reasonable set of CAs we currently use or think we might use as alternates in the future. This buys us protection (in PKP-preloading browsers, once the data eventually lands) against the many other browser-trusted CAs that we haven't vetted at all issuing certificates for our domains.

You can see the static root pinsets from other major sites at the top of the json file here: https://code.google.com/p/chromium/codesearch#chromium/src/net/http/transport_security_state_static.json

What I'd propose for our usage is this set:

"static_spki_hashes": [
    "GlobalSignRootCA",
    "GlobalSignRootCA_R2",
    "GlobalSignRootCA_R3",
    "VeriSignClass1",
    "VeriSignClass3_G4",
    "VeriSignClass4_G3",
    "VeriSignClass3_G3",
    "VeriSignClass1_G3",
    "VeriSignClass2_G3",
    "VeriSignClass3_G2",
    "VeriSignClass2_G2",
    "VeriSignClass3_G5",
    "VeriSignUniversal",
    "GeoTrustGlobal",
    "GeoTrustGlobal2",
    "GeoTrustUniversal",
    "GeoTrustUniversal2",
    "GeoTrustPrimary",
    "GeoTrustPrimary_G2",
    "GeoTrustPrimary_G3",
    "DigiCertGlobalRoot",
    "DigiCertEVRoot",
    "DigiCertAssuredIDRoot"
]

All four CAs listed here are well-known and well-trusted in the general sense globally. In terms of being in good company, they're also common to the PKP-preload lists for other major sites. Aside from that, specific rationales for these four for us are:

  • GlobalSign - this is the current issuer of our primary unified cert, as well as multiple of our other certs.
  • Verisign - We use this currently for payments.wikimedia.org
  • GeoTrust - This is the root behind some secondary certs of ours issued via RapidSSL
  • DigiCert - We've used them in the past for primary certs of ours, and thus it's a better-known backup option in the future. We still use them for some current secondary certificates as well.

I've lightly audited all the certs in files/ssl/ in our puppet repo and they're all issued via roots in this list as well, but we should validate that more thoroughly with actual fingerprints to ensure there's not a missing root that would block a cert we're using today.
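The audit itself boils down to a set-membership check once each cert has been mapped to the root it chains to. The sketch below assumes that mapping has already been extracted by some other tool; the file names and the "SomeUnlistedRoot" entry are made up for illustration, and only a subset of the proposed roots is shown.

```python
def certs_outside_pinset(cert_to_root, pinned_roots):
    """Return {cert: root} for every cert whose root is not in the pin set."""
    return {cert: root for cert, root in cert_to_root.items()
            if root not in pinned_roots}


# Subset of the proposed list, for brevity.
pinned = {"GlobalSignRootCA", "GeoTrustGlobal", "DigiCertGlobalRoot"}

# Hypothetical audit input: cert file name -> identifier of its root.
audit = {
    "unified.example.crt": "GlobalSignRootCA",
    "secondary.example.crt": "GeoTrustGlobal",
    "legacy.example.crt": "SomeUnlistedRoot",
}

for cert, root in sorted(certs_outside_pinset(audit, pinned).items()):
    print("would be blocked: %s (root: %s)" % (cert, root))
```

In a real audit the identifiers would be SPKI fingerprints rather than labels, so that a root with a familiar name but a different key is not mistaken for a pinned one.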

The risk in moving forward with this (which I think is acceptable) is that we're limiting ourselves to never being able to use CAs other than those in this list (or CAs cross-signed by them) for our primary domains, at least not without significant (e.g. a year or two) lead-time to get them added to the list and wait for all modern browsers to upgrade their data.

I'm open to suggestions that we make the list more inclusive by including some of the other CAs in the lists of other major sites as well (e.g. Comodo, Entrust, GoDaddy, and/or Thawte). LetsEncrypt roots might make sense as well, just so we don't lock ourselves out of migrating to that for at least some of our certs down the road. There is a bit of risk inherent in that as well, though, as they're brand-new and we don't know how the integrity of their process will work out in the real world...

@BBlack I suggest removing at least VeriSignClass3_G2 and VeriSignClass1 from our trust list. According to [1], Class3_G2 is a 1024-bit root, and Class1 was replaced by Class1_G3 during 2010.

[1] https://www.symantec.com/page.jsp?id=roots

@Chmarkine - I'm not sure if it's wise or reasonable to remove old roots from PKP lists (except in strange policy cases like the recent one here: https://googleonlinesecurity.blogspot.com/2015/12/proactive-measures-in-digital.html), which is probably why the other major sites don't prune them much either.

The issue here is that for PKP to assert validity, it's not enough that we're signed by a CA that's on our list and in the browser's store somewhere -- it has to be in the actual runtime trust path the browser follows, of which there could be multiple possibilities, and the set could vary depending on browser and OS vendor/version.

Browsers aren't required by the standard to look hard for a match, only to look at the singular trust path they actually used. For example, due to CA cross-signing with legacy certs, etc... there could be signed trust pathways from our site certificate Cert1 through various intermediates and cross-signings to two different root certs CA1 and CA2 that are both in a given browser's trusted-roots storage. However, if the browser happened to follow the path to CA1 to validate us (what it would show in the Certification Information screen in the browser - a single path to a single root), and we only had CA2 in our list, the PKP validation could fail.
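The Cert1/CA1/CA2 scenario can be written down as a tiny check. Labels stand in for SPKI hashes and the intermediate names are invented; the logic is just "at least one key in the path the browser actually used must be in the pin set".

```python
def pkp_validates(validated_path, pinset):
    """True if any key in the single path the browser used is pinned."""
    return any(key in pinset for key in validated_path)


# Two valid trust paths for the same site certificate (labels are illustrative).
path_via_ca1 = ["Cert1", "IntermediateA", "CA1"]
path_via_ca2 = ["Cert1", "IntermediateB", "CA2"]

pinset = {"CA2"}  # we only pinned CA2

print(pkp_validates(path_via_ca1, pinset))  # browser built the CA1 path
print(pkp_validates(path_via_ca2, pinset))  # browser built the CA2 path
```

With only CA2 pinned, a browser that happens to build the CA1 path fails the pin check even though a CA2 path exists, which is why the proposed list errs on the side of including more roots.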

There are probably optimizations that can be made, but it's hard to know what's in every (legacy) cert store still in use for PKP-validation everywhere, and any mistake here can't really be corrected easily, so I tend to err on the side of caution. The important thing is that specifying a list here at all kills hundreds of other potentially shadier and/or less-secure CAs from the implicit and highly-variable list of "all CAs in whatever cert store the client is using".

@BBlack, would it be possible to try Public-Key-Pins-Report-Only with a short max-age, just to see how much of an issue the cross-signing really is? I think the only downside if something goes bad is that the report-uri gets hammered for up to max-age time.

Do we have a good way to report failures? I'm looking at adding CSP violation reports to mediawiki, but if there's a way to do this in varnish + hadoop, that might be better.
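Whatever ends up receiving the reports, the payload is a JSON document POSTed by the browser. A minimal sketch of pulling out the useful fields follows; the key names are the ones defined in RFC 7469 section 3, while the function name and the sample values are made up.

```python
import json


def summarize_hpkp_report(body: bytes) -> dict:
    """Extract the fields most useful for triage from an HPKP violation report."""
    report = json.loads(body.decode("utf-8"))
    return {
        "hostname": report.get("hostname"),
        "noted_hostname": report.get("noted-hostname"),
        "known_pins": report.get("known-pins", []),
        "validated_chain_length": len(report.get("validated-certificate-chain", [])),
    }


# Made-up sample report, trimmed to a few fields.
sample = json.dumps({
    "date-time": "2016-02-02T14:45:00Z",
    "hostname": "test.example.org",
    "port": 443,
    "noted-hostname": "example.org",
    "validated-certificate-chain": ["PEM1", "PEM2"],
    "known-pins": ['pin-sha256="AAAA="'],
}).encode("utf-8")

print(summarize_hpkp_report(sample))
```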

In general, I'd really like us to get to the point where we can pin leaf certs (to help us detect if our CA is colluding with bad people, etc.). But if pinning our CAs can get us experience with it, let's do it.

> The issue here is that for PKP to assert validity, it's not enough that we're signed by a CA that's on our list and in the browser's store somewhere -- it has to be in the actual runtime trust path the browser follows, of which there could be multiple possibilities, and the set could vary depending on browser and OS vendor/version.
>
> Browsers aren't required by the standard to look hard for a match, only to look at the singular trust path they actually used.
>
> There are probably optimizations that can be made, but it's hard to know what's in every (legacy) cert store still in use for PKP-validation everywhere, and any mistake here can't really be corrected easily, so I tend to err on the side of caution. The important thing is that specifying a list here at all kills hundreds of other potentially shadier and/or less-secure CAs from the implicit and highly-variable list of "all CAs in whatever cert store the client is using".

It probably should have been stated. IMHO, if the default runtime trust path doesn't match the PKP, the browser should try the next one, until no more possible paths are left.
We can't do much about existing implementations of RFC 7469 not doing this, but for PKP lists preloaded in the browser the actual implementation is known, and we could choose to be included only where that implementation does this. That would benefit everyone.
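The "try the next path" behaviour could look like the sketch below. This is not how any particular browser is known to behave; it only illustrates the proposal, with labels standing in for SPKI hashes.

```python
def pkp_validates_any_path(candidate_paths, pinset):
    """Accept if ANY valid trust path contains at least one pinned key."""
    return any(not pinset.isdisjoint(path) for path in candidate_paths)


# All valid paths the browser could build for the site certificate.
paths = [
    ["Cert1", "IntermediateA", "CA1"],  # the default path
    ["Cert1", "IntermediateB", "CA2"],  # alternative via cross-signing
]

# Only CA2 is pinned: the default path alone would fail, but the
# exhaustive search still finds the matching alternative.
print(pkp_validates_any_path(paths, {"CA2"}))
```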

For all of the same good reasons pointed out a while back in e.g. https://blog.qualys.com/ssllabs/2016/09/06/is-http-public-key-pinning-dead , Google is putting the last nail in the coffin of [H]PKP, effectively killing it:

https://groups.google.com/a/chromium.org/forum/#!msg/blink-dev/he9tr7p3rZ8/eNMwKPmUBAAJ

So are there any plans to implement Certificate Transparency instead, given that it is the replacement suggested by Google?

Yes, but the work for that is more on the CA end than ours, from a technical perspective. Because of Google's deadlines, in practice virtually all CA vendors will have to automatically embed SCTs in all certs they issue by April 2018. Our vendors are already on top of this though, and we expect to have SCTs embedded in our next round of unified certificate renewals from our dual CAs, GlobalSign and DigiCert, happening this quarter.

GlobalSign added OV (our cert type) to the set they embed SCTs in, for all new OV certs issued after Oct 30, 2017. We're working on the GlobalSign renewal now and will issue in November.

DigiCert claims to already support embedding SCTs in all of their certificates, but it sounds like we may have to ask for it instead of assuming it will happen automatically (which we will). That renewal is coming up shortly as well, but will likely happen about a month after the GlobalSign one that we're working on now.
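Once the renewed certs arrive, a quick sanity check for embedded SCTs is to look for the SCT-list extension OID (1.3.6.1.4.1.11129.2.4.2, from RFC 6962) in the DER bytes. The sketch below is a crude byte search, not a real X.509 parse, so treat a hit as a hint only; the function names are made up.

```python
def encode_oid(dotted: str) -> bytes:
    """DER-encode an OBJECT IDENTIFIER (short-form length only)."""
    arcs = [int(part) for part in dotted.split(".")]
    body = bytearray([40 * arcs[0] + arcs[1]])
    for arc in arcs[2:]:
        chunk = [arc & 0x7F]          # last byte has the high bit clear
        arc >>= 7
        while arc:
            chunk.append((arc & 0x7F) | 0x80)
            arc >>= 7
        body.extend(reversed(chunk))
    return bytes([0x06, len(body)]) + bytes(body)


# X.509 extension carrying the embedded SCT list (RFC 6962).
SCT_LIST_OID = encode_oid("1.3.6.1.4.1.11129.2.4.2")


def looks_like_it_has_embedded_scts(cert_der: bytes) -> bool:
    """Heuristic: does the encoded SCT-list OID occur anywhere in the cert?"""
    return SCT_LIST_OID in cert_der


print(SCT_LIST_OID.hex())
```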