Switch to ECDSA hybrid certificates
Closed, ResolvedPublic

Description

For performance reasons, we should switch to ECDSA Hybrid (= signed by an RSA CA) certificates for our projects. This will bring us performance benefits, both in terms of RTTs and CPU usage.

Since we now do SNI and the set of browsers that support SNI & ECDSA is the same (which we should double-check), we can safely do this with a wide UA impact and without hurting any non-compatible UAs, as long as we keep our fallback "unified" certificate RSA.

This is blocked on availability of those certificates, as most CAs do not sell them yet. Our new CA provider, GlobalSign, had previously given us a Q1 2015 ETA for these, right before we picked them. I asked our representative last Tuesday (and again today) to confirm that they're still on track for this.

Event Timeline

faidon claimed this task.
faidon raised the priority of this task from to High.
faidon updated the task description.
faidon added subscribers: Aklapper, faidon, mark, BBlack.

Have you looked at SSLMate for a CA (reseller)? https://sslmate.com/

I've been using them at home and at work (https://myra.treasury.gov uses a cert by them), and their CLI based approach to certificate management is a total joy. They're also quite reasonably priced, and extremely responsive.

I also bring them up because they're rolling out ECDSA certificates shortly, and I'm sure that for Wikipedia they'd turn on access for you right now.

We've evaluated a bunch of certificate vendors & solutions (most of them, I'd say) and personally, I've been aware of SSLMate since its inception. Unfortunately we have a number of unique requirements that very few CAs are able to deliver, such as issuing certificates with multiple wildcards for different TLDs, explicitly setting notBefore/notAfter etc. and that's just from the technical side (we've had legal reject the CPS of at least one CA). We very recently switched our business to GlobalSign and we did so with the expectation of ECDSA availability this calendar quarter, so we're good on this front I'd say. So far :)

The response from our GlobalSign rep is that they don't have a firm timeline and can't commit to a Q1 rollout yet.

faidon lowered the priority of this task from High to Medium. (Feb 24 2015, 11:59 AM)
BBlack changed the task status from Open to Stalled. (Apr 26 2015, 11:06 PM)
BBlack moved this task from Blocked on External to Backlog on the Traffic board.

We have some hints now that we may get to move on an ECDSA solution sometime in June. More details later on, but just noting for planning/estimation.

So, GlobalSign announced the cert support here: https://www.globalsign.com/en/blog/elliptic-curve-cryptography/ , and it seems we can reissue current certs as ECC from the portal. However, in talking this over on IRC, it seems like our earlier assumption that every browser supporting SNI also supports ECC is probably flawed. At least one significant SNI-but-no-ECC case is known: Chrome on WinXP.

So, we might not be able to switch to ECC, unless we can deploy parallel ECC/RSA certs for the same server instance. Apache has supported that for a while, but the patch to do it for nginx was never merged, primarily owing to complications in the patch's interactions with their internal OCSP Stapling fetch/verify code, which we're not using anyways (we fetch/verify externally and use ssl_stapling_file instead).

The last version of the patch was here: http://mailman.nginx.org/pipermail/nginx-devel/2013-October/004474.html , but the whole thread is worth reading if we decide to go down some variant of that path. It doesn't apply cleanly to more-recent nginx versions, but it might be fairly trivial to port it forward, especially if we don't care about the stapling problems (we don't). The downside is I'd really rather not put us in the boat of maintaining our own non-upstream nginx patches. Another alternative would be to try to finish the work of the original patch author and get it accepted upstream, with all of the stapling stuff sorted out.

You'd be doing the web community a very nice service if you went that route. I wonder if you could connect with someone at Cloudflare to collaborate on the patch, or at least giving it some feedback and review.

Let me just note that at least some of Cloudflare's ECDSA certificates are incompatible with Opera 12.x (which represents a bit more than 0.5% of our page views). Pages using them are completely inaccessible for Opera 12 users even after adding the certificate to the whitelist. Please make sure we do not accidentally lock out these users from our sites.

Yeah, that's why we're blocking on dual-cert support in nginx (or other solution to the problem). There's similar issues with e.g. Chrome on WinXP, so we won't be doing ECDSA until we can serve both certs in parallel to work around the issue.

Status update: I've built on debian upstream's 1.9.1 package, patched to 1.9.2, and then forward-ported the latest variant of the multicert patches from http://mailman.nginx.org/pipermail/nginx-devel/2015-March/006734.html , and that's running on the test node (cp1008/pinkunicorn) with our prod config + reuseport enabled and seems basically ok. This is just a local test build, not a real package to upload to our repo (but the patches are formed up in debian/patches/ cleanly for later reuse in a real package if we want).

I haven't yet tried to actually use the multi-cert patch functionality, but I suspect just from reading the patches, that it needs additional work to support ssl_stapling_file (which is what we use for stapling) to have different stapling response files per certificate.

Change 220182 had a related patch set uploaded (by BBlack):
Add Filipe da Silva's multicert patches, forward-ported

https://gerrit.wikimedia.org/r/220182

Change 220182 merged by BBlack:
Add Filipe da Silva's multicert patches, forward-ported

https://gerrit.wikimedia.org/r/220182

@JanZerebecki mentioned in https://gerrit.wikimedia.org/r/#/c/220377 :

For the explanation of the current ordering see https://wiki.mozilla.org/Security/Server_Side_TLS#Prioritization_logic . (AES-GCM over AES because of security, AES128 over 256 because of speed while it seems it doesn't make security weaker as the weakest link is elsewhere.) For most clients out there it probably makes no difference if we order ECDSA over RSA first or only after the AES ordering. But theoretically it may make sense if 2k RSA is the weak link. Another argument for it is that latency because of a bigger handshake is more important than ongoing CPU cost.
Moving to EC in general is I think a good idea, but then in practice we cannot freely pick the best curve. HTTPS implementations are way behind openssh. Are we limited to secp256r1, secp384r1, secp521r1 (which is what https://www.ssllabs.com/ssltest/viewMyClient.html says my browser supports)? These are not evaluated on http://safecurves.cr.yp.to/ . Is there some assessment of these curves regarding the properties on safecurves? Were these generated in a nothing-up-my-sleeve way? Sadly I don't have the time now to research this, so no vote either way from me.
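
To make the two orderings being weighed here concrete, a rough sketch follows. It is only an illustration: the suite list is a small hypothetical sample using OpenSSL-style names (not the actual production cipher list), and the sort keys are an assumed reading of "AES-GCM over AES, AES128 over 256" with ECDSA-vs-RSA applied either first or last.

```python
# Hypothetical sample of OpenSSL-style suite names (NOT the production list).
SUITES = [
    "ECDHE-RSA-AES256-SHA",
    "ECDHE-ECDSA-AES128-SHA",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-SHA",
    "ECDHE-ECDSA-AES256-SHA",
]


def is_rsa_auth(suite):
    # False sorts before True, so ECDSA-authenticated suites come first.
    return "-ECDSA-" not in suite


def not_gcm(suite):
    # False sorts first, so AES-GCM suites come before non-GCM ones.
    return "-GCM-" not in suite


def is_aes256(suite):
    # False sorts first, so AES128 comes before AES256.
    return "AES256" in suite


def ecdsa_first(suite):
    """Option A: prefer ECDSA over RSA first, then apply the AES ordering."""
    return (is_rsa_auth(suite), not_gcm(suite), is_aes256(suite))


def aes_first(suite):
    """Option B: apply the AES ordering first, ECDSA over RSA as tie-break."""
    return (not_gcm(suite), is_aes256(suite), is_rsa_auth(suite))


ecdsa_first_order = sorted(SUITES, key=ecdsa_first)
aes_first_order = sorted(SUITES, key=aes_first)

if __name__ == "__main__":
    print("ECDSA first:", *ecdsa_first_order, sep="\n  ")
    print("AES first:", *aes_first_order, sep="\n  ")
```

With option A, a client that supports ECDSA never lands on an RSA suite; with option B, it still picks ECDSA whenever both variants of the same AES mode are offered, which is why the difference is mostly theoretical for typical clients.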

So, to answer some of the above and recap where IRC conversations about this have gone lately:

  • Yes, we're limited by TLS protocol support and browsers to only the NIST curves right now. I think for major browser support over TLS, we're limited to the NIST 256- and 384-bit curves. GlobalSign (the CA we'd be using to issue) supports NIST 256, 384, and 521: https://support.globalsign.com/customer/portal/articles/1994347-ecc
  • We've acquired a live ECC key from GlobalSign already for our unified wildcard production cert, using the NIST 256 curve (prime256v1 in openssl ecparam terms) in order to do some testing outside of production, but have not yet deployed that to production. In rough terms, this is equivalent to a 3K RSA key, so it's better than the 2048 RSA's we have, at least on that one level of thinking about such things.
  • There are some browsers which we still care about which do not support ECDSA at all. Notable on that list I believe are things like IE8/XP and Android 2.x, probably among a few others. @faidon did some sampling on our traffic and saw that something like 94% of our client traffic is ECDSA capable in the overall, but clearly we're not going to deploy something that kills the other 6% :)
    • Because of the above, we'd have to have a way to support two different server-side keypairs in parallel, switching depending on client support and our cipher pref list between the ECDSA and RSA variant keys.
    • That (multiple certs w/ different algorithms) requires special server-side support that nginx currently lacks, but we've been testing a bit with the patches linked earlier in this ticket to enable that functionality.
    • Also, the multi-cert thing has some unknown interactions with OCSP Stapling (there is no explicit multi-cert support for ssl_stapling_file in the nginx patches yet). I think I've found answers to that problem (that our offline staple-fetcher can do what the nginx fetcher appears to do: staple both in one OCSP response, assuming same OCSP URL and intermediate cert), but I need to do further verification work here.
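
The parallel-cert behaviour described in the list above can be sketched roughly as follows. This is not how nginx or OpenSSL implement it, just a simplified model: the preference list, the cert file names and the negotiate() helper are all hypothetical, and a real handshake also takes the client's supported curves and signature algorithms into account.

```python
# Server preference list with ECDSA suites ahead of their RSA equivalents
# (hypothetical and heavily abbreviated).
SERVER_PREFS = [
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-AES128-SHA",
    "ECDHE-RSA-AES128-SHA",
    "AES128-SHA",  # plain RSA key exchange, for very old clients
]

# Hypothetical file names for the two certs served in parallel.
CERTS = {"ECDSA": "unified-ecdsa.crt", "RSA": "unified-rsa.crt"}


def auth_type(suite):
    """Return which key type authenticates the given suite."""
    return "ECDSA" if "-ECDSA-" in suite else "RSA"


def negotiate(client_suites):
    """Pick the first server-preferred suite the client offers.

    Returns (suite, cert_file), or None if there is no overlap.
    """
    offered = set(client_suites)
    for suite in SERVER_PREFS:
        if suite in offered:
            return suite, CERTS[auth_type(suite)]
    return None


if __name__ == "__main__":
    modern = ["ECDHE-RSA-AES128-GCM-SHA256", "ECDHE-ECDSA-AES128-GCM-SHA256"]
    legacy = ["AES128-SHA", "ECDHE-RSA-AES128-SHA"]
    print(negotiate(modern))  # ECDSA-capable client gets the ECDSA cert
    print(negotiate(legacy))  # RSA-only client falls back to the RSA cert
```

The point of the model is that no client is locked out: an RSA-only client simply never matches an ECDSA suite and ends up on the RSA cert.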

As for the questions about whether ECDSA is better than RSA in the overall, and whether the available curves are good enough: I think those are matters of opinion yet, and there are multiple comparison angles.

On security issues, I lean towards thinking that ECDSA is better than RSA, and that while the available curves are not as theoretically strong as they could be (cf the safecurves link from @JanZerebecki above), they're still very strong, esp given the effective key size bump from RSA-2048 to ECC-256. I know it's just a sort of sideways argument from authority, but Cloudflare likes it ( https://blog.cloudflare.com/ecdsa-the-digital-signature-algorithm-of-a-better-internet ), and in response to some of the notes in that article, our multi-cert patches require OpenSSL 1.0.2, so we'd get the fixes they mention from that as well.
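
For a rough sense of the "effective key size bump", the comparable-strength figures from NIST SP 800-57 Part 1 can be put in a small lookup. This is a sketch only: the helper names are made up for illustration, and the table says nothing about the curve-quality questions raised on safecurves.

```python
# (security strength in bits, RSA modulus bits, minimum ECC key bits),
# following the comparable-strengths table in NIST SP 800-57 Part 1.
COMPARABLE = [
    (80, 1024, 160),
    (112, 2048, 224),
    (128, 3072, 256),
    (192, 7680, 384),
    (256, 15360, 512),
]


def rsa_strength(modulus_bits):
    """Highest table strength reached by an RSA modulus of this size."""
    best = 0
    for strength, rsa_bits, _ecc_bits in COMPARABLE:
        if modulus_bits >= rsa_bits:
            best = strength
    return best


def ecc_strength(key_bits):
    """Highest table strength reached by an ECC key of this size."""
    best = 0
    for strength, _rsa_bits, ecc_bits in COMPARABLE:
        if key_bits >= ecc_bits:
            best = strength
    return best


if __name__ == "__main__":
    print("RSA-2048:", rsa_strength(2048), "bits")
    print("ECC P-256:", ecc_strength(256), "bits")
```

By this table a P-256 key sits in the same row as 3072-bit RSA, one step above the 2048-bit RSA keys currently in use.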

On other fronts: Cloudflare's testing mirrors our own here: Signing on the server side gets approximately an order of magnitude faster with ECDSA, and the overall cert size sent during session negotiation is a couple hundred bytes shorter than equivalent RSA ones for now (could be more, eventually, with ECC intermediates and such as well).

Currently, I expect us to switch to dual ECDSA+RSA certs with the NIST 256 curve in the fairly near future, once we've worked out all the associated technical details above, and assuming no new snags in all of that crop up.

FYI: I did some force-pushing and branch-deleting to fix up previously-malformed branches in the operations/software/nginx repo:

  • The master branch was force-push overwritten to exactly track debian's master branch from git://anonscm.debian.org/collab-maint/nginx.git, and the intent is that it will always track that branch with ff-only merges and not contain any local commits.
  • Created a new branch wmf-1.9.2-1 which branches from the 1.9.2-1 release commit in master, and is solely for creating our 1.9.2-1+wmfN releases for stapling and multi-cert stuff. When upstream (debian) makes new releases, we'll do new branches like this for our +wmfN releases and port over whatever patches need porting, etc.
  • Deleted the previous branches wmf and wmf-192fix, which were based on the previous malformed master and applied our patchwork in the wrong order.

TL;DR - branch to follow for current local work is: https://git.wikimedia.org/log/operations%2Fsoftware%2Fnginx.git/refs%2Fheads%2Fwmf-1.9.2-1

pinkunicorn is now running the proposed setup:

  • openssl 1.0.2c
  • nginx rebuilt against openssl 1.0.2c, with multi-cert patches, from branch above
  • ECDSA-preferred ciphersuite changes from https://gerrit.wikimedia.org/r/#/c/220377/
  • dual ECC+RSA unified certs from GlobalSign (no separate SNI certs)
  • dual-cert ocsp stapling, regenerated once an hour from a hacky cronjob to keep it up to date, since our current stapling code doesn't understand how to do it.

Because the underlying host has had puppet disabled for a while, it's missing some other recent stuff like our latest HSTS max-age bump, so ignore that. I have yet to validate how the stapling issues work out, that's TBD over the next few days (sadly, I haven't found any direct debug info in FF/Chrome to see whether they liked our stapling or (in the FF case anyways) fell back to remote OCSP fetches, will probably just have to sniff for the remote OCSP fetches).

You can test the SSL-level stuff by pointing a browser at https://pinkunicorn.wikimedia.org/ , or see ssllabs results @ https://www.ssllabs.com/ssltest/analyze.html?d=pinkunicorn.wikimedia.org&hideResults=on

On security issues, I lean towards thinking that ECDSA is better than RSA, and that while the available curves are not as theoretically strong as they could be (cf the safecurves link from @JanZerebecki above), they're still very strong, esp given the effective key size bump from RSA-2048 to ECC-256.

Can you qualify "still very strong"? I have not actually tried to find out but it might be that the known problems ( https://www.hyperelliptic.org/tanja/vortraege/20130531.pdf ) in combination with a lacking implementation are enough to make it weaker than 2k RSA?

Maybe we should wait for browsers to enable using a curve that is more bad ass ( http://safecurves.cr.yp.to/bada55.html )?

More seriously, we are already using NIST P-256 for ECDHE:
$ openssl s_client -msg -connect en.wikipedia.org:443 2>/dev/null |grep -A 1 ServerKeyExchange
<<< TLS 1.2 Handshake [length 014d], ServerKeyExchange

0c 00 01 49 03 00 17 41 04 73 62 0f 9a 24 8f 97

The curve is identified by 0x0017 which is 23, see https://www.iana.org/assignments/tls-parameters/tls-parameters.xml#tls-parameters-8 for the mapping.
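
For anyone who wants to decode those bytes without reading the hex by eye, here is a minimal sketch. It assumes the dump starts at the handshake header of an ECDHE ServerKeyExchange with a named curve, as in the s_client output above; parse_ske_prefix is a made-up helper name and the curve table is only a small subset of the IANA registry.

```python
# Subset of the IANA TLS named-curve registry (decimal IDs).
NAMED_CURVES = {23: "secp256r1", 24: "secp384r1", 25: "secp521r1"}


def parse_ske_prefix(hex_dump):
    """Decode the first 9 bytes of an ECDHE ServerKeyExchange message."""
    data = bytes.fromhex(hex_dump.replace(" ", ""))
    curve_id = int.from_bytes(data[5:7], "big")
    return {
        "msg_type": data[0],                          # 12 = server_key_exchange
        "length": int.from_bytes(data[1:4], "big"),   # handshake body length
        "curve_type": data[4],                        # 3 = named_curve
        "curve_id": curve_id,
        "curve": NAMED_CURVES.get(curve_id, "unknown"),
        "point_len": data[7],                         # length of the EC point
        "point_format": data[8],                      # 4 = uncompressed point
    }


info = parse_ske_prefix("0c 00 01 49 03 00 17 41 04 73 62 0f 9a 24 8f 97")

if __name__ == "__main__":
    for field, value in info.items():
        print(field, "=", value)
```

The decoded length (0x000149 = 329) plus the 4-byte handshake header matches the "length 014d" (333) that s_client reports, and the 65-byte uncompressed point is what you'd expect for a 256-bit curve.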

On security issues, I lean towards thinking that ECDSA is better than RSA, and that while the available curves are not as theoretically strong as they could be (cf the safecurves link from @JanZerebecki above), they're still very strong, esp given the effective key size bump from RSA-2048 to ECC-256.

Can you qualify "still very strong"?

No, I really can't qualify that. IANACryptographer. All I can do is read the same papers everyone else on the Internet does and try to make good judgements. I'll try to qualify my thinking as best I can, though.

I tend to think the issues pointed out by e.g. safecurves, probably aren't sufficient on their own to avoid using ECDSA keys with a curve like NIST P-256 today alongside a 2K RSA key, in our specific implementation scenario for widely-compatible real world use. Don't get me wrong: I've always been a huge fan of DJB, and I think he and others in the field are doing important work on these issues. I can't wait for the day when widely-implemented standards we can rely on can use better curves like the 2^255-19 -based ones.

But given where we're at today and what's available to us, I just don't know that we're losing much going from 2k RSA to ECDSA w/ NIST P-256 in this particular scenario, and we may really be gaining. A lot of the attack methods rely on timing and side-channels and the same methods are probably pretty bad against RSA as well. In our specific scenario, we're doing a high volume of crypto on bare hardware we own, as opposed to this being on a smartcard, or being used in a shared VM environment alongside attackers who can measure side-channel timing, or doing so few transactions that perf impacts are easy to isolate at a distance. I don't think some of these attacks are pragmatically achievable as improvements over attacking other aspects of our TLS security (similar attacks on RSA, TLS flaws common to both key choices, OpenSSL flaws common to both key choices, RNG attacks, simply compromising our systems or breaking into and/or colluding with CAs, etc).

Reading list, mostly from the ECDSA-negative/paranoid view:
http://safecurves.cr.yp.to/
http://blog.cr.yp.to/20140323-ecdsa.html
http://eprint.iacr.org/2013/734.pdf
This entire lengthy thread: https://www.ietf.org/mail-archive/web/tls/current/msg10149.html
Reddit attacking Cloudflare's "Yay ECDSA!" blog post: http://www.reddit.com/r/netsec/comments/20kaaw/ecdsa_the_digital_signature_algorithm_of_a_better/

Going back the other way, I guess a lot of what I'm saying here mirrors this guy's conclusions in Schneier's blog comments re pragmatism: https://www.schneier.com/blog/archives/2013/09/the_nsa_is_brea.html#c1678526

The only argument I've seen that really frightens me about NIST P-256 in our scenario is the idea that it could have been specifically engineered by its NSA author to be spectrally weak through a fairly exhaustive but plausible search of the parameter space. I don't think anyone has ever even implied any proof that this occurred, though.

Maybe we should wait for browsers to enable using a curve that is more bad ass ( http://safecurves.cr.yp.to/bada55.html )?

More seriously, we are already using NIST P-256 for ECDHE:
$ openssl s_client -msg -connect en.wikipedia.org:443 2>/dev/null |grep -A 1 ServerKeyExchange
<<< TLS 1.2 Handshake [length 014d], ServerKeyExchange

0c 00 01 49 03 00 17 41 04 73 62 0f 9a 24 8f 97

The curve is identified by 0x0017 which is 23, see https://www.iana.org/assignments/tls-parameters/tls-parameters.xml#tls-parameters-8 for the mapping.

Yeah, that's the nginx default we're using there, for the config parameter ssl_ecdh_curve. Obviously, if we don't trust P-256, then we'd want to change that parameter. However, AFAIK the only other viable option that's widely compatible with browsers is NIST P-384, which shares a lot of the same question-marks. That or we shut off ECDH(E)-based ciphersuites completely, at which point we're back to... what? Regular DH, which is even weaker? Are there other realistic options here? So long as we continue to accept P-256 for the ECDHE part, we can't really argue that using a P-256 key for ECDSA is making things any worse, right?

Also, after writing all of the above, I feel it's important to reference xkcd's wisdom on this topic as well as a counterpoint: https://xkcd.com/538/ :)

Thanks.

Btw. Cloudflare has had ECDSA disabled again for quite some time. No comment from them in their blog. Any idea if there were other reasons besides the compatibility problems from T86654#1340042?

I'm really not sure to be honest, and like you said they don't seem to have spoken publicly about it. The compatibility issues were with ECDSA-only setups, and elsewhere they mentioned they had patched their servers for multi-cert (probably similar to the patches we're trying out now). While https://blog.cloudflare.com does not seem to be dogfooding ECDSA, in a later blogpost about Universal SSL rollout they mentioned a particular customer's site to check out the new setup on here: https://ciaranmcnulty.com/ , and when I test that I do see ECDSA with a NIST P-256 key in use there.

I tested a few other major sites with s_client: twitter doesn't have ECDSA either, but google and facebook both do (using P-256 as well).

BBlack changed the task status from Stalled to Open. (Jun 29 2015, 6:52 PM)

Tested OCSP Stapling this evening as best I can: my test env was the latest stable FF release on Mac with strict OCSP prefs set. If I disabled stapling at the server completely, I could verify with a sniffer that it fetched OCSP from GlobalSign directly when I hit pinkunicorn after a fresh browser start, every time. With the dual-cert stapled response built in either order (RSA first or ECDSA first), FF did not fetch OCSP info from GlobalSign, and also did not give an error. Works for me :)

I'm basically satisfied with the tested raw setup at this point (it's also testing some DHE/ciphersuite stuff, but that's unrelated). Blocking work for putting this on live caches is now:

  1. Put a real nginx-1.9.2-1+wmf2 build into reprepro, built cleanly on the jessie-wikimedia openssl-1.0.2c-1 backport (some minor issues with backports + copper + git-pbuilder to sort out)
  2. Some puppetization issues to sort out (essentially, upgrade the "sslcert" concept to an "sslcertset" concept of related certs for OCSP + nginx config)
  3. Wait till we're done with the initial perf measurement period on non-SNI RSA-unified config, due to happen on live caches this week.

Change 222067 had a related patch set uploaded (by BBlack):
tlsproxy: multi-cert support, including ocsp

https://gerrit.wikimedia.org/r/222067

The necessary packages are now in jessie-wikimedia repo: (openssl-1.0.2c-1 in backports, nginx-1.9.2-1+wmf2 in main). We're not deploying these to prod machines until we're further along with the rest of the plan, though.

Change 222067 merged by BBlack:
tlsproxy: multi-cert support, including ocsp

https://gerrit.wikimedia.org/r/222067

Change 224011 had a related patch set uploaded (by BBlack):
Add ecc-uni.wikimedia.org cert

https://gerrit.wikimedia.org/r/224011

Change 224011 merged by BBlack:
Add ecc-uni.wikimedia.org cert

https://gerrit.wikimedia.org/r/224011

Now that all of the other supporting work is merged in puppet, cp1008 aka "pinkunicorn.wikimedia.org" now has a fully-puppetized test of ECDSA+RSA as of: https://gerrit.wikimedia.org/r/#/c/224012/3/modules/role/manifests/cache/ssl/unified.pp

The remaining steps to enable on the rest of the cache cluster are, essentially:

  1. Install our new openssl-1.0.2d-1~wmf1 + nginx 1.9.2-1+wmf3 packages (they're already in reprepro as upgrades for these machines, just haven't done "apt-get -y upgrade" since).
  2. Switch on dual certs in unified.pp for all (as with the cp1008 testing commit above).

Updated the packages on cp1065 for some live-testing of just that part.

I'm going to be out on vacation starting Friday the 17th, and I also probably shouldn't turn on ECDSA just before the weekend today either. That leaves Monday (13th) morning US-time as a good spot for deploying this, so I'll get 4 weekdays to observe before leaving related things on hold for a week or so.

This is done for the primary unified cert on the cache clusters now. At some later point we may want to go ECDSA+RSA for some of our minor certs on individual apaches, and/or the wmfusercontent.org and planet certs on misc-web, but that can be a separate task/discussion.

BBlack claimed this task.
BBlack moved this task from Traffic team actively servicing to Done on the Traffic board.

Change 224728 had a related patch set uploaded (by BBlack):
Port Filipe da Silva's multicert patches, bump libssl to 1.0.2

https://gerrit.wikimedia.org/r/224728

Change 224728 merged by BBlack:
Port Filipe da Silva's multicert patches, bump libssl to 1.0.2

https://gerrit.wikimedia.org/r/224728