OpenSSL 1.1 deployment for cache clusters
Closed, Resolved (Public)

Description

There are multiple drivers to get on OpenSSL-1.1 sooner rather than later:

  1. Sec cleanup in 1.1's code quality -> potentially fewer future bugs.
  2. Supports chacha20-poly1305 out of the box without the cloudflare patch we're using for 1.0.2
  3. Supports x25519 for ECDHE key exchange, including with mismatched ECDSA curve (e.g. our prime256v1 ECDSA cert for auth + x25519 ECDHE for key exchange)
  4. Removal of a lot of junk legacy stuff, including all of SSLv2.
  5. Lots of other misc improvements in the ChangeLog of course
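As a quick way to see which of these drivers apply on a given host, the following is a minimal sketch using Python's standard `ssl` module. It reports the OpenSSL version the interpreter is linked against and lists any ChaCha20-Poly1305 suites that version offers. The function names are made up for illustration, and the result describes whatever library Python was built against, which is not necessarily the one nginx uses.

```python
import ssl


def openssl_is_at_least_1_1():
    """True if the linked OpenSSL reports version 1.1 or newer."""
    # OPENSSL_VERSION_INFO is a tuple: (major, minor, fix, patch, status).
    return ssl.OPENSSL_VERSION_INFO[:2] >= (1, 1)


def chacha20_cipher_names():
    """Names of the ChaCha20-Poly1305 suites enabled in a default context."""
    ctx = ssl.create_default_context()
    # get_ciphers() returns one dict per enabled cipher; filter on the name.
    return sorted(c["name"] for c in ctx.get_ciphers() if "CHACHA20" in c["name"])


print(ssl.OPENSSL_VERSION)
print("1.1 or newer:", openssl_is_at_least_1_1())
print("ChaCha20-Poly1305 suites:", chacha20_cipher_names())
```

An empty list on a 1.0.2 build without the cloudflare patch would be consistent with point 2 above.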

With the first public release having happened back on August 25, we're now in a position where we could theoretically get this working quickly. I already have a test build of nginx-1.11 + openssl-1.1.0 on pinkunicorn and it seems to function as expected for TLS termination.

These are the potential holdups/issues:

  1. I'm still a little leery of the newness of this first major 1.1 release. We may want to wait a little longer to see if there's a first wave of public bug reports and fixups first.
  2. The ChaPoly support lacks the older non-RFC draft-mode that the cloudflare patch we're running today has. I think this is basically-acceptable today if not ideal, and only gets more-acceptable with each passing day as more Chrome installs update. Current stats on the draft chapoly negotiations are ~20% of all chapoly (~4.5% of all) and declining.
  3. The default openssl-1.1 build includes 3DES in the definition of "weak ssl ciphers" that are disabled-by-default at compile time, alongside RC4. This is a decision that was made at the last minute before 1.1's release, in response to the SWEET32 birthday attack stuff. We can re-enable them with a debian/rules patch that turns both 3DES and RC4 back on, and I've done so in my latest experimental patch. I'm not 100% sure that's the right thing to do, but on the balance of things it seems too early to abruptly deny access to IE-on-XP in production at this point.
  4. Stock nginx-1.11.3 basically supports OpenSSL-1.1, but has some minor incompatibilities with the official OpenSSL-1.1.0 release (which came after it). Current nginx master has fixups for this (applied to our experimental package) that should land in 1.11.4, but it would be nicer to wait for the real 1.11.4 release.
  5. While libssl1.1 can co-exist with libssl1.0 on the same machine, there's only one /usr/bin/openssl. I've checked that our ocsp stapling scripts (which invoke the openssl CLI) seem to still work with the new CLI tools, but we may have several other uses of the CLI tools that could face compatibility issues here that need addressing first (e.g. stuff in our sslcert puppet module). A possible middle path here is to build nginx against libssl1.1 and only deploy the library (not the CLI) for jessie in general.
  6. The debian packaging of nginx-1.11 includes some 3rd-party modules that basic upstream nginx doesn't. Two of these modules (upstream_fair, and http_lua) do not build correctly against 1.1.0 libssl headers. I've disabled both in our experimental build as our cache terminators don't use them, but @yuvipanda pointed out that some labs uses of nginx do use the Lua module. For all I know there may already be upstream fixes for this at the 3rd party module sources that haven't yet trickled through the debian nginx packaging, or local fixups might be trivial.
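For item 3, one way to check whether a particular build still offers 3DES or RC4 is to ask the library directly. This is a rough sketch using Python's `ssl` module; `cipher_enabled` is a hypothetical helper name, and the answer reflects the OpenSSL build and configuration that Python is linked against, so it is only indicative for an nginx built against a different libssl.

```python
import ssl


def cipher_enabled(name):
    """True if the linked OpenSSL will offer the named (TLS 1.2 or older) cipher."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # Raises ssl.SSLError when the cipher string selects nothing,
        # e.g. because the cipher was compiled out of the library.
        ctx.set_ciphers(name)
    except ssl.SSLError:
        return False
    # Double-check the resulting list, since TLS 1.3 suites may remain
    # enabled regardless of the cipher string.
    return any(c["name"] == name for c in ctx.get_ciphers())


for candidate in ("DES-CBC3-SHA", "RC4-SHA", "AES128-SHA"):
    print(candidate, cipher_enabled(candidate))
```

On a stock OpenSSL 1.1 build, where the weak ciphers are disabled at compile time, the first two would be expected to report False; with the debian/rules patch described above they should be offered again.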

Details

Related Gerrit Patches:
operations/software/nginx : wmf-1.11.4 : new variant for ECDHE curve logging

Event Timeline

BBlack created this task. Sep 1 2016, 7:24 PM
Restricted Application added a project: Operations. Sep 1 2016, 7:24 PM
Restricted Application added a subscriber: Aklapper.

Thanks for getting this started!

My gut feeling is that we should wait for 1.1.0a. The release only happened a week ago and there will inevitably be bugs that only show up in real-world use. And by that time we'll be able to see whether it still makes sense to care about draft chapoly (probably not).

As for re-enabling triple-des/rc4, I'm undecided, with a tendency towards sticking with the openssl default (depending on when we go forward with 1.1). This will definitely affect a fair number of users, but OTOH Firefox still provides a compatible browser for XP, and we'll definitely not be the only website which will become inaccessible to IE/XP users when OpenSSL 1.1 gets more widespread. Maybe we can define a sunset date sometime in 2017 and use your Varnish redirect notification to alert users.

(This would also be a nice discussion topic for the offsite, BTW)

As for the packaging/coexisting of the libraries: Another thing to consider is the -dev package. If we only build selected applications against openssl 1.1, we could rename it to e.g. libssl1.1-dev and then target nginx specifically to it (since we're building custom nginx packages anyway).

Wrt nginx and compatibility with openssl 1.1, it would be awesome if you could add your findings to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=828453 (like which patches were cherrypicked etc.).

BBlack added a comment. Sep 2 2016, 5:06 PM

As for re-enabling triple-des/rc4, I'm undecided, with a tendency towards sticking with the openssl default (depending on when we go forward with 1.1). This will definitely affect a fair number of users, but OTOH Firefox still provides a compatible browser for XP, and we'll definitely not be the only website which will become inaccessible to IE/XP users when OpenSSL 1.1 gets more widespread. Maybe we can define a sunset date sometime in 2017 and use your Varnish redirect notification to alert users.

In general, it would be nice to dump DES-CBC3-SHA for sure (and for that matter, our other remaining non-forward-secret cipher, AES128-SHA). Technically we also support some forward-secret 3DES ciphers, but their real-world usage is so tiny as to be mostly pointless. There's a task about it that's been evolving over time: T118181. Aside from the TLS and general security issues, I'm sure our frontend developers would love to stop having to worry about compatibility with such ancient browsers, too. However, based on usage stats alone, I don't think we're yet at a point where it's reasonable to do so. We're getting closer every day, but we're still at ~0.2% for DES-CBC3-SHA, and something more like 0.5% for all non-forward-secret ciphers combined.

Putting that in perspective, that means we're averaging somewhere in the ballpark of ~180 HTTP requests/second from just DES-CBC3-SHA, which we might translate to something closer to ~10 pageviews/sec given the number of requests per pageview. If we were able to narrow that to just certain countries where WinXP still enjoys more popularity (e.g. China), the percentage is probably higher.
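The back-of-the-envelope numbers above can be reproduced with a short sketch. Only the ~0.2%, ~0.5%, ~180 req/s and ~10 pageviews/s figures come from this thread; the total request rate and the requests-per-pageview ratio are not stated anywhere here and are merely what those figures imply, so treat them as rough inferences.

```python
# Figures quoted in the comment above (all approximate).
DES_CBC3_SHARE = 0.002            # ~0.2% of requests use DES-CBC3-SHA
NON_FS_SHARE = 0.005              # ~0.5% use any non-forward-secret cipher
DES_CBC3_REQS_PER_SEC = 180       # ~180 HTTP requests/second
DES_CBC3_PAGEVIEWS_PER_SEC = 10   # ~10 pageviews/second

# Inferred values: NOT stated in the thread, only implied by the above.
implied_total_reqs_per_sec = round(DES_CBC3_REQS_PER_SEC / DES_CBC3_SHARE)
implied_reqs_per_pageview = DES_CBC3_REQS_PER_SEC / DES_CBC3_PAGEVIEWS_PER_SEC
implied_non_fs_reqs_per_sec = round(NON_FS_SHARE * implied_total_reqs_per_sec)

print("implied total req/s:", implied_total_reqs_per_sec)
print("implied requests per pageview:", implied_reqs_per_pageview)
print("implied non-forward-secret req/s:", implied_non_fs_reqs_per_sec)
```

Under these assumptions, the 0.5% non-forward-secret share would correspond to a few hundred requests per second, which is the scale of traffic a cutoff would affect.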

Were it not for the impetus of OpenSSL's release, we probably wouldn't have chosen to disable 3DES completely anytime soon, even in light of SWEET32. I do think we should campaign harder to reduce the percentage by educating users at this point, and we will have to define a cutoff somewhere, but I suspect a rational cutoff won't happen to align with whatever date we ultimately want to deploy OpenSSL-1.1.

(This would also be a nice discussion topic for the offsite, BTW)

Agreed!

As for the packaging/coexisting of the libraries: Another thing to consider is the -dev package. If we only build selected applications against openssl 1.1, we could rename it to e.g. libssl1.1-dev and then target nginx specifically to it (since we're building custom nginx packages anyway).
Wrt nginx and compatibility with openssl 1.1, it would be awesome if you could add your findings to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=828453 (like which patches were cherrypicked etc.).

I haven't cherrypicked anything, I just moved forward from our 1.11.3 release to include all of nginx upstream master's current changes (which should eventually become part of 1.11.4). I don't think upstream has a solution (yet) for 1.10. It's probably possible to backport several 1.11 commits (not just the most recent ones) onto 1.10 for compatibility, but we don't even run 1.10 here to begin with.

BBlack added a comment. Sep 6 2016, 8:28 PM

On the "potential post-1.1.0 issues" front so far, nothing too serious, but:

BBlack moved this task from Triage to TLS on the Traffic board. Sep 30 2016, 1:39 PM
BBlack added a comment. Oct 3 2016, 2:28 PM

We discussed this at the offsite, and we're ready to go with OpenSSL 1.1.0b. The plan is to patch our build such that the -dev package is version-differentiated in the package name (e.g. libssl1.1-dev), then upload only the library and dev packages to jessie-wikimedia, and use them only to rebuild our custom nginx package (with custom deps on libssl1.1-dev). @MoritzMuehlenhoff will work out the packaging.

We'll also be diverging from upstream OpenSSL-1.1 a bit with two patches that I should rebase/cleanup first: the chapoly prefhack patch (the simpler variant), and re-enabling the default-disabled 3DES for now.

BBlack added a comment. Oct 4 2016, 5:57 PM

The chapoly prefhack and +3des stuff are in the first 3 commits here and should rebase fine as they are: https://phabricator.wikimedia.org/diffusion/ODFP/history/wmf-1.1/ . Note the "wmf-1.1" branch was a temporary experiment - we can ignore it or force-push over it, or whatever makes sense.

@MoritzMuehlenhoff uploaded packages today, and nginx (1.11.4 + various local patches) is rebuilt against it and in testing on https://pinkunicorn.wikimedia.org/ now.

Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:18:04Z] <bblack> uploading nginx-1.11.4+wmf3 to carbon jessie-wikimedia - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:43:15Z] <bblack> upgrading nginx on all cache_misc @ ulsfo - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-18T14:54:40Z] <bblack> upgrading nginx on all cache_misc @ codfw - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-18T15:06:47Z] <bblack> upgrading nginx on all remaining cache_misc (eqiad, esams) - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-18T17:01:40Z] <bblack> upgrading nginx on cache_maps - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-19T11:30:40Z] <bblack> upgrading nginx on cp2001 (codfw text canary) - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-19T11:35:06Z] <bblack> upgrading nginx on cp2002 (codfw upload canary) - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-19T12:52:04Z] <bblack> upgrading nginx on codfw text+upload caches - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-19T13:31:44Z] <bblack> upgrading nginx on ulsfo text+upload caches - T144523

Mentioned in SAL (#wikimedia-operations) [2016-10-19T19:30:29Z] <bblack> upgrading nginx+openssl on remaining cache nodes (eqiad+esams/text+upload) - T144523

BBlack closed this task as Resolved. Oct 19 2016, 9:10 PM
BBlack claimed this task.

Done for now, assuming we don't find a reason to revert!

Change 320629 had a related patch set uploaded (by BBlack):
new variant for ECDHE curve logging

https://gerrit.wikimedia.org/r/320629

Change 320629 merged by BBlack:
new variant for ECDHE curve logging

https://gerrit.wikimedia.org/r/320629