Planning for phasing out non-Forward-Secret TLS ciphers
Closed, Resolved (Public)

Description

For reference on our ciphersuite list, see: https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/ssl_ciphersuite.rb

With our current ciphersuite code, switching a server/service to "mid" is how we drop support for clients that can't negotiate a forward-secret cipher, which is the next obvious step up in our TLS security for all clients in general.
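
As a rough illustration of what "forward-secret" means at the cipher-name level, here is a minimal sketch that splits OpenSSL-style cipher names by key exchange. The function names and the example list are hypothetical and are not taken from ssl_ciphersuite.rb; the only assumption is the usual OpenSSL naming convention for TLS 1.2 and earlier, where forward-secret suites start with ECDHE- or DHE-.

```python
# Prefixes that indicate ephemeral (EC)DH key exchange in OpenSSL-style
# cipher names for TLS <= 1.2. Older OpenSSL releases also used "EDH-" for
# some DHE suites; that alias is deliberately not handled in this sketch.
FS_PREFIXES = ("ECDHE-", "DHE-")


def is_forward_secret(cipher_name):
    """Return True if the cipher name indicates (EC)DHE key exchange."""
    return cipher_name.upper().startswith(FS_PREFIXES)


def split_by_forward_secrecy(ciphers):
    """Split a cipher list into (forward_secret, non_forward_secret)."""
    fs = [c for c in ciphers if is_forward_secret(c)]
    non_fs = [c for c in ciphers if not is_forward_secret(c)]
    return fs, non_fs


# Illustrative list only, NOT the real "compat" or "mid" list.
EXAMPLE_LIST = [
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-SHA",
    "DHE-RSA-AES128-SHA",
    "AES128-SHA",
    "DES-CBC3-SHA",
]

if __name__ == "__main__":
    fs, non_fs = split_by_forward_secrecy(EXAMPLE_LIST)
    print("forward-secret:", fs)
    print("non-forward-secret:", non_fs)
```

In these terms, a "mid"-style list would be one for which the second returned list is empty.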

Rationale

  1. By getting to 100% Forward Secrecy, we remove some incentive for 3rd parties to try to log bulk encrypted traffic and then later attempt to steal the private key to decrypt the unprotected portions of it. We've already removed some of that incentive by getting our Forward Secrecy up into the 99%+ range; however, there could still be incentives to go after the remaining non-FS traffic in various scenarios and regions.
  2. By eliminating these older / non-FS cipher options from our servers, we're categorically preventing unknown future protocol downgrade attacks that might attempt to force newer and better clients to negotiate these weaker options.
  3. The remaining clients which need the "compat" ciphers being eliminated are, categorically, all very ancient and insecure. The dominant client in this set, IE8-on-XP, is almost comically insecure. We want to move users off of these insecure platforms in order to have better trust in the authenticity of our users' traffic, and also to help the users themselves (who likely use these platforms for other secure computing tasks as well, unwittingly), and generally improve the security landscape of the internet by moving users off of platforms that are easy botnet enslavement targets (such botnets are in turn used to attack Wikipedia and other sites).

Issues Preventing

Moving up to "mid" for a given server/service probably shouldn't be done for any implementation that doesn't have proper DHE support (as otherwise the list of incompatible clients is longer than discussed in this ticket). Currently that means only nginx can move to "mid". Apache can do it in theory as well, but the latest apache packages we have deployed (apache-2.4 from jessie upstream) aren't capable: we need to at least re-compile them against OpenSSL 1.0.2 first.

When speaking of common human browser agents, moving to "mid" with DHE support primarily means dropping compatibility with IE7-8 on WinXP (IE8 on later versions of windows is unaffected, as are alternative browsers on WinXP). Other compatibility fallout includes ancient feature-phones, older media devices / set-top boxes, and un-updated/commercial Java6-based code. In another sense, this switch gets us past a very long unknown tail of clients that were published before the modern era of really paying significant attention to modern TLS security issues in general.

For our primary wiki traffic (text-lb, upload-lb), we probably won't be able to rationally switch to "mid" for quite some time, possibly a year or more out from now, as we still have a non-ignorable IE8/XP user population we can't cut off yet. However, I think we can start the process by moving other services to "mid" ahead of the eventual switch of the primary clusters. We could start with one-off services that are more technical in nature, which normal users would rarely connect to and aren't critical to them, such as icinga.wikimedia.org.

One of the other logical steps before (perhaps long before) switching the primary clusters is to switch the "misc" cluster, which hosts a variety of non-primary services, including some that non-technical users are more likely to need access to, such as annual, transparency, scholarships, iegreview, planet, etc.

How/When do we move forward?

Right now we don't even have any clearly-defined standards or thresholds for how we'll make the "mid" decision for a given service/cluster. Some possible candidates would be:

  1. Percentage of requests

    We could define an arbitrary threshold in terms such as "When affected request percentage to a given service falls below 0.1% averaged over an entire month" (where "affected" means clients that would be shut out by a switch to "mid"). We could define different thresholds for different services, too, but note that a whole cache-cluster is a "service" for these purposes. So all wikis are bundled together, as are all services on the "misc" cluster.

    For reference, when averaging requests over all of our cache endpoints (text, upload, misc, etc. combined), the affected percentage for a switch to "mid", averaged over the previous 30 days as I write this, is 0.7%. If we broke it down by cluster, it's likely that the number is lower on misc than on text and upload.
  2. Age of newest affected clients

    We could do some age-based decisions on how outdated the affected clients are, and/or how long they've been vendor-unsupported. The primary case here for moving to "mid" is IE8-on-XP.

    Relevant data on WinXP from https://en.wikipedia.org/wiki/Windows_XP#Support_lifecycle : WinXP was end-of-life'd (after a ton of pre-announcement and campaigning) on April 8, 2014, a little over 1.5 years ago at this time. Shortly before the EOL date, Microsoft pushed an update which causes WinXP to show users a monthly pop-up reminding them of the EOL status and urging them to upgrade. The last major updated release of Windows XP (SP3) was released in mid-2008, and the final cutoff of licensing to manufacturers to install new copies of XP SP3 was in 2010.

    Importantly, it is widely known at this time that WinXP has multiple security vulnerabilities which are remotely exploitable and were discovered after the End-of-Life date, for which there are no security update patches. Knowledge of how to exploit these vulnerabilities is widespread, making it a completely insecure platform at this point in time.

    However, it still enjoys about 10% client platform popularity in various studies currently, with significantly higher percentages in certain locales such as China. In our own User-Agent statistics we're seeing MSIE 8 (which may not be on XP, but probably is predominantly on XP) at 1.28% of all requests, and XP paired with any browser (which includes unaffected third-party browsers) at 3.88%. Again, the total affected percentage in our TLS request stats is 0.7%, which would represent the intersection of specifically MSIE8-on-XP as well as several other far-less-popular long tail clients.
  3. When they eventually break security for other clients

    This is the reactive last-resort option. We can wait until it becomes absolutely necessary because there's a known and/or announced vulnerability that affects other more-modern clients' TLS connections, for which the only fix is to disable some older ciphersuites and/or TLS protocol versions which implies an effective switch to "mid" or higher. This is essentially what happened in late 2014 with the POODLE exploit, which required us to disable SSLv3, which in turn shut off support for IE6-on-XP.

    What's bad about this option is that whenever this future date hits, there's a good chance that said exploit will already have been known and exploited by some third parties ahead of our knowledge and reaction. It would be more proactive if we got past the "mid" hurdle before such an event recurs, rather than as a reaction to it.
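
The percentage-of-requests criterion from option 1 above could be evaluated along these lines. This is only a sketch: the function names, the per-day count format, and the sample numbers are assumptions for illustration, and the 0.1% default is just the example threshold quoted above, not an agreed standard.

```python
def affected_percentage(daily_counts):
    """Percentage of affected requests over a window of days.

    daily_counts is a sequence of (affected_requests, total_requests)
    pairs, one per day. The result is weighted by request volume, so
    busy days count for more than quiet ones.
    """
    counts = list(daily_counts)
    affected = sum(a for a, _ in counts)
    total = sum(t for _, t in counts)
    if total == 0:
        raise ValueError("no requests in the window")
    return 100.0 * affected / total


def ready_for_mid(daily_counts, threshold_pct=0.1):
    """True if the affected share is below the chosen threshold."""
    return affected_percentage(daily_counts) < threshold_pct


if __name__ == "__main__":
    # Made-up sample: 30 days at 0.7% affected, matching the figure
    # quoted above for all cache endpoints combined.
    window = [(7, 1000)] * 30
    print(round(affected_percentage(window), 2), ready_for_mid(window))
```

The same function can be run per cluster (text, upload, misc) with different thresholds, since each cache cluster counts as one "service" for this purpose.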

Regardless, we could/should look at phased approaches other than by service/cluster when it comes to IE8/XP specifically. Some options we could deploy in advance would include things like CentralNotice banners triggered only for IE8/XP urging users to upgrade away from it, or possibly disabling some features or capabilities for such ancient clients (logging into accounts with special privileges? editing in general?). Open question whether either of those should be triggered just on IE8/XP specifically, or as a blanket rule for any client which fails to negotiate a forward-secret cipher.

Event Timeline

BBlack raised the priority of this task to Needs Triage.
BBlack updated the task description.
BBlack added a project: Traffic.
BBlack added subscribers: BBlack, faidon, Joe and 4 others.
Restricted Application added subscribers: StudiesWorld, Aklapper.

"We could start with one-off services that are more technical in nature, which normal users would rarely connect to and aren't critical to them, such as icinga.wikimedia.org."

I support this. There are many other such domains that I think we can turn to "mid" now, including gerrit, rt, wikitech, wikitech-static, ticket, librenms, and tendril. Note that https://lists.wikimedia.org already uses the "mid" ciphersuite.

For other services (misc, text, etc.), I tend to support moving to mid "when they eventually break security for other clients", and I think it's very unlikely that this date will ever arrive.

fgiunchedi triaged this task as Medium priority. Dec 1 2015, 12:26 PM
fgiunchedi subscribed.

Change 299997 had a related patch set uploaded (by BBlack):
ssl_ciphersuite: drop non-FS AES256 options

https://gerrit.wikimedia.org/r/299997

Copying in an old argument never recorded, I think from @faidon: While many services on cache_misc are obvious targets for "mid", phabricator itself is also hosted there, and thus switching cache_misc to "mid" would prevent affected users from reporting their problem. However, we probably could and should go ahead and upgrade the more operations-focused domains we control. "lists" is still the only site we've ever switched to "mid", but nobody's ever complained about it AFAIK.

A current and complete list of possible candidates in decreasing order of how much we care about security would probably be:

  • gerrit
  • librenms
  • icinga
  • tendril
  • ganglia
  • wikitech (and wikitech-static)
  • archiva
  • dumps

Going into that list a little deeper, though, there's a secondary pragmatic issue. Most (all?) of the servers for the services above are still in Precise or Trusty, and we don't support DHE suites on those operating systems. "mid" without DHE support shuts out an unacceptably-larger percentage of clients. So all of this still blocks on upgrading them to Jessie first.

gerrit will be replaced by https://gerrit-new.wikimedia.org/r/#/q/status:open soon-ish; it will then be on jessie (and use Letsencrypt for certs).

tendril is running on the same server as icinga, so both will be affected and blocked by T125023. Somehow I don't see that happening soon.

wikitech/wikitech-static are maybe good candidates to go for next.

I wonder who maintains archiva.

Change 300065 had a related patch set uploaded (by BBlack):
ssl_ciphersuite: auto-downgrade to compat when necc

https://gerrit.wikimedia.org/r/300065

Change 300071 had a related patch set uploaded (by BBlack):
Ciphersuite upgrades for one-off sites

https://gerrit.wikimedia.org/r/300071

Change 300065 merged by BBlack:
ssl_ciphersuite: auto-downgrade to compat when necc

https://gerrit.wikimedia.org/r/300065

Change 299997 merged by BBlack:
ssl_ciphersuite: drop non-FS AES256 options

https://gerrit.wikimedia.org/r/299997

Change 300071 merged by BBlack:
Ciphersuite upgrades for one-off sites

https://gerrit.wikimedia.org/r/300071

Change 301817 had a related patch set uploaded (by BBlack):
ciphersuites: drop AES128(-GCM)?-SHA256

https://gerrit.wikimedia.org/r/301817

Change 301817 merged by BBlack:
ciphersuites: drop AES128(-GCM)?-SHA256

https://gerrit.wikimedia.org/r/301817

Recapping latest investigations, stats, and changes:

  1. We're down to just DES-CBC3-SHA and AES128-SHA on the non-forward-secret list. Everything else was dumped because it was statistically insignificant and/or provided little useful benefit over one of those two, and all clients that implemented the other options also implement those two.
  2. DES-CBC3-SHA: ~0.17% of requests globally. Is still primarily for IE[78]-on-XP, but also covers some other legacy devices (old feature phones, set top boxes, etc).
  3. AES128-SHA: ~0.25% of requests globally. Is primarily covering evil downgrading TLS proxies. For example, you'll see a Chrome/51 UA coming from either a corporate or cloud-proxy IP (e.g. BlueCoat), using a clearly-inferior cipher compared to what the browser can support. It's apparently common (especially if not software-updated) for these TLS interceptor appliances to downgrade security like this on the outbound side. BlueCoat in particular, in one software update for certain appliances as recently as 2015, only supported DHE-based forward secrecy and disabled it by default "for performance reasons". However, this cipher also covers some legacy devices as with 3DES above (ancient feature phones, set top boxes, etc).
  4. Together, those percentages add up to ~350 reqs/sec globally across all cache clusters.
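
A quick back-of-the-envelope check of the figures above, as a sketch. The implied total request rate is an inference from the quoted numbers, not a measured value, and since the percentages are approximate the results are only good to a couple of significant digits. The variable names are made up for illustration.

```python
# Figures quoted in the recap above (approximate).
DES_CBC3_SHA_PCT = 0.17      # percent of requests globally
AES128_SHA_PCT = 0.25        # percent of requests globally
NON_FS_REQS_PER_SEC = 350    # combined non-forward-secret request rate

# Combined non-forward-secret share of all requests, in percent.
combined_pct = DES_CBC3_SHA_PCT + AES128_SHA_PCT

# Total request rate implied by "combined_pct of traffic is ~350 reqs/sec".
implied_total_rps = NON_FS_REQS_PER_SEC / (combined_pct / 100.0)

# Approximate split of the ~350 reqs/sec between the two ciphers.
des_rps = NON_FS_REQS_PER_SEC * DES_CBC3_SHA_PCT / combined_pct
aes_rps = NON_FS_REQS_PER_SEC * AES128_SHA_PCT / combined_pct

if __name__ == "__main__":
    print(f"combined non-FS share: ~{combined_pct:.2f}%")
    print(f"implied total rate: ~{implied_total_rps:,.0f} reqs/sec")
    print(f"DES-CBC3-SHA: ~{des_rps:.0f} reqs/sec, AES128-SHA: ~{aes_rps:.0f} reqs/sec")
```

So the two remaining ciphers together account for roughly 0.42% of requests, which is consistent with the earlier 0.7% figure having continued to decline.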

We still don't have any kind of objective standard on when/if we'll dump these for primary wiki traffic. Related points for thought:

  1. Our non-FS cipher percentages continue to slowly decline, indicating users really are slowly upgrading and/or old devices are slowly dying, as expected.
  2. It would be interesting to break this down (based on UA+Ciphersuite combo) and find out what percentage are legit legacy clients vs downgrading proxies, but we don't have that data yet.
  3. It would also be interesting to infer from related stats how many of the legacy devices might be users' only access to us (which is more important), vs being some secondary legacy devices of theirs. For instance, some of these cipher choices are from old game consoles. Most likely anyone in the economic category that buys game consoles owns other, newer devices, and we really shouldn't worry about these so much. However, some of these cipher choices might be from low-end feature phones that continue to be a popular primary access choice in some countries that are behind the tech adoption curve.
  4. We've dumped the non-forward-secret ciphers for many one-off sites (rest pending on Jessie upgrades), including e.g. lists.wikimedia.org, gerrit.wikimedia.org, dumps.wikimedia.org, archiva.wikimedia.org, etc. These are all not on our cache clusters and thus not part of the cipher percentages we track. Nobody's complained about it yet. These sites are mostly used by more-technical users, though, and don't reflect our general global reader population.
  5. PCI-DSS is requiring TLSv1.1+ in the future. They've set their final cutoff date for compliance as 2018-06-30, but are encouraging sites to try to get there sooner for security sake. This obviously only applies to ecommerce sites and not us, but it does impact the utility of any non-compliant browsers significantly for users. When the world effectively goes TLSv1.1+ because of this, the non-forward-secret ciphers will be effectively-dead, because real clients that implement nothing better than non-forward-secret ciphers invariably also don't implement TLSv1.1+.
  6. Even though phabricator itself is in cache_misc, I still favor cache_misc to be the first cluster to go forward-secret-only, ahead of the others. It gets far less traffic, and it doesn't block actual wiki reading when we do that, and some of the cache_misc sites are more security-sensitive than the wikis anyways.
  7. We've discussed before running a CN campaign targeting users with bad cipher choices asking them to upgrade the browser/OS for security sake. This still might be effective for some clients (e.g. IE8/XP) if we're sure CN and related bits are compatible with that browser. However, for many of the legacy embedded devices, JS may not work (correctly for this) or be disabled/unusable. Also, it would be a very confusing message for users of modern OS+Browser combos that are only seeing it because of a TLS-downgrading corporate proxy. Those two categories are probably the bulk of it outside of IE8/XP, so we might be better off ignoring actual cipher choices and just focusing on UA detection of IE8/XP specifically and recommending OS/browser upgrades just for those.
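
The UA+Ciphersuite breakdown suggested in point 2 could start from a heuristic like the sketch below. Everything here is an assumption for illustration: the category names, the regular expressions, the version cutoff for "modern" browsers, and the sample records are all hypothetical, and real log data would need a far more careful classifier.

```python
import re
from collections import Counter

# The two non-forward-secret ciphers still on the list, per the recap above.
NON_FS_CIPHERS = {"DES-CBC3-SHA", "AES128-SHA"}

# "Windows NT 5.1" is the User-Agent token for Windows XP.
IE_ON_XP = re.compile(r"MSIE [78]\.0.*Windows NT 5\.1")

# Assumed cutoff: a browser this recent supports forward-secret ciphers on
# its own, so a non-FS cipher suggests something in the path downgraded it.
MODERN_BROWSER = re.compile(r"(Chrome|Firefox)/(\d+)")
MODERN_MIN_VERSION = 40


def classify(user_agent, cipher):
    """Assign one request to a rough client category."""
    if cipher not in NON_FS_CIPHERS:
        return "forward-secret"
    if IE_ON_XP.search(user_agent):
        return "ie-on-xp"
    match = MODERN_BROWSER.search(user_agent)
    if match and int(match.group(2)) >= MODERN_MIN_VERSION:
        return "likely-downgrading-proxy"
    return "other-legacy"


def summarize(records):
    """Count categories over an iterable of (user_agent, cipher) pairs."""
    return Counter(classify(ua, cipher) for ua, cipher in records)


# Made-up sample records.
SAMPLE = [
    ("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)", "DES-CBC3-SHA"),
    ("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/51.0.2704.103 Safari/537.36", "AES128-SHA"),
    ("SomeSetTopBox/1.0", "AES128-SHA"),
    ("Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/51.0.2704.103 Safari/537.36", "ECDHE-RSA-AES128-GCM-SHA256"),
]

if __name__ == "__main__":
    for category, count in summarize(SAMPLE).items():
        print(category, count)
```

Keeping "ie-on-xp" as its own category also matches the conclusion in point 7, where UA detection of IE8/XP is preferred over acting on the cipher choice alone.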

https://gerrit.wikimedia.org/r/#/c/306935/ probably should've linked here. This is a sort of temporary measure to start bugging users to upgrade browsers with horrible TLS implementations, or living behind broken corporate proxies, at a very low rate. In practice it seems to be redirecting about one pageview every 5 seconds or so globally, on average.

We should probably move to a CN-based approach as in (7) above, instead of trying to make this redirect hack apply more often or more broadly. There are still some question marks about the mechanism of triggering the CN, both on a technical level (e.g. do we set a special Cookie that CN observes?) and on a usefulness level (the points above about CN JS support on ancient IE, corporate proxy confusion, etc). Either way we probably want to host the information itself somewhere that's varnish-cached (wikitech is not, presently).

Change 307977 had a related patch set uploaded (by BBlack):
text VCL: bad browser redirect: target IE8/XP more

https://gerrit.wikimedia.org/r/307977

Change 307977 merged by BBlack:
text VCL: bad browser redirect: target IE8/XP more

https://gerrit.wikimedia.org/r/307977

From the CN subtask ( T144194 ), we've learned that CN won't be a useful tool for targeting the mass ancient browsers that matter. It may be able to play a secondary role with the "evil downgrading proxy" case, though.

And over in the OpenSSL-1.1 task ( T144523 ), we're talking about the fact that 1.1 disables 3DES by-default at compile time. We can (and quite possibly might) re-enable it temporarily when we first deploy. However, as more sites on the Internet upgrade to this in general, we'll probably see a decline in IE8/XP users' ability to usefully browse the web, which should hopefully force more upgrades away from it.

I surveyed most of the current Alexa Top-100 for the 3DES issue specifically. If you filter out domains from that list which either don't support TLS at all or don't enforce redirect (i.e. still work fine over unencrypted HTTP), the only major domains I noted that are enforcing TLS and failing to support 3DES (thus already locking out IE8/XP) so far are tumblr.com and github.com. AWS has also announced that their standard ELB config has dropped 3DES support going forward for the default policies, but users can configure alternate/backdated policies to turn it back on.

We've discussed (at our offsite meetings) our strategy for removing the final pair of non-forward-secret ciphers (DES-CBC3-SHA and AES128-SHA). Will create subtasks for each outlining the details...

Jdforrester-WMF removed a project: Patch-For-Review.
Jdforrester-WMF subscribed.

I believe that the planning and execution of the work is all now complete?