For reference on our ciphersuite list, see: https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/ssl_ciphersuite.rb
With our current ciphersuite code, switching a server/service to "mid" is how we disable non-Forward-Secret clients, which is the next obvious step up in our TLS security for all clients in general.
Rationale
- By getting to 100% Forward Secrecy, we remove some incentive for third parties to log bulk encrypted traffic and later attempt to steal the private key to decrypt the unprotected portions of it. We've already removed some of that incentive by getting our Forward Secrecy up into the 99%+ range, but there could still be incentives to go after the remaining non-FS traffic in various scenarios and regions.
- By eliminating these older / non-FS cipher options from our servers, we're categorically preventing unknown future protocol downgrade attacks that might attempt to force newer and better clients to negotiate these weaker options.
- The remaining clients which need the "compat" ciphers being eliminated are, categorically, all very ancient and insecure. The dominant client in this set, IE8-on-XP, is almost comically insecure. We want to promote users off of these insecure platforms for three reasons: to have better trust in the authenticity of our users' traffic; to help the users themselves (who likely, and unwittingly, use these platforms for other secure computing tasks as well); and to improve the security landscape of the internet generally, since such platforms are easy targets for botnet enslavement (and such botnets are in turn used to attack Wikipedia and other sites).
Issues Preventing
Moving up to "mid" for a given server/service probably shouldn't be done for any implementation that doesn't have proper DHE support (as otherwise the list of incompatible clients is longer than discussed in this ticket). Currently that means only nginx can move to "mid". Apache can do it in theory as well, but the latest apache packages we have deployed (apache-2.4 from jessie upstream) aren't capable: we need to at least re-compile them against OpenSSL 1.0.2 first.
When speaking of common human browser agents, moving to "mid" with DHE support primarily means dropping compatibility with IE7-8 on WinXP (IE8 on later versions of Windows is unaffected, as are alternative browsers on WinXP). Other compatibility fallout includes ancient feature-phones, older media devices / set-top boxes, and un-updated/commercial Java6-based code. In another sense, this switch gets us past a very long unknown tail of clients that were published before anyone was paying significant attention to TLS security issues in general.
For our primary wiki traffic (text-lb, upload-lb), we probably won't be able to rationally switch to "mid" for quite some time, possibly a year or more out from now, as we still have a non-ignorable IE8/XP user population we can't cut off yet. However, I think we can start the process by moving other services to "mid" ahead of the eventual switch of the primary clusters. We could start with one-off services that are more technical in nature, which normal users would rarely connect to and aren't critical to them, such as icinga.wikimedia.org.
One of the other logical steps before (perhaps long before) switching the primary clusters is to switch the "misc" cluster, which hosts a variety of non-primary services, including some that non-technical users are more likely to need access to, such as annual, transparency, scholarships, iegreview, planet, etc.
How/When do we move forward?
Right now we don't even have any clearly-defined standards or thresholds for how we'll make the "mid" decision for a given service/cluster. Some possible candidates would be:
- Percentage of requests
We could define an arbitrary threshold in terms such as "When affected request percentage to a given service falls below 0.1% averaged over an entire month" (where "affected" means clients that would be shut out by a switch to "mid"). We could define different thresholds for different services, too, but note that a whole cache-cluster is a "service" for these purposes. So all wikis are bundled together, as are all services on the "misc" cluster.
For reference, when averaging requests over all of our cache endpoints (text, upload, misc, etc. combined), the affected percentage for a switch to "mid" over the previous 30 days as I write this is 0.7%. If we broke it down by cluster, it's likely that the number is lower on misc than on text and upload.
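As a rough illustration of how a request-percentage threshold like the one above might be evaluated, here is a minimal Python sketch. The function names, the per-day `(affected, total)` input format, and the choice to pool all requests over the window (rather than averaging daily percentages) are assumptions made for the example, not anything we have implemented; the 0.1% default is just the example threshold quoted above.

```python
def affected_percentage(daily_counts):
    """Return the affected-request percentage pooled over a window.

    daily_counts: iterable of (affected_requests, total_requests) tuples,
    one per day (hypothetical input format; "affected" means requests from
    clients that would be shut out by a switch to "mid").
    """
    counts = list(daily_counts)
    affected = sum(a for a, _ in counts)
    total = sum(t for _, t in counts)
    if total == 0:
        raise ValueError("no requests in the window")
    return 100.0 * affected / total


def below_mid_threshold(daily_counts, threshold_pct=0.1):
    """True if the pooled affected percentage is under the threshold."""
    return affected_percentage(daily_counts) < threshold_pct


# Example: 30 days at 7 affected requests per 1000 is 0.7%, which is
# still above a 0.1% threshold.
window = [(7, 1000)] * 30
print(below_mid_threshold(window))
```

Pooling weights busy days more heavily than quiet ones; averaging the daily percentages instead would be an equally defensible reading of "averaged over an entire month".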
- Age of newest affected clients
We could do some age-based decisions on how outdated the affected clients are, and/or how long they've been vendor-unsupported. The primary case here for moving to "mid" is IE8-on-XP.
Relevant data on WinXP from https://en.wikipedia.org/wiki/Windows_XP#Support_lifecycle : WinXP was end-of-life'd (after a ton of pre-announcement and campaigning) on April 8, 2014, a little over 1.5 years ago at this time. Shortly before the EOL date, Microsoft pushed an update which causes WinXP to show users a monthly pop-up reminding them of the EOL status and urging them to upgrade off of it. The last major updated release of Windows XP (SP3) was released in mid-2008, and the final cutoff of licensing to manufacturers to install new copies of XP SP3 was in 2010.
Importantly, it is widely known at this time that WinXP has multiple security vulnerabilities which are remotely exploitable and were discovered after the End-of-Life date, for which there are no security update patches. Knowledge of how to exploit these vulnerabilities is widespread, making it a completely insecure platform at this point in time.
However, it still enjoys about 10% client platform popularity in various studies currently, with significantly higher percentages in certain locales such as China. In our own User-Agent statistics we're seeing MSIE 8 (which may not be on XP, but probably is predominantly on XP) at 1.28% of all requests, and XP paired with any browser (which includes unaffected third-party browsers) at 3.88%. Again, the total affected percentage in our TLS request stats is 0.7%, which would represent MSIE8-on-XP specifically (the intersection of those two populations) plus several other far-less-popular long-tail clients.
- When they eventually break security for other clients
This is the reactive last-resort option. We can wait until it becomes absolutely necessary because there's a known and/or announced vulnerability that affects other more-modern clients' TLS connections, for which the only fix is to disable some older ciphersuites and/or TLS protocol versions, which in turn implies an effective switch to "mid" or higher. This is essentially what happened in late 2014 with the POODLE exploit, which required us to disable SSLv3, which in turn shut off support for IE6-on-XP.
What's bad about this option is that whenever this future date hits, there's a good chance that said exploit will already have been known and exploited by some third parties ahead of our knowledge and reaction. It would be more proactive if we got past the "mid" hurdle before such an event recurs, rather than as a reaction to it.
Regardless, we could/should look at phased approaches other than by service/cluster when it comes to IE8/XP specifically. Some options we could deploy in advance would include things like CentralNotice banners triggered only for IE8/XP urging users to upgrade away from it, or possibly disabling some features or capabilities for such ancient clients (logging into accounts with special privileges? editing in general?). Open question whether either of those should be triggered just on IE8/XP specifically, or as a blanket rule for any client which fails to negotiate a forward-secret cipher.
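To make the "blanket rule" variant concrete, here is a hedged Python sketch of how a negotiated ciphersuite could be classified as forward-secret or not. It assumes OpenSSL-style cipher names, in which forward-secret suites are the ones using an ephemeral (EC)DHE key exchange and so begin with `ECDHE-`, `DHE-`, or the older `EDH-` spelling. The function names are hypothetical, and how the negotiated cipher name would actually reach this logic is left open.

```python
# OpenSSL-style name prefixes for ephemeral Diffie-Hellman key exchange.
# Assumption: cipher names arrive in OpenSSL notation, e.g.
# "ECDHE-RSA-AES128-GCM-SHA256" rather than the IANA "TLS_..." form.
FS_PREFIXES = ("ECDHE-", "DHE-", "EDH-")


def is_forward_secret(cipher_name):
    """True if the OpenSSL-style cipher name uses an ephemeral key exchange."""
    return cipher_name.upper().startswith(FS_PREFIXES)


def needs_upgrade_notice(cipher_name):
    """Hypothetical blanket rule: flag any client that did not negotiate FS."""
    return not is_forward_secret(cipher_name)


for name in ("ECDHE-RSA-AES128-GCM-SHA256", "DHE-RSA-AES128-SHA",
             "AES128-SHA", "DES-CBC3-SHA"):
    print(name, needs_upgrade_notice(name))
```

Keying on the negotiated cipher rather than the User-Agent string would catch the long tail of non-FS clients along with IE8/XP, at the cost of not being able to tailor the message to a specific platform.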