Page MenuHomePhabricator

TLS stats regression related to Chrome/41 on Windows
Closed, ResolvedPublic

Description

We've been looking at this issue off and on for over a week now, and it's gotten complex enough it deserves a ticket.

ecdhe-ecdsa-aes128-sha is one of our mid ciphers - it's forward-secret (good), but not AEAD (not the best). This and other ciphers in its category have generally been on a slow downward trend over the past several months in favor of higher-security ciphers as older clients get upgraded. However, this one cipher has turned around a bit lately in our stats, and it's puzzling and potentially highlights an issue we don't understand here.

It first began trending upwards notably starting around June 1, but the rate of increase was negligible at the time. Around July 19th a more significant rise began, leveling off to a new high by around July 23. At this point it's ~3-4% higher (in terms of overall request-rate) than we expect from historical trends and is significant. We don't have historical data that correlates our cipher statistics against other factors (such as geographic location, user-agent, etc), so investigating this is a bit tricky without insightful historical comparisons.

In live logging today, we see the bulk of the influx on this cipher seems to be user agent strings indicating a very specific version of Chrome/41. The UA strings look like: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36, but the Windows version near the front does vary, at least as far back as Win8 and maybe Win7. These UAs (even when filtering on the UA rather than the cipher that lead us here) predominantly negotiate (according to our nginx header output) as TLSv1.1 with the ecdsa-ecdhe-aes128-sha cipher . We know Chrome/41 should be capable of TLSv1.2 and stronger (AEAD) ciphers, and probably was negotiating them before whatever event precipitated this.

There have been lots of avenues we've looked at, but so far the most-promising has been that Microsoft released some updates to their SChannel and CryptoAPI libraries for all supported Windows versions around the right timeframe for this, and Chrome/41 probably makes indirect use of those libraries for some parts of TLS negotiation and/or certificate verification.

The Windows 7 version of this update can be seen here (similar ones exist for other versions): https://support.microsoft.com/en-us/help/22801/windows-7-and-windows-server-2008-r2-update-history

It's possible (but not very!) that this is interrelated with recent reports of a DHE compatibility issue between OpenSSL and Windows, which is discussed here: https://github.com/openssl/openssl/pull/1320 . We're in the process of deploying the bugfix for this starting today - https://gerrit.wikimedia.org/r/#/c/301903/ .

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
BBlack updated the task description. (Show Details)

The DHE-related Windows bugfix is deployed, and so far doesn't seem to have any effect on this (which I kinda expected, but had to check!). This bug was only supposed to happen on ~1/256 connection attempts anyways on average, and it was a stretch that it might be inter-related with this ticket.

Looks like buggy Microsoft updates (similar/related to the one(s) linked before) are the culprit. Some related reporting:

http://www.infoworld.com/article/3099109/microsoft-windows/microsoft-yanks-buggy-speed-up-patch-kb-3161608-replaces-it-with-kb-3172605-and-3172614.html
http://answers.microsoft.com/en-us/windows/forum/windows_7-networking/problems-with-kb-3161608-and-kb-3161639/2cd5ffb3-c203-4080-872f-73de1a96e080?auth=1

I'm guessing we'll just have to wait for Microsoft to eventually release patches that fix all the TLS bugs caused by these patches...

"KB 3161639" is an interesting search term. That's what's most-often cited in related bug reports. Some of the reports sound like this actually *is* related to the interop bug on DHE ciphers (which we've applied a fix for on our end, to no avail), but it may just be that they've introduced more than one bug at the same time here.

The history around the Windows updates involved is confusing (it seems they released a bad patch, then stopped offering it as an update, then started offering a replacement patch that was merely optional? It's confusing as hell). But all of that aside, apparently the crux of the issue is that the update added two new DHE-based ciphers, and also re-ordered (and perhaps expanded?) the cipher list significantly. The new list is 1449 textual bytes long, and apparently it breaks things if it's longer than 1023 characters.

The new list is supposedly:

TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P256
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P384
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA_P256
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA_P384
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA_P256
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA_P384
TLS_DHE_RSA_WITH_AES_256_GCM_SHA384
TLS_DHE_RSA_WITH_AES_128_GCM_SHA256
TLS_DHE_RSA_WITH_AES_256_CBC_SHA
TLS_DHE_RSA_WITH_AES_128_CBC_SHA
TLS_RSA_WITH_AES_256_GCM_SHA384
TLS_RSA_WITH_AES_128_GCM_SHA256
TLS_RSA_WITH_AES_256_CBC_SHA256
TLS_RSA_WITH_AES_128_CBC_SHA256
TLS_RSA_WITH_AES_256_CBC_SHA
TLS_RSA_WITH_AES_128_CBC_SHA
TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384_P384
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256_P256
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256_P384
TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384_P384
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256_P256
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256_P384
TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA_P256
TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA_P384
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA_P256
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA_P384
TLS_DHE_DSS_WITH_AES_256_CBC_SHA256
TLS_DHE_DSS_WITH_AES_128_CBC_SHA256
TLS_DHE_DSS_WITH_AES_256_CBC_SHA
TLS_DHE_DSS_WITH_AES_128_CBC_SHA
TLS_RSA_WITH_3DES_EDE_CBC_SHA
TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA
TLS_RSA_WITH_RC4_128_SHA
TLS_RSA_WITH_RC4_128_MD5
TLS_RSA_WITH_NULL_SHA256
TLS_RSA_WITH_NULL_SHA
SSL_CK_RC4_128_WITH_MD5
SSL_CK_DES_192_EDE3_CBC_WITH_MD5

If you look for the ~1024 byte mark within the text of that list, it falls smack in the middle of the same un-ideal cipher we're now selecting. So, definitely related, although it's difficult to reason about exactly how the bug works from here without such a machine to test on.

I take that back, I've found a way to reason about the latter part. There are supposed to be commas on the end when counting bytes. Counting them like that, the 1024-boundary happens in TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA_P384, which is the one just above the TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA_P256 cipher they end up picking with us. So likely the behavior with a 1024+ byte list is that when the limit is reached it basically starts the list over again with the next cipher as the first one in the list.

Amended paste with the cutoff annotated with 3 comment lines injected in the midde:

TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P256,
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384,
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256,
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P384,
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA_P256,
TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA_P384,
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA_P256,
TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA_P384,
TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,
TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,
TLS_DHE_RSA_WITH_AES_256_CBC_SHA,
TLS_DHE_RSA_WITH_AES_128_CBC_SHA,
TLS_RSA_WITH_AES_256_GCM_SHA384,
TLS_RSA_WITH_AES_128_GCM_SHA256,
TLS_RSA_WITH_AES_256_CBC_SHA256,
TLS_RSA_WITH_AES_128_CBC_SHA256,
TLS_RSA_WITH_AES_256_CBC_SHA,
TLS_RSA_WITH_AES_128_CBC_SHA,
TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384_P384,
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256_P256,
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256_P384,
TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384_P384,
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256_P256,
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256_P384,
TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA_P256,
TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA_P384,
# XXX the 1024 mark is inside the above ^
# XXX if the effective list starts here, the first cipher below matches what
# they negotiate with us (and make sense based on ours + theirs lists):
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA_P256,
TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA_P384,
TLS_DHE_DSS_WITH_AES_256_CBC_SHA256,
TLS_DHE_DSS_WITH_AES_128_CBC_SHA256,
TLS_DHE_DSS_WITH_AES_256_CBC_SHA,
TLS_DHE_DSS_WITH_AES_128_CBC_SHA,
TLS_RSA_WITH_3DES_EDE_CBC_SHA,
TLS_DHE_DSS_WITH_3DES_EDE_CBC_SHA,
TLS_RSA_WITH_RC4_128_SHA,
TLS_RSA_WITH_RC4_128_MD5,
TLS_RSA_WITH_NULL_SHA256,
TLS_RSA_WITH_NULL_SHA,
SSL_CK_RC4_128_WITH_MD5,
SSL_CK_DES_192_EDE3_CBC_WITH_MD5

The one we actually want them to negotiate is 7 items up from the breaking point: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256_P256

Also, should've mentioned the 1024-ish byte hint comes from Cisco : http://www.cisco.com/c/en/us/support/docs/customer-collaboration/unified-contact-center-express-1061/200556-UCCX-10-6-Pages-Does-not-Load-in-IE11-Af.html

MS has released some new rolled-up bugfixes for their Aug 9th "Patch Tuesday", which includes this gem that disables RC4 ciphersuites: https://support.microsoft.com/en-us/kb/3151631 . It's possible they've silently fixed other related bugs as part of the process, or that it will somehow indirectly help mitigate or change this situation, but it will probably be a while before we see the full impact of users installing this and restarting machines/browsers.

This traffic is still going strong, and we still don't have a solid explanation. To recap some further investigation since:

The common pattern is these UAs are doing a GET / on a wiki (commonly enwiki in the US, but split beween enwiki and ruwiki for the most part in esams). They get the expected 301 to /wiki/Main_Page (or the ruwiki equivalent in that case), and follow it fetching that. They commonly send an If-Modified-Since with this request and get a 304 response, meaning the browser does still have the page cached from the last time they did this cycle. There are relatively-few IPs doing this, and they tend to repeat the cycle incessantly, up to a rate of roughly once per second, at least over our short sampling windows. They do eventually go away for a time. There are 200 responses at a much lower rate. These would be the first query of the pattern for a given UA (browser startup?), and also every time Main_Page actually updates (fairly frequently).

Once again, the offending UAs are mostly Chrome/41.0.2272.76 on some variant of Windows (Windows NT (6\.[0123]|10\.0), which covers Vista->Win10). I've very rarely seen two other cases that seem similar: one old version of Firefox on Windows, and that same Chrome/41.0.2272.76 on Linux, but their rate is far far lower than the Chrome/41+Windows case, and they could be genuine legitimate user actions that just happen to match my filters.

My thinking at this point is I might have our VCL code look for this situation and return an error page. One whose details (for anyone that looks hard, anyways) might point the users back to this ticket for further investigation. I'm hoping the error code itself will stop the reload-madness and drop the errant stats back down, too.

Change 305029 had a related patch set uploaded (by BBlack):
Experimental error handling for buggy Win Chrome/41

https://gerrit.wikimedia.org/r/305029

Change 305029 merged by BBlack:
Experimental error handling for buggy Win Chrome/41

https://gerrit.wikimedia.org/r/305029

Change 305048 had a related patch set uploaded (by BBlack):
VCL syntax bugfix for fa3cb7e2c

https://gerrit.wikimedia.org/r/305048

Change 305049 had a related patch set uploaded (by BBlack):
Chrome/41 fix: move to backend_response

https://gerrit.wikimedia.org/r/305049

Change 305049 merged by BBlack:
Chrome/41 fix: move to backend_response

https://gerrit.wikimedia.org/r/305049

Change 305050 had a related patch set uploaded (by BBlack):
Chrome/41 fix: move to vcl_recv

https://gerrit.wikimedia.org/r/305050

Change 305050 merged by BBlack:
Chrome/41 fix: move to vcl_recv

https://gerrit.wikimedia.org/r/305050

Change 305053 had a related patch set uploaded (by BBlack):
Chrome/41 fix: try 403

https://gerrit.wikimedia.org/r/305053

Change 305053 merged by BBlack:
Chrome/41 fix: try 403

https://gerrit.wikimedia.org/r/305053

Change 305055 had a related patch set uploaded (by BBlack):
Chrome/41 fix: maybe retry-after will slow it down?

https://gerrit.wikimedia.org/r/305055

Change 305055 merged by BBlack:
Chrome/41 fix: maybe retry-after will slow it down?

https://gerrit.wikimedia.org/r/305055

Change 305058 had a related patch set uploaded (by BBlack):
Chrome/41 fix: 401 seems to stop it cold...

https://gerrit.wikimedia.org/r/305058

Change 305058 merged by BBlack:
Chrome/41 fix: 401 seems to stop it cold...

https://gerrit.wikimedia.org/r/305058

Update from the spam of VCL experiments above (and several that weren't through the repo as one-off tests on one host):

In attempting to interfere with the initial requests to / from the offending UA, I tried several approaches based on response status codes (all with a link to this ticket in the Reason field in case a real human sees the error), none of which worked:

  • 400 (Bad Request) - No real impact (other than not following the usual 301 to Main_Page, as with all these trials)
  • 403 (Forbidden) - Again no impact
  • 302 (Found), with a redirect to /_nonexistent and a Retry-After: header of 75 seconds. They never hit the fake URL afterwards, and again didn't slow down the requests to / from any given UA.
  • 503 (Service Unavailable), with Retry-After: set to 300-something seconds. They still kept spamming requests to / regardless.

The approach that did work was to send a 401 (Unauthorized) response, which generally causes a UA to pop up a username/password field. I set the related WWW-Authenticate header to specify an authentication "realm" which again contains a link to this ticket, which might get shown to any human that sees it. At least in brief isolated testing this "works", in that the spamming UAs stop spamming so hard. Will take a few minutes to see stats fallout, and of course we'll see if any live humans show up in this ticket reporting it...

The interesting bit is that while these were prominent requests, apparently Chrome/41-on-Windows isn't the whole story of the mysterious rise in ECDHE-ECDSA-AES128-SHA. There's still a whole lot more of these requests out there, and the remaining favorite seems to be requests to geoiplookup.wikimedia.org, mostly with no UA string at all. This ties all of this back into: T100902

There's not much left to do here and we're no longer actively investigating. However, I'd like to try removing the 401 hack at some point, to see if the offending clients are gone for good or not, before closing this.

Change 313828 had a related patch set uploaded (by BBlack):
Text VCL: remove synth side of Win Chrome/41 workaround

https://gerrit.wikimedia.org/r/313828

At least in the initial few minutes after removing the workaround, there's no apparent return of the bad traffic. Will leave this for a few days to be sure before reverting the synth side as well and resolving this.

Change 313828 merged by BBlack:
Text VCL: remove synth side of Win Chrome/41 workaround

https://gerrit.wikimedia.org/r/313828

BBlack claimed this task.

So far there doesn't seem to be any recurrence of the stats anomaly when removing the workaround. Closing for now, barring further evidence to re-open.

Of course, I spoke too soon. The stats anomaly is slowly becoming visible again, and live logging confirms these broken clients are starting up their old patterns again. At this point I think it's more likely a fake UA string from some kind of benchmarking or other software that's abusing us as a testpoint (this was brought up by someone else in a related ticket or email thread, somewhere).

I'm putting the VCL blocking back in place now, but with an additional constraint on the strangely-backdated TLS negotiation as well so that it's less likely to impact any legitimate users of this particular Chrome/41 version.

Change 314573 had a related patch set uploaded (by BBlack):
Text VCL: workaround fake/buggy Chrome/41 again

https://gerrit.wikimedia.org/r/314573

Change 314573 merged by BBlack:
Text VCL: workaround fake/buggy Chrome/41 again

https://gerrit.wikimedia.org/r/314573

At this point I think it's more likely a fake UA string from some kind of benchmarking or other software that's abusing us as a testpoint (this was brought up by someone else in a related ticket or email thread, somewhere).

For the sake of archives: T141506#2574732, maybe.

Closing this for lack of anything real to do here, it's just cluttering task lists. We'll leave the workaround in place indefinitely until we find a reason to do otherwise.

Change 791005 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] varnish: Clean up Chrome 41 TLSv1.1 synth responses

https://gerrit.wikimedia.org/r/791005

Change 791005 merged by Vgutierrez:

[operations/puppet@production] varnish: Clean up Chrome 41 TLSv1.1 synth responses

https://gerrit.wikimedia.org/r/791005