
Investigate our mitigation strategy for HTTPS response length attacks
Open, MediumPublic

Description

There have been a number of (possibly still theoretical) attacks against HTTPS in which an adversary can guess which pages are being visited merely by looking at the response length. Other websites have mitigated this in various ways, e.g. Twitter puts profile pictures into certain specific size buckets.

We are particularly affected because our pages' content is a) all public and explorable via dumps, making it easier for the attacker to experiment and precompute, b) static and identical in most cases (anonymous users), and c) text, assets, and images are split across separate IPs and hence different TLS sessions, which both removes a randomizing factor and creates even more unique combinations of traffic patterns.

To mitigate this kind of attack we have to pad our responses up to certain (unguessable) sizes. There are a number of considerations that need to be explored before doing so:

  • As @csteipp points out, even a bucket classification won't be enough, as there are still enough bits of information there to make educated guesses based on click path behavior.
  • Padding the HTML with e.g. zeros will be ineffective, as gzip compression will strip most of it out. We could, however, pad the HTML with random garbage, which wouldn't be defeated by gzip (see the snippet after this list).
  • Padding the HTML means that we'd have to pad other resources separately, some of which aren't even being served from MediaWiki (e.g. images/Swift).
  • Padding to specific bucket sizes removes the precompute-from-dumps factor but does not insert any randomness into the process. A padded text page plus its associated padded images could still provide enough bits of information to identify the page visited.
  • Padding increases the content size and comes with obvious performance costs; it's essentially a security/performance tradeoff. Depending on which piece of infrastructure it happens in, it might also increase the storage and/or cache size needed.
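
A quick way to see the gzip point above (a hypothetical snippet, with a made-up body and padding length):

```
# Zero padding is compressed away by DEFLATE/gzip; random garbage is not.
import os
import zlib

html = b"<html><body>example article body</body></html>"
pad = 100_000

print(len(zlib.compress(html + b"\0" * pad)))      # a few hundred bytes: zeros vanish
print(len(zlib.compress(html + os.urandom(pad))))  # ~100 KB: random bytes survive
```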

So far, it seems likely that the best strategy would be something applied at the edge (either Varnish or nginx), potentially bucket-based but with random placement per request. It remains unknown a) whether it's possible to pad a gzip response with zeros or garbage and still have it parsed properly by UAs, and b) whether it's feasible to pad with HTTP headers, and how many/how lengthy these would need to be.
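
As a rough sketch of that edge strategy (everything here, i.e. the bucket sizes, the jitter, and padding the body with a random hex comment, is a made-up illustration rather than a settled design; in practice the padding would likely have to be applied after gzip, since hex garbage still compresses roughly 2:1):

```
# Hypothetical bucket-based padding with per-request randomness.
import os
import secrets

BUCKETS = [16 << 10, 32 << 10, 64 << 10, 128 << 10, 256 << 10, 512 << 10]
MAX_JITTER = 4 << 10  # random extra bytes so bucket boundaries aren't exact

def pad_body(body: bytes) -> bytes:
    target = next((b for b in BUCKETS if b >= len(body)), len(body))
    target += secrets.randbelow(MAX_JITTER)          # per-request randomness
    pad_len = max(0, target - len(body) - len(b"<!---->"))
    # Hex-encoded random bytes: valid inside an HTML comment (no "--"),
    # and not removable by compression the way zeros would be.
    garbage = os.urandom((pad_len + 1) // 2).hex()[:pad_len].encode()
    return body + b"<!--" + garbage + b"-->"
```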

Related Objects

(Subtasks: 1 open and unassigned; 19 resolved, 1 invalid, and 1 declined, all assigned to Vgutierrez.)

Event Timeline

faidon raised the priority of this task from to Medium.
faidon updated the task description. (Show Details)
faidon added subscribers: faidon, csteipp, BBlack, tstarling.

a) all public and explorable via dumps, making it easier for the attacker to experiment and precompute,

I don't think dumps would be useful for precomputation, since the attacker needs to know the exact size of the HTML, not the wikitext. It would be easier to crawl, and that would work against all sites, not just Wikipedia.

c) text, assets, and images are split across separate IPs and hence different TLS sessions, which both removes a randomizing factor and creates even more unique combinations of traffic patterns.

Transmission of every object is preceded by a request, seen by the attacker as an encrypted flow in the reverse direction. So you can tell the size of every object, even if it is in the same TLS session as other objects. Images are requested in the order they appear on the page.

Say you pad every object out to the next power of 2, roughly a 25% overhead on average. Then the number of bits you leak is roughly some measure of the variation in the base-2 logarithm of size, multiplied by the number of objects. For example, on https://en.wikipedia.org/wiki/Tiananmen_Square_protests_of_1989 there are 28 images, and the standard deviation of the base-2 logarithm of the sizes of those images is 2.23, which implies we are leaking about 62 bits of length information about that request, not including HTML size. By that metric, you would have to pad objects out to the next power of 16 to provide reasonable obfuscation.
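
The arithmetic behind that estimate can be reproduced with something like the following (a hypothetical helper; sizes would be the byte sizes of the page's images):

```
# Back-of-the-envelope estimate of leaked bits: spread of log2(size)
# across the page's objects, times the number of objects.
import math
import statistics

def leaked_bits_estimate(sizes):
    log_sizes = [math.log2(s) for s in sizes]
    return statistics.stdev(log_sizes) * len(sizes)

# For the article above: stdev ~= 2.23 over 28 images -> ~= 62 bits.
```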

Padding with a random number of bytes is not much better, in terms of information leakage, than padding to an exact size class. Unless you added a similar amount of padding (16x), you would be able to determine the article being viewed with high confidence. A feasible algorithm would be to calculate a distance metric between the observed object sizes and the object sizes of each known Wikipedia article: for example, pack the object sizes into a 100-dimensional vector, with zeroes representing empty image slots, and find the minimum dot product (x-y).(x-y). This will give the right answer for well-illustrated articles unless the padding is enormous.
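
A sketch of that matching attack (hypothetical; fingerprints would map article titles to precomputed, padded object-size vectors):

```
# Nearest-neighbour matching of an observed size vector against
# precomputed per-article fingerprints, using (x-y).(x-y) as the metric.
def best_match(observed_sizes, fingerprints, dims=100):
    def vec(sizes):
        v = list(sizes)[:dims]
        return v + [0] * (dims - len(v))  # zero-fill empty image slots
    x = vec(observed_sizes)
    def distance(title):
        y = vec(fingerprints[title])
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return min(fingerprints, key=distance)
```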

So I don't think HTTPS+HTML is the right protocol to provide this sort of privacy guarantee. Maybe it is possible with SPDY.

In the IETF's TLS working group, we're aiming to add a padding mechanism at the TLS layer, so that should give you a place to do the padding that won't be interfered with by gzip (or whatever other HTTP weirdness pops up) at the application layer.

If we can get consensus on that in the next few weeks (which I expect we'll be able to do), I'll try to also write it up as an extension for TLS 1.0 through 1.2.

This provides the mechanism, but no explicit policy; we need to think about the policy as well, so I'm glad this discussion is happening.

tstarling wrote:

I don't think dumps would be useful for precomputation, since the attacker needs to know the exact size of the HTML, not the wikitext. It would be easier to crawl, and that would work against all sites, not just Wikipedia.

An attacker doing this kind of work can use the dumps to generate the HTML offline and get the page size to a reasonable degree of accuracy, without tipping their hand by operating a crawler.

So I do think that the fact that dumps are available is relevant to this attack on user privacy.

I agree with tstarling that generic randomized padding is unlikely to be an effective approach on its own, and with faidon that bucket-based padding with random placement might be a useful strategy.

This is a concrete attack; see, for example, "I Know Why You Went to the Clinic: Risks and Realization of HTTPS Traffic Analysis". But the attack is a statistical one: adding padding may not make all content completely opaque, but it broadens the anonymity set substantially (especially for less-illustrated articles).

Krinkle subscribed.

(Moved to Security-General, because it's publicly visible and not related to MediaWiki core.)

BBlack changed the task status from Open to Stalled.Nov 15 2016, 2:26 PM

My current thinking on this is that it's best to wait on TLSv1.3's padding mechanism to be available. It's in the current draft at: https://tools.ietf.org/html/draft-ietf-tls-tls13-18#section-5.4

Note that TLSv1.3 doesn't include anything about bucketing or randomization strategies, it simply provides a mechanism for padding to be injected at the TLS layer, to be controlled by logic that's outside of the TLS spec's scope. At that point we'll probably still need to look at the TLS library and/or server and possibly patch them (currently OpenSSL+nginx, but could change by then).

Another angle to consider in all of this is the effect of HTTP/2 mixing concurrent streams before TLS record splitting. This doesn't defeat analysis, but it could make it more difficult and add to the net effectiveness of the TLS record padding. This would especially be true if we ever merge our text and upload frontend IPs in T116132.

My current thinking on this is that it's best to wait on TLSv1.3's padding mechanism to be available. It's in the current draft at: https://tools.ietf.org/html/draft-ietf-tls-tls13-18#section-5.4

That is now https://tools.ietf.org/html/rfc8446#section-5.4 in the approved RFC.

Do we support TLS 1.3 yet? I'm apparently connecting over 1.2 still.

Screenshot from 2019-07-12 14-32-31.png (582×768 px, 71 KB)

Note that TLSv1.3 doesn't include anything about bucketing or randomization strategies, it simply provides a mechanism for padding to be injected at the TLS layer, to be controlled by logic that's outside of the TLS spec's scope. At that point we'll probably still need to look at the TLS library and/or server and possibly patch them (currently OpenSSL+nginx, but could change by then).

For OpenSSL, I believe this is possible with https://www.openssl.org/docs/man1.1.1/man3/SSL_set_block_padding.html
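
For illustration only (plain arithmetic, not the OpenSSL API itself): with record block padding enabled, each application-data record's plaintext is rounded up to a multiple of the chosen block size, so on-the-wire lengths are only visible at that granularity. The block size below is an assumption:

```
# What block padding does to observable record sizes (illustrative).
import math

def padded_record_len(plaintext_len: int, block_size: int = 4096) -> int:
    return math.ceil(plaintext_len / block_size) * block_size

# e.g. 5300-byte and 7900-byte records both show up as 8192 bytes of
# padded plaintext (plus fixed TLS record overhead) on the wire.
```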

Do we support TLS 1.3 yet? I'm apparently connecting over 1.2 still.

No -- T170567: Support TLSv1.3

Krinkle changed the task status from Stalled to Open.Apr 2 2021, 8:02 PM

My current thinking on this is that it's best to wait on TLSv1.3's padding mechanism to be available. It's in the current draft at: https://tools.ietf.org/html/draft-ietf-tls-tls13-18#section-5.4

Subtask T170567 was resolved, and the approved IETF document keeps that section more or less intact: https://tools.ietf.org/html/rfc8446#section-5.4.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all tickets that are neither part of our current planned work nor clearly a recent, higher-priority emergent issue. This is simply one step in a larger task cleanup effort. Further triage of these tickets (and especially, organizing future potential project ideas from them into a new medium) will occur afterwards! For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

I just want to mention that I saw this attack used as an example in a book about web security, with Wikipedia explicitly mentioned as a vulnerable case.

it is worth noting that although HTTPS as such is a sound scheme that resists both passive and active attackers, it does very little to hide the evidence of access to a priori public information. It does not mask the rough HTTP request and response sizes, traffic directions, and timing patterns in a typical browsing session, thus making it possible for unsophisticated, passive attackers to figure out, for example, which embarrassing page on Wikipedia is being viewed by the victim over an encrypted channel.

The Tangled Web, page 65.

(The Tangled Web is a 2012 book by Michal Zalewski.)

Good find. I don't think he's explicitly trying to call out Wikipedia, just listing it as an example of a public resource.

Interestingly, the next sentence cites the study https://www.microsoft.com/en-us/research/publication/side-channel-leaks-in-web-applications-a-reality-today-a-challenge-tomorrow/ where they found actual information leaks in real, high-profile websites.

It's probably out of reach to fix this for users with images enabled. It may be possible to bucketize the served HTML for articles, though (one possible approach is sketched below).
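
One purely hypothetical way to pick those buckets would be from the distribution of rendered article sizes, which we could measure from the dumps, so that each bucket covers a comparable share of articles:

```
# Derive HTML padding buckets from observed article sizes (hypothetical).
import statistics

def percentile_buckets(html_sizes, n_buckets=16):
    cuts = statistics.quantiles(html_sizes, n=n_buckets)  # n-1 cut points
    return sorted(int(c) for c in cuts) + [max(html_sizes)]

def padded_size(size, buckets):
    return next((b for b in buckets if b >= size), size)
```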