
Update documentation for "https" field in X-Analytics
Closed, ResolvedPublic

Description

The documentation for the "https" field in https://wikitech.wikimedia.org/wiki/X-Analytics dates from 2014 and says:

If set, will be equal to "1", and indicates HTTPS protocol. This value could be set for all mobile traffic and not just Zero (and hopefully one day - on all desktop traffic too)

This is obviously outdated considering that we have been switching everything to HTTPS since 2015.

What is the current meaning of the field? It is not always set to "1" now:

SELECT x_analytics_map['https'] AS https_value, COUNT(*) AS requests 
FROM wmf.webrequest WHERE month = 3 AND day = 2
GROUP BY x_analytics_map['https'];

https_value	requests
NULL	350167351
1	7859037939
2 rows selected (232.636 seconds)

Can one assume that the NULL values correspond to HTTP requests? (I understand that such requests, say for http://en.wikipedia.org/wiki/Main_Page , are nowadays always answered with a 301 redirect to the corresponding HTTPS URL.)

Event Timeline


After some further research it looks like our servers nowadays answer unencrypted webrequests with a 301, not a 307 status code. ("307 Internal Redirect" is what shows up in Chrome's "Inspect" network tab, but that is a fake response generated internally by the browser to indicate that it rewrote the HTTP URL to HTTPS before sending the request, according to HSTS rules. For the same reason I now suspect that my earlier guess that HTTP requests are not logged in the webrequest table wasn't correct. I'm checking this again now with curl, as a non-HSTS user agent.)

So perhaps HTTP requests are actually still logged in the webrequest table, with NULL for x_analytics_map['https'] (a "0" would be somewhat more logical, I guess). At least, NULL/missing values correlate strongly with 301 status codes: among requests with NULL for x_analytics_map['https'], 99% have HTTP status 301. Conversely, among requests with a 301 response, 89% have NULL for x_analytics_map['https'].

SELECT SUM(If(http_status = '301', 1,0))/SUM(1) AS 301ratio, SUM(1) AS all_requests 
FROM wmf.webrequest WHERE month = 3 AND day = 2 
AND x_analytics_map['https'] IS NULL;

301ratio	all_requests
0.989539524488678	350167351
1 row selected (440.614 seconds)

SELECT SUM(If(x_analytics_map['https'] IS NULL, 1,0))/SUM(1) AS httpsNULLratio, SUM(1) AS all_requests 
FROM wmf.webrequest WHERE month = 3 AND day = 2 
AND http_status = '301';
 
httpsnullratio	all_requests
0.8884813528214741	389996293
1 row selected (722.455 seconds)

Update: I checked again how HTTP requests are logged, this time with curl (as a client without HSTS preloading) instead of Chrome:

$ curl -v 'http://de.wikipedia.org/wiki/Pulverwaldstadion'

This request did indeed register in the webrequest table, as follows:

SELECT cache_status, http_status, response_size, x_forwarded_for, x_analytics, x_cache
FROM wmf.webrequest WHERE month = 3 AND day = 4 AND hour = 4 
AND uri_path LIKE '%Pulverwaldstadion' LIMIT 10000;

cache_status	http_status	response_size	x_forwarded_for	x_analytics	x_cache
int-front	301	0	NULL	-	cp4028 int
1 row selected (138.869 seconds)

I.e. the entire X-Analytics header was empty, so the https field was indeed NULL, and the HTTP status was 301 as expected.

Note that successful non-HTTPS requests evading our standard HTTPS redirect code are still possible under some circumstances. The circumstances are:

  1. The request's HTTP method must be GET or HEAD.
  2. The hostname in the Host: header must be outside of our set of canonical domains (the regex at https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L40 ).

Aside from the fact that, of course, anyone can inject any random Host-header towards any IP, we do have a long list of non-canonical domains in our DNS that we continue to support to some degree and explicitly map to our loadbalancers, mostly as insecure redirects.

As an example:

We host the DNS for the domain wikipedia.ee and point it at our standard text endpoint. HTTPS requests to this domain fail because we have no matching certificate for it. Unencrypted HTTP requests using method GET or HEAD succeed for e.g. http://wikipedia.ee: they pass through Varnish to MediaWiki, and the MW Apache configuration returns a 301 redirect to http://et.wikipedia.org/ , which gets sent back to the UA through Varnish. The redirect destination does match our canonical domain regex and is covered by our certs, so the second request will result in our normal HTTPS redirect and is also HSTS-preload protected in modern UAs.

In the case of the initial request to wikipedia.ee above, there is an X-Analytics header with other data (e.g. nocookies=1 in the case of a cookie-less curl), but the https field is missing and thus NULL, as expected.
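To make the two NULL cases above concrete, here is a minimal Python sketch of how a raw X-Analytics value could map to the https field. It assumes the header is a semicolon-separated list of key=value pairs and that a missing header is logged as "-" (as in the curl test earlier in this task). The function names are hypothetical and this is not the actual refinery parsing code.

```python
def parse_x_analytics(raw):
    """Parse a raw X-Analytics value into a dict (illustrative sketch only)."""
    # Assumption: an absent header is logged as "-" or as an empty string.
    if raw is None or raw.strip() in ("", "-"):
        return {}
    fields = {}
    # Assumption: fields are separated by ";" and written as key=value.
    for pair in raw.split(";"):
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def https_value(raw):
    # dict.get returns None for a missing key, mirroring the NULL seen in Hive.
    return parse_x_analytics(raw).get("https")
```

With this sketch, both `https_value("-")` (empty header) and `https_value("nocookies=1")` (header present, https missing) give `None`, while `https_value("https=1;nocookies=1")` gives `"1"`.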

While most of the well-known cases like the above result in a 301 from MediaWiki, it is possible for MW to be configured to return actual content rather than redirects for some of them, and this isn't routinely audited for mistakes. There are also likely to be non-canonical domains we own and point at MW which are not configured at all on the MW side, which would behave similarly to what happens when a client injects a random Host: header. In those cases, MW will return a 200 OK response over HTTP with a brief error text stating "This domain points to a Wikimedia Foundation server, but is not configured on this server." (I'm not sure why these aren't 404s, but there may be a pragmatic reason.) You can see this style of response by trying e.g.: curl -v -H "Host: example.org" http://en.wikipedia.org/.

All of the various sub-cases in the above paragraph could give webrequest results with status 200 (or potentially others) rather than 301, and also do not have https=1 set in their analytics fields.

Past analyses have indicated that these non-canonical domains are a statistically tiny portion of our overall traffic, so they can be hard to notice. There are other open tickets about resolving this issue of insecure non-canonical redirects, with the goal of reaching a state where all non-HTTPS traffic results in an immediate redirect to the HTTPS variant of the same URL, and all HTTPS URLs for all domains we own actually work (some matching certificate is configured at the destination). The work is, however, non-trivial and incomplete, mostly because of the large number of non-canonical domains we own.

Very informative, thanks @BBlack! So I understand that these cases would explain the 1% of requests in T188807#4021737 that have NULL for x_analytics_map['https'], but an HTTP status other than 301.

And is it correct to assume that, besides those HTTP --> HTTPS redirects, there are other cases where we send a 301 response? (explaining the 11% missing in T188807#4021737 )

Getting back to the task description: How confident are we by now that a NULL value for x_analytics_map['https'] always corresponds to an HTTP request?

And is it correct to assume that, besides those HTTP --> HTTPS redirects, there are other cases where we send a 301 response? (explaining the 11% missing in T188807#4021737 )

Yes, there are many cases where a 301 is returned for an HTTPS request (which are unrelated to HTTP->HTTPS redirects).

Getting back to the task description: How confident are we by now that a NULL value for x_analytics_map['https'] always corresponds to an HTTP request?

Looking at this from a perspective of the logic/branching of the VCL code:

  • Trivially, the only value we set for https is 1. Therefore there are only two possible values: 1 or NULL.
  • It's important to distinguish public requests from internal requests. When internal services send requests to our public endpoints (e.g. for some reason MediaWiki makes a sub-request back to en.wikipedia.org again), we allow them to fake that they're using HTTPS when they might not be. In these cases, https=1 might be set, when in fact the protocol was un-encrypted HTTP. These requests are internal to our own network, though. It's not possible for external 3rd parties to fake the HTTPS-ness of a request in this fashion.
  • Setting aside the exception above, it's pretty easy to prove in the logic that for requests from truly external clients, https=1 is never set if the protocol was unencrypted HTTP.
  • However, the question you're asking is the opposite of this, and is trickier to answer. It can be re-phrased as "Is it possible for an encrypted HTTPS request to have a NULL value for x_analytics_map['https']?"
    • For response codes <400 (non-errors): the confidence of a "no" should be very high, as all non-error paths clearly call the code in question to set the https=1 field for HTTPS requests.
    • For most cases involving response codes >= 400, it's also easy to prove that logically, https=1 is always set for HTTPS requests.
    • However, there are almost certainly stranger and/or rarer situations where an HTTPS request may result in a status code of 400 or higher and not get tagged with https=1.
    • For example, there are probably cases where Varnish can immediately respond to an encrypted HTTPS request with an error code like 400 Bad Request and never invoke our analytics delivery-time code, and thus the whole x_analytics field will be NULL even though the request was encrypted. There may be other edge-cases around other rare-ish error conditions as well. Analyzing for all such cases turns out to be a fairly deep and thorny thing to be completely sure about, unfortunately.
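The reasoning in the list above can be condensed into a small decision rule. The following Python sketch is only an illustration of that reasoning for requests from external clients; the function name and the three labels are made up for this example, and it deliberately ignores the internal-request exception, where https=1 can be set on unencrypted traffic.

```python
def classify_protocol(https_field, http_status):
    """Rough guess at the protocol of an external request (sketch only).

    https_field: value of x_analytics_map['https'] ("1" or None).
    http_status: HTTP status as an int or a numeric string such as "301".
    """
    if https_field == "1":
        # For external clients, https=1 is never set on unencrypted HTTP.
        return "https"
    if int(http_status) < 400:
        # All non-error paths set https=1 for HTTPS requests, so a
        # missing field here points to plain HTTP with high confidence.
        return "http"
    # Error responses: usually HTTP, but rare HTTPS edge cases (e.g. an
    # early 400 Bad Request) may also lack the field.
    return "unclear"
```

For instance, `classify_protocol(None, "301")` returns `"http"`, while `classify_protocol(None, 400)` returns `"unclear"`.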

@BBlack Thanks again! Back to the task at hand: I have tentatively updated the documentation based on my understanding of your remarks: https://wikitech.wikimedia.org/w/index.php?title=X-Analytics&diff=1785092&oldid=1780937
Could you check that I didn't misunderstand anything?
Also, considering that the previously noted contact person for this field is long gone, I took the liberty of listing you and the Traffic team instead. I hope that works for you folks.

Closing this, as I think the documentation of the https field has been updated.