
Investigate and fix odd uri_host values
Open, MediumPublic


If current webrequest data is to be believed, we are counting some traffic to other websites (Google, Facebook, Tencent, ...) as our own.

Also, why are requests to internal IPs (10.25....) being logged?

SELECT uri_host, COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2018 AND month = 3 AND day = 2
  AND uri_host NOT LIKE '%wik%'
  AND uri_host NOT LIKE ''
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 50;

uri_host	requests
varnishcheck	2126018
-	338275
[48 further rows: assorted hostnames and IP literals, with request counts from 84440 down to 641]

50 rows selected (244.202 seconds)
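The filter-and-count logic of the Hive query above can be sketched in miniature like so; the sample hostname values here are invented stand-ins for real log entries, not actual webrequest data:

```python
from collections import Counter

# Hypothetical sample of uri_host values as they might appear in the logs.
sample_hosts = [
    "", "", "varnishcheck",
    "", "-", "varnishcheck", "",
]

# Mirror the query's filter: drop wiki hostnames and empty values,
# then count the remainder in descending order of frequency.
suspicious = Counter(h for h in sample_hosts if "wik" not in h and h != "")
for host, requests in suspicious.most_common(50):
    print(host, requests)
```

As in the real query, internal healthcheck traffic (varnishcheck) and requests with no Host header at all (logged as "-") float to the top alongside hostnames we don't serve.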

Event Timeline

The bottom line is that the value of uri_host is entirely up to the client, and therefore subject to client-side stupidity. It's legal (in every protocol sense) for a client to connect to our public IP address over HTTP or HTTPS (in the latter case legitimately matching one of our SSL certificates), and then send an HTTP(S) request that looks like...

GET / HTTP/1.1

... in which case we'll log a request with whatever uri_host value the client chose to send. In practice, for all the random unknown values that can be plugged in there, it's up to MediaWiki to reject such a request with an appropriate error (which, as recently discussed in another ticket, ends up being an error output page served with HTTP status 200).
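The mechanics can be sketched as follows. The hostname used and the parse_uri_host helper are purely illustrative (this is not the actual varnishkafka logging code); the point is that the logged uri_host is whatever string the client put in the Host header, or "-" when none was sent:

```python
# A minimal sketch: the Host header is client-chosen, so the logged
# uri_host can be any string at all.

def build_request(host_value: str) -> bytes:
    """Raw HTTP/1.1 request a client could send after connecting to our IP."""
    return (
        "GET / HTTP/1.1\r\n"
        f"Host: {host_value}\r\n"
        "\r\n"
    ).encode()

def parse_uri_host(raw: bytes) -> str:
    """What a server-side logger would record as uri_host."""
    for line in raw.decode().split("\r\n"):
        if line.lower().startswith("host:"):
            return line.split(":", 1)[1].strip()
    return "-"  # no Host header at all: logged as '-'

print(parse_uri_host(build_request("")))  # →
print(parse_uri_host(b"GET / HTTP/1.1\r\n\r\n"))      # → -
```

The "-" case also explains the second-largest row in the query results above: requests that carried no Host header whatsoever.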

There are valid reasons we cannot and should not enforce that the SSL-level hostname used for certificate matching agrees with the HTTP-level Host: header. The most obvious is that HTTP/2 can coalesce requests for several of our hostnames onto the same open connection when it sees that our certificate wildcards and IP addresses cover all the hostnames involved. And in any case, we'd still have the same problem for unencrypted HTTP requests, where there's nothing else to validate against.

One thing we could do is match against a regex of all the legitimate domain names we own, but this is costly and complicated compared to just letting these errors through to MediaWiki as they are. Even if we did check legitimacy in this sense, we'd still owe the client an error response of some kind (e.g. a 404), so we'd still be logging whatever crazy hostname they sent us when we log that 404 response regardless.

All of that being said, a few of the values shown above warrant some explanation:

  • varnishcheck - This is internal healthcheck traffic. Arguably, it could/should be excluded from webrequest logging completely. Maybe we should make a ticket about that.
  • Bare IP literals - These are our actual public IPs that clients connect to. So in some sense, they're more-legitimate values for a client to send than most of the rest of the list. However, requests with such a hostname don't do anything useful for the client in practice, so they're still a client-side error.
  • The rest is all junk that can be binned as general-case client stupidity of one kind or another. In particular, note that the 10.25.* range seen above is not actually one of our legitimate internal subnets, so those IPs have nothing to do with our infrastructure. If they have any meaning at all, it's specific to the client's own internal infrastructure/configuration, and the client is erroneously leaking those (proxy?) IPs to us as Host-header values.
ema triaged this task as Medium priority. Mar 29 2018, 8:27 AM

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!