
Investigate and fix odd uri_host values
Open, Normal, Public

Description

If current webrequest data is to be believed, we are counting some traffic addressed to other websites (Google, Facebook, Tencent, reference.com, ...) as our own.

Also, why are requests to internal IPs (10.25....) being logged?

SELECT uri_host, COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2018 AND month = 3 AND day = 2
  AND uri_host NOT LIKE '%wik%'
  AND uri_host NOT LIKE '%mfusercontent.org'
GROUP BY uri_host
ORDER BY requests DESC
LIMIT 50;

uri_host	requests
varnishcheck	2126018
-	338275
www.site	84440
site	82470
198.35.26.96	63687
91.198.174.192	43220
208.80.153.224	40842
208.80.154.224	9922
10.25.4.8	8638
10.25.5.11	8638
10.25.5.7	8638
10.25.5.6	8638
10.25.4.10	8637
10.25.5.9	8637
10.25.5.8	8637
10.25.4.7	8637
10.25.4.6	8637
10.25.4.11	8637
10.25.4.9	8637
10.25.5.10	8636
android.clients.google.com	6727
dictionary1.classic.reference.com	4030
clients2.google.com	2985
www.googleapis.com	2938
158.69.104.114	2779
c.data.mob.com	2574
www.googleadservices.com	1868
googleads.g.doubleclick.net	1767
graph.facebook.com	1576
goupdate.3g.cn	1560
bongobongo.tk	1474
babau.ml	1463
babau.gq	1461
7uy35p.tk	1448
play.googleapis.com	1435
video.fmnl3-1.fna.fbcdn.net	1420
connectivitycheck.gstatic.com	1394
clients4.google.com	1217
www.youtube.com	1161
captive.apple.com	992
log.apk.v-mate.mobi	992
198.35.26.112	893
www.google.com	882
clients3.google.com	873
android.bugly.qq.com	706
glu-apac.s3.amazonaws.com	678
abtest.goforandroid.com	675
tpc.googlesyndication.com	646
video.fmnl4-5.fna.fbcdn.net	645
mail.google.com	641
50 rows selected (244.202 seconds)

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 3 2018, 4:07 AM
Restricted Application added a project: Operations. · Mar 5 2018, 5:22 PM
fdans moved this task from Incoming to Radar on the Analytics board. · Mar 5 2018, 5:23 PM
BBlack added a subscriber: BBlack. · Mar 5 2018, 5:46 PM

The bottom line is that the value of uri_host is entirely up to the client, and therefore subject to client-side stupidity. It's legal (in all protocol senses) for a client to connect to our public IP address over HTTP or HTTPS (in the latter case, legitimately matching an SSL certificate for e.g. en.wikipedia.org), and then send an HTTP(S) request that looks like...

GET / HTTP/1.1
Host: i.really.like.doing.silly.things.example.org

... in which case we'll log a request with a uri_host value of i.really.like.doing.silly.things.example.org. In practice, for all the random unknown values that can be plugged in there, it's up to MediaWiki to reject such a request with an appropriate error (which, as recently discussed in another ticket, ends up being an error output page with HTTP status 200).
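
For illustration, a minimal sketch of such a client in Python (standard library only; the bogus hostname is the example one above). It opens a TLS connection whose SNI and certificate validation use en.wikipedia.org, then sends a Host header naming an unrelated domain; nothing on the wire stops it from doing so:

import http.client

# Open a TLS connection; SNI and certificate validation use en.wikipedia.org.
conn = http.client.HTTPSConnection("en.wikipedia.org", 443, timeout=10)

# Supplying an explicit Host header makes http.client skip its own, so the
# server sees only the bogus value below.
conn.request(
    "GET",
    "/",
    headers={"Host": "i.really.like.doing.silly.things.example.org"},
)
resp = conn.getresponse()
# The backend decides how to answer; per the discussion above, this may well
# be an error page served with HTTP status 200.
print(resp.status, resp.reason)
conn.close()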

There are valid reasons we cannot and should not enforce that the SSL-level hostname used for certificate matching equals the HTTP-level Host: header, the most obvious being that HTTP/2 can send requests for several of our hostnames over the same open connection (connection coalescing) when it sees that our cert wildcards and IP addresses match for all the hostnames involved. In any case, we'd still have the same problem for unencrypted HTTP requests, where there's nothing else to validate against.

One thing we could do is match against some regex of all the known legitimate domain names we own, but that is also costly and complicated versus just letting these errors fly through and hit MediaWiki as they are. Even if we did check legitimacy in this sense, we'd still owe the client an error response of some kind anyway (e.g. 404), so we'd still be logging whatever crazy hostnames they send us when we log the 404 response regardless.
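
To make that trade-off concrete, a purely illustrative sketch of such a check follows; the domain list in the regex is a hypothetical, incomplete stand-in, not the canonical inventory of domains we own:

import re

# Hypothetical allowlist; a real check would have to cover every domain we own.
LEGIT_HOST_RE = re.compile(
    r"""^(?:[a-z0-9-]+\.)*                                    # optional subdomain labels
        (?:wikipedia|wikimedia|wiktionary|wikidata|mediawiki)\.org$""",
    re.IGNORECASE | re.VERBOSE,
)

def is_legit_host(uri_host: str) -> bool:
    # Strip whitespace and any trailing dot (FQDN form) before matching.
    return bool(LEGIT_HOST_RE.match(uri_host.strip().rstrip(".")))

print(is_legit_host("en.wikipedia.org"))    # True
print(is_legit_host("graph.facebook.com"))  # False
print(is_legit_host("varnishcheck"))        # False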

All of that being said, a few of the values shown above warrant some explanation:

  • varnishcheck - This is internal healthcheck traffic. Arguably, it could/should be excluded from webrequest logging completely. Maybe we should make a ticket about that.
  • 198.35.26.96, 91.198.174.192, 208.80.153.224, 208.80.154.224 - These are our actual public IPs that clients connect to, so in some sense they're more legitimate values for a client to send than most of the rest of the list. However, requests with such a hostname don't do anything useful for the client in practice, so they are still a client-side error.
  • The rest are all junk that can be binned as general-case client stupidity of one kind or another. In particular, note that 10.25.0.0/16 is not actually one of our legitimate internal subnets, so those IPs have nothing to do with our infrastructure. If those IPs have any meaning at all, it's specific to that client's own internal infrastructure/configuration, and the client is erroneously leaking those (proxy?) IPs to us as Host-header values (see the sketch after this list).
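
A rough sketch of how one could bin the odd uri_host values above into these buckets; the IP set is illustrative and taken only from the values quoted in this task, not a complete inventory of our public addresses:

import ipaddress

# Illustrative only: the public IPs quoted in this task, not a full inventory.
OUR_PUBLIC_IPS = {"198.35.26.96", "91.198.174.192", "208.80.153.224", "208.80.154.224"}

def classify(uri_host: str) -> str:
    if uri_host == "varnishcheck":
        return "internal healthcheck"
    if uri_host in OUR_PUBLIC_IPS:
        return "our public IP (still a client-side error)"
    try:
        ip = ipaddress.ip_address(uri_host)
    except ValueError:
        return "junk hostname sent by the client"
    if ip.is_private:
        # e.g. 10.25.x.x: RFC 1918 space that is not one of our subnets, so it
        # can only be leaking from the client's own infrastructure.
        return "private IP leaked by the client"
    return "some other IP literal"

for host in ("varnishcheck", "208.80.154.224", "10.25.4.8", "graph.facebook.com"):
    print(host, "->", classify(host))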
Nuria added a subscriber: Nuria. · Mar 7 2018, 5:51 AM
This comment was removed by Nuria.
ema moved this task from Triage to Watching on the Traffic board. · Mar 7 2018, 4:45 PM
ema triaged this task as Normal priority. · Mar 29 2018, 8:27 AM