
Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found)
Closed, Resolved (Public)

Description

A user (details known to @Bawolff) reported in #wikimedia-tech that they are seeing broken images on the Wikimedia Commons home page.

Specifically, viewing https://commons.wikimedia.org/wiki/Main_Page in their browser (Firefox 62) led to a broken image (404 Not Found) for today's picture of the day, as requested from https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Sphinx_at_Universitetskaya_Embankment_(img1).jpg/500px-Sphinx_at_Universitetskaya_Embankment_(img1).jpg.

For most users (including @Joe, @faidon and myself), opening the above URL results in the expected JPEG thumbnail of https://commons.wikimedia.org/wiki/File:Sphinx_at_Universitetskaya_Embankment_(img1).jpg.

But our logs confirm that the request was made to our servers, and did get a 404 Not Found response:

anonymised sample
{
  "hostname": "cp3041.esams.wmnet", ..
  "dt":"2018-10-17T20:..", ..
  "cache_status":"hit-local",
  "http_status":"404", ..
  "http_method":"GET",
  "uri_host":"upload.wikimedia.org",
  "uri_path":"/wikipedia/commons/thumb/e/ec/Sphinx_at_Universitetskaya_Embankment_%28img1%29.jpg/500px-Sphinx_at_Universitetskaya_Embankment_%28img1%29.jpg",
  "uri_query":"",
  "content_type":"text/html; charset=utf-8",
  "referer":"https://commons.wikimedia.org/",
  "user_agent":"Mozilla/5.0 (Windows .. rv:62.0) .. Firefox/62.0", ..
  "x_cache":"cp1081 pass, cp3033 hit/1, cp3041 miss"
 }

Upon closer inspection we realised this was a request for upload.wikimedia.org directed at a cache_text Varnish server. This is odd because the upload.wikimedia.org hostname is meant to resolve to a load balancer that directs to the cache_upload cluster of Varnish servers.

In other words, the request was made by the user's browser to the wrong IP address / connection.

The 404 Not Found response is correct and expected for the given request to the given server. The question is: Why was this request directed to text-lb?

What we know so far is that it probably affects multiple users (not an isolated incident) and that the issue may have started around October 11, according to a Hive query for wmf_raw.webrequest and a query on Turnilo for webrequest_sampled_128.

Results of the latter pictured below (src):

capture.png (1×2 px, 101 KB)
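
The kind of per-day count described above can be approximated over raw log records. The following is only a minimal sketch, not the Hive or Turnilo query that was actually run. It assumes records shaped like the anonymised sample, that the `hostname` field names the server that answered, and that a set of cache_text hostnames is available; the `text_cache_hosts` parameter and both function names are hypothetical.

```python
from collections import Counter

# Hostnames that are meant to be served by the cache_upload cluster.
UPLOAD_HOSTS = {"upload.wikimedia.org"}


def is_misrouted(record, text_cache_hosts):
    """Return True if an upload request was answered by a cache_text host.

    `text_cache_hosts` is a hypothetical, caller-supplied set of server
    hostnames known to belong to the cache_text cluster.
    """
    return (
        record.get("uri_host") in UPLOAD_HOSTS
        and record.get("hostname") in text_cache_hosts
    )


def misrouted_per_day(records, text_cache_hosts):
    """Count misrouted requests per calendar day (first 10 chars of `dt`)."""
    counts = Counter()
    for record in records:
        if is_misrouted(record, text_cache_hosts):
            counts[record.get("dt", "")[:10]] += 1
    return dict(counts)
```

Grouping by the date prefix of `dt` is enough to see when such requests first appear, which is what the October 11 estimate relies on.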

Event Timeline

Further comments on IRC seem to suggest that it's related to Virgin Media's "WebSafe" proxy feature that's supposed to block porn/malware:

[22:21]	Krenair	geniice, can you 'nslookup upload.wikimedia.org 8.8.8.8' ?
[22:23]	geniice	62.252.172.241

So maybe both requests get routed through the proxy, and the proxy passes them on via SNI, but because both hostnames resolve to the same IP address, HTTP/2 connection coalescing happens in Firefox (but not SeaMonkey).

Late edit because later comments referring to this as my explanation made me feel like I'm stealing credit: I'm paraphrasing and combining things that Krenair and _joe_ said. I didn't come up with this myself :)

Yeah I think @Bawolff's explanation seems plausible. If there's a DNS hijacking "transparent" proxy which returns the same IP for all hostnames, then this could potentially confuse H/2 coalescing for the UA, depending on how the proxy behaves.
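
To make the suspected failure mode concrete, here is a rough sketch of the connection-reuse decision as described in RFC 7540 section 9.1.1: an existing connection may be reused for another hostname if that hostname resolves to the connection's IP address and the connection's certificate is valid for it. This is a simplification under stated assumptions, not Firefox's actual logic (browsers differ in the exact conditions), and the function names and the certificate name list in the usage note are illustrative.

```python
def cert_covers(hostname, cert_names):
    """Return True if any certificate name matches `hostname`.

    A wildcard such as "*.example.org" is taken to cover exactly one
    left-most label, so it matches "a.example.org" but not "a.b.example.org".
    """
    for name in cert_names:
        if name == hostname:
            return True
        if name.startswith("*."):
            head, _, rest = hostname.partition(".")
            if head and rest == name[2:]:
                return True
    return False


def may_coalesce(new_host, new_host_ips, conn_ip, conn_cert_names):
    """Decide whether a request for `new_host` may reuse an open connection.

    `new_host_ips` is the set of addresses DNS returned for `new_host`;
    `conn_ip` and `conn_cert_names` describe the already-open connection.
    """
    return conn_ip in new_host_ips and cert_covers(new_host, conn_cert_names)
```

If a hijacking resolver returns 62.252.172.241 for every hostname, then `may_coalesce("upload.wikimedia.org", {"62.252.172.241"}, "62.252.172.241", ["*.wikimedia.org"])` is true for a connection originally opened for commons.wikimedia.org, and the upload request ends up on whatever backend that connection reaches.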

If this is an opt-in choice of the user, we should take the opportunity to tell them it's a Bad Idea maybe, but we can't stop them.

One step we could take on our end to help mitigate these scenarios, though, is to use status code 421 (Misdirected Request) from https://tools.ietf.org/html/rfc7540#section-9.1.2. We could put this in our VCL at the Varnish layer with regexes for the opposite-cluster names that we own. For example, cache_text could explicitly return an early 421 if it sees a request for ^(maps|upload)\.wikimedia\.org$, and cache_upload could do the same for any hostname other than those two that falls within our set of 14 canonical domain names that are official on cache_text.
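
The proposed rule can be sketched in Python to show the decision itself. This is not the VCL that was eventually deployed. The function name, the cluster labels as strings, and the `text_domains` parameter are hypothetical, and the two domains in the usage note are an illustrative subset rather than the full canonical list. The regex escapes both dots so that only the literal hostnames match.

```python
import re

# Hostnames that belong to cache_upload, per the proposal above.
UPLOAD_CLUSTER_HOST = re.compile(r"^(maps|upload)\.wikimedia\.org$")


def misdirected_status(cluster, host, text_domains):
    """Return 421 if `host` belongs to the opposite cluster, else None.

    `text_domains` is a caller-supplied collection of canonical domains
    served by cache_text (hypothetical parameter).
    """
    host = host.lower()
    is_upload_host = UPLOAD_CLUSTER_HOST.fullmatch(host) is not None

    if cluster == "cache_text":
        return 421 if is_upload_host else None

    if cluster == "cache_upload":
        if is_upload_host:
            return None
        # Reject only names we own; unknown hostnames are left to the
        # normal request handling.
        for domain in text_domains:
            if host == domain or host.endswith("." + domain):
                return 421
        return None

    raise ValueError(f"unknown cluster: {cluster!r}")
```

With `text_domains = ("wikimedia.org", "wikipedia.org")`, a request for upload.wikimedia.org arriving at cache_text gets 421, as does a request for commons.wikimedia.org arriving at cache_upload. A client that receives 421 is allowed to retry the request on a different connection, which is why this response can help when the user agent's own coalescing is at fault.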

jijiki triaged this task as Medium priority. Oct 26 2018, 10:13 AM

Change 526714 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: Send 421 on apparently-faulty H/2 coalesce

https://gerrit.wikimedia.org/r/526714

Change 526714 merged by BBlack:
[operations/puppet@production] VCL: Send 421 on apparently-faulty H/2 coalesce

https://gerrit.wikimedia.org/r/526714

Mentioned in SAL (#wikimedia-operations) [2019-07-31T16:14:21Z] <bblack> deploying VCL for H/2 coalesce 421 responses - T207340

BBlack claimed this task.

The 421 code is deployed and seems to be working correctly, at a fairly small global average rate of under 1 req/sec. This is the most legitimate thing we can do with these misdirected requests, and it may actually fix some of them if the UA's own confusion is truly at fault, but it may not help if some kind of DNS or HTTPS proxy interference is causing persistent issues. It may at least reduce error-reporting and debugging confusion in such cases, though, as 421 is very specific to this issue (versus a generic 404).

You can see them by selecting out just the 421 line in:

https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&panelId=2&fullscreen&from=now-1h&to=now

Closing for now, re-open if there's more to do here on the server side!

Change 526744 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] H2 coalesce 421: fix cache::canary hieradata as well

https://gerrit.wikimedia.org/r/526744

Change 526744 merged by BBlack:
[operations/puppet@production] H2 coalesce 421: fix cache::canary hieradata as well

https://gerrit.wikimedia.org/r/526744

Change 528436 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add {upload,maps}_domain to text_ats settings

https://gerrit.wikimedia.org/r/528436

Change 528436 merged by Ema:
[operations/puppet@production] ATS: add {upload,maps}_domain to text_ats settings

https://gerrit.wikimedia.org/r/528436

Change 530345 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: update 01-basic-caching.vtc to expect 421

https://gerrit.wikimedia.org/r/530345

Change 530345 merged by Ema:
[operations/puppet@production] VCL: update 01-basic-caching.vtc to expect 421

https://gerrit.wikimedia.org/r/530345