Maniphest T207340

Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found)
Closed, Resolved · Public

Assigned To

Authored By

	Krinkle
	Oct 17 2018, 10:29 PM

Description

A user (details known to @Bawolff) reported in #wikimedia-tech that they are seeing broken images on the Wikimedia Commons home page.

Specifically, viewing https://commons.wikimedia.org/wiki/Main_Page in their browser (Firefox 62) led to a broken image (404 Not Found) for today's picture of the day, as requested from https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Sphinx_at_Universitetskaya_Embankment_(img1).jpg/500px-Sphinx_at_Universitetskaya_Embankment_(img1).jpg.

For most users (including @Joe, @faidon and myself) opening the above url results in the expected JPEG thumbnail of https://commons.wikimedia.org/wiki/File:Sphinx_at_Universitetskaya_Embankment_(img1).jpg.

But our logs confirm that the request was made to our servers, and did get a 404 Not Found response:

anonymised sample

{
  "hostname": "cp3041.esams.wmnet", ..
  "dt":"2018-10-17T20:..", ..
  "cache_status":"hit-local",
  "http_status":"404", ..
  "http_method":"GET",
  "uri_host":"upload.wikimedia.org",
  "uri_path":"/wikipedia/commons/thumb/e/ec/Sphinx_at_Universitetskaya_Embankment_%28img1%29.jpg/500px-Sphinx_at_Universitetskaya_Embankment_%28img1%29.jpg",
  "uri_query":"",
  "content_type":"text/html; charset=utf-8",
  "referer":"https://commons.wikimedia.org/",
  "user_agent":"Mozilla/5.0 (Windows .. rv:62.0) .. Firefox/62.0", ..
  "x_cache":"cp1081 pass, cp3033 hit/1, cp3041 miss"
 }

Upon closer inspection we realised this was a request for upload.wikimedia.org directed at a cache_text Varnish server. This is odd because the upload.wikimedia.org hostname is meant to resolve to a load balancer that directs to the cache_upload cluster of Varnish servers.

In other words, the request was made by the user's browser to the wrong IP address / connection.

The 404 Not Found response is correct and expected for the given request to the given server. The question is: Why was this request directed to text-lb?
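As a rough illustration of how such requests could be picked out of log records like the sample above, here is a minimal Python sketch. The set of cache_text hostnames is an assumption for illustration only (the task implies that cp3041 is a cache_text server, but the real host-to-cluster mapping is not part of the log record), and `is_misrouted` is a hypothetical helper, not an existing tool.

```python
import json

# Assumption: hostnames known to belong to the cache_text cluster. The real
# mapping lives outside the log record; cp3041 is taken from the sample above.
CACHE_TEXT_HOSTS = {"cp3041.esams.wmnet"}

# Hostnames that are only meant to be served by the cache_upload cluster.
UPLOAD_ONLY_DOMAINS = {"upload.wikimedia.org"}


def is_misrouted(record, text_hosts=CACHE_TEXT_HOSTS):
    """Return True if an upload-only hostname was served by a cache_text host."""
    return (
        record.get("uri_host") in UPLOAD_ONLY_DOMAINS
        and record.get("hostname") in text_hosts
    )


# Reduced copy of the anonymised sample, keeping only fully known fields.
sample = json.loads("""
{
  "hostname": "cp3041.esams.wmnet",
  "http_status": "404",
  "uri_host": "upload.wikimedia.org",
  "content_type": "text/html; charset=utf-8"
}
""")

# A record for a hostname that legitimately belongs on cache_text.
control = dict(sample, uri_host="commons.wikimedia.org")

sample_flagged = is_misrouted(sample)
control_flagged = is_misrouted(control)
```

The text/html content type on a thumbnail URL is a second hint that a text server answered, but the check above relies only on the host and cluster pairing.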

What we know so far is that it probably does affect multiple users (not an isolated incident) and the issue may've started around October 11, according to a Hive query for wmf_raw.webrequest, and a query on Turnilo for webrequest_sampled_128.

Results of the latter pictured below (src):

Details

Subject	Repo	Branch	Lines +/-
VCL: update 01-basic-caching.vtc to expect 421	operations/puppet	production	+1 -1
ATS: add {upload,maps}_domain to text_ats settings	operations/puppet	production	+2 -0
H2 coalesce 421: fix cache::canary hieradata as well	operations/puppet	production	+2 -0
VCL: Send 421 on apparently-faulty H/2 coalesce	operations/puppet	production	+22 -0


Related Objects

Mentioned In: T332024: GeoIP mapping experiments
T229434: All files giving 404 for some people

Event Timeline

Krinkle created this task. · Oct 17 2018, 10:29 PM

Restricted Application added a subscriber: Aklapper. · Oct 17 2018, 10:29 PM

jijiki subscribed. · Oct 17 2018, 10:31 PM

Further comments on IRC seem to suggest that it's related to Virgin Media's "WebSafe" proxy feature, which is supposed to block porn/malware:

[22:21]	Krenair	geniice, can you 'nslookup upload.wikimedia.org 8.8.8.8' ?
[22:23]	geniice	62.252.172.241

So maybe both requests get routed through the proxy, and the proxy passes them on via SNI, but because both hostnames resolve to the same IP address, HTTP/2 connection coalescing happens in Firefox (but not SeaMonkey).

Late edit because later comments referring to this as my explanation made me feel like I'm stealing credit: I'm paraphrasing and combining things that Krenair and _joe_ said. I didn't come up with this myself :)

Paladox subscribed. · Oct 17 2018, 10:33 PM

Krenair subscribed. · Oct 17 2018, 10:34 PM

Yeah I think @Bawolff's explanation seems plausible. If there's a DNS hijacking "transparent" proxy which returns the same IP for all hostnames, then this could potentially confuse H/2 coalescing for the UA, depending on how the proxy behaves.
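To make the suspected failure mode concrete, here is a simplified Python sketch of the connection-reuse check described in RFC 7540 section 9.1.1: a client may reuse an open connection for another hostname if that hostname resolves to the connection's IP address and the connection's certificate covers it. The function names, the `*.wikimedia.org` certificate entry, and the 192.0.2.x addresses are illustrative assumptions; only 62.252.172.241 comes from the IRC log above. Real browsers apply further conditions.

```python
def cert_covers(hostname, san_patterns):
    """Check a hostname against certificate names (single-label wildcards only)."""
    host_labels = hostname.lower().split(".")
    for pattern in san_patterns:
        pat_labels = pattern.lower().split(".")
        if len(pat_labels) != len(host_labels):
            continue
        if pat_labels[0] == "*":
            # A wildcard stands in for exactly one leftmost label.
            if pat_labels[1:] == host_labels[1:]:
                return True
        elif pat_labels == host_labels:
            return True
    return False


def may_coalesce(conn_ip, conn_cert_sans, new_host, resolve):
    """Simplified coalescing check: same IP and a certificate that covers new_host."""
    return conn_ip in resolve(new_host) and cert_covers(new_host, conn_cert_sans)


# Assumption: the certificate presented on the existing connection.
SANS = ["*.wikimedia.org"]

# Placeholder addresses (documentation range) for distinct text and upload IPs.
normal_dns = {
    "commons.wikimedia.org": {"192.0.2.1"},
    "upload.wikimedia.org": {"192.0.2.2"},
}
# A hijacking resolver that answers with the proxy's IP for every hostname.
proxy_dns = {host: {"62.252.172.241"} for host in normal_dns}

# The browser already holds a connection opened for commons.wikimedia.org.
normal = may_coalesce("192.0.2.1", SANS, "upload.wikimedia.org",
                      lambda h: normal_dns[h])
hijacked = may_coalesce("62.252.172.241", SANS, "upload.wikimedia.org",
                        lambda h: proxy_dns[h])
```

With distinct addresses the check fails and the browser opens a separate connection for upload.wikimedia.org. With the hijacked answers the check passes, so the upload request can ride on the connection that the proxy forwards to text-lb.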

If this is an opt-in choice of the user, we should take the opportunity to tell them it's a Bad Idea maybe, but we can't stop them.

One step we could take on our end to help mitigate these scenarios, though, is to use status code 421 (Misdirected Request) from https://tools.ietf.org/html/rfc7540#section-9.1.2 . We could put this in our VCL at the Varnish layer with regexes for the opposite-cluster names that we own. For example, cache_text could explicitly return an early 421 if it sees a request for ^(maps|upload)\.wikimedia\.org$, and cache_upload could do the same for any hostname other than those two that is within our set of 14 canonical domain names that are official on cache_text.
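The proposed rule can be sketched in Python as follows. This is not the VCL that was eventually deployed, only an illustration of the decision logic. The `TEXT_CANONICAL_DOMAIN` pattern is a hypothetical two-domain stand-in for the full set of canonical domains, and `status_for_misdirected` is an invented name.

```python
import re

# Pattern from the proposal above, with every literal dot escaped.
UPLOAD_CLUSTER_HOST = re.compile(r"^(maps|upload)\.wikimedia\.org$")

# Assumption: an illustrative subset of the canonical domains on cache_text.
TEXT_CANONICAL_DOMAIN = re.compile(r"(^|\.)(wikipedia|wikimedia)\.org$")


def status_for_misdirected(cluster, host):
    """Return 421 if host belongs to the opposite cluster, otherwise None."""
    host = host.lower()
    is_upload_host = bool(UPLOAD_CLUSTER_HOST.match(host))
    if cluster == "text":
        # cache_text should never see maps/upload hostnames.
        return 421 if is_upload_host else None
    if cluster == "upload":
        # cache_upload should never see hostnames we own on cache_text.
        if not is_upload_host and TEXT_CANONICAL_DOMAIN.search(host):
            return 421
        return None
    raise ValueError(f"unknown cluster: {cluster!r}")
```

Hostnames outside the domains we own fall through with `None`, so they keep whatever handling they had before; only names known to belong to the other cluster get the early 421.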

Bawolff added a subscriber: Geni. · Oct 18 2018, 1:57 AM

ema moved this task from Backlog to Caching on the Traffic board. · Oct 22 2018, 8:32 AM

jijiki triaged this task as Medium priority. · Oct 26 2018, 10:13 AM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. · Nov 5 2018, 10:06 PM

Krinkle moved this task from Watching to Perf issue on the Performance-Team (Radar) board. · Jan 21 2019, 12:53 AM

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).

Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.

Phabricator_maintenance moved this task from Backlog to Acknowledged on the SRE board. · Jan 26 2019, 10:32 PM

kchapman moved this task from Blocked (old) to Radar on the Performance-Team board. · Feb 19 2019, 9:47 PM

kchapman edited projects, added Performance-Team (Radar); removed Performance-Team.

CDanis subscribed. · Jul 31 2019, 3:17 PM

Change 526714 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] VCL: Send 421 on apparently-faulty H/2 coalesce

https://gerrit.wikimedia.org/r/526714

gerritbot added a project: Patch-For-Review. · Jul 31 2019, 3:52 PM

Change 526714 merged by BBlack:
[operations/puppet@production] VCL: Send 421 on apparently-faulty H/2 coalesce

https://gerrit.wikimedia.org/r/526714

Maintenance_bot removed a project: Patch-For-Review. · Jul 31 2019, 4:11 PM

Mentioned in SAL (#wikimedia-operations) [2019-07-31T16:14:21Z] <bblack> deploying VCL for H/2 coalesce 421 responses - T207340

Krinkle moved this task from Perf issue to Watching on the Performance-Team (Radar) board. · Jul 31 2019, 4:15 PM

The 421 code is deployed and seems to be working correctly, with a fairly small global average rate of somewhere <1 req/sec. This is the most-legitimate thing we can do with these misdirected requests, and it may actually fix some of them if the UA's own confusion is truly at fault, but it may not be able to help if some kind of DNS or HTTPS proxy interference is causing persistent issues. Maybe it will at least reduce error reporting and debugging confusion in such cases, though, as 421 is very specific to this issue (vs generic 404).

You can see them by selecting out just the 421 line in:

https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?orgId=1&panelId=2&fullscreen&from=now-1h&to=now

Closing for now, re-open if there's more to do here on the server side!
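For context on why a 421 can help where a 404 cannot: RFC 7540 section 9.1.2 allows a client that receives 421 to retry the request over a different connection. Below is a minimal Python sketch of that client-side behaviour, with simulated transports. `fetch_with_421_retry` and the three `send_*`-style callables are hypothetical, and the URL is a placeholder.

```python
def fetch_with_421_retry(url, send_existing, send_fresh):
    """Try the already-open connection first; on 421, retry once on a new one."""
    status, body = send_existing(url)
    if status == 421:
        # The server said this connection is wrong for the requested host.
        status, body = send_fresh(url)
    return status, body


# Simulated transports returning (status, body).
def coalesced_to_text(url):
    return 421, b""            # request landed on cache_text


def fresh_to_upload(url):
    return 200, b"jpeg bytes"  # new connection reached cache_upload


def fresh_still_proxied(url):
    return 421, b""            # interference persists on the new connection


URL = "https://upload.wikimedia.org/placeholder.jpg"  # placeholder path

recovered = fetch_with_421_retry(URL, coalesced_to_text, fresh_to_upload)
stuck = fetch_with_421_retry(URL, coalesced_to_text, fresh_still_proxied)
```

The first case corresponds to confusion that is purely on the UA side, where a fresh connection reaches the right cluster. The second corresponds to persistent DNS or proxy interference, where the retry lands on the wrong cluster again and the 421 only serves as a clearer error signal.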

CDanis mentioned this in T229434: All files giving 404 for some people. · Jul 31 2019, 5:34 PM

Change 526744 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] H2 coalesce 421: fix cache::canary hieradata as well

https://gerrit.wikimedia.org/r/526744

Change 526744 merged by BBlack:
[operations/puppet@production] H2 coalesce 421: fix cache::canary hieradata as well

https://gerrit.wikimedia.org/r/526744

Maintenance_bot removed a project: Patch-For-Review. · Jul 31 2019, 6:11 PM

Ladsgroup subscribed. · Jul 31 2019, 8:38 PM

Change 528436 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: add {upload,maps}_domain to text_ats settings

https://gerrit.wikimedia.org/r/528436

gerritbot added a project: Patch-For-Review. · Aug 6 2019, 12:07 PM

Change 528436 merged by Ema:
[operations/puppet@production] ATS: add {upload,maps}_domain to text_ats settings

https://gerrit.wikimedia.org/r/528436

Maintenance_bot removed a project: Patch-For-Review. · Aug 6 2019, 1:11 PM

Change 530345 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] VCL: update 01-basic-caching.vtc to expect 421

https://gerrit.wikimedia.org/r/530345

gerritbot added a project: Patch-For-Review. · Aug 15 2019, 11:03 AM

Change 530345 merged by Ema:
[operations/puppet@production] VCL: update 01-basic-caching.vtc to expect 421

https://gerrit.wikimedia.org/r/530345

Maintenance_bot removed a project: Patch-For-Review. · Aug 15 2019, 1:11 PM

Krinkle mentioned this in T332024: GeoIP mapping experiments. · Apr 14 2023, 4:40 AM
