Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage
Open, LowPublic

Description

Splitting from T287362: General site outage caused by ruwikinews usage of DPL. During the outage, users saw a raw, plaintext error message with the content "upstream connect error or disconnect/reset before headers. reset reason: overflow". Instead, they should have seen the standard "Wikimedia Error" HTML page.

dddddd.PNG (102×844 px, 3 KB)

I'm guess-tagging this as Traffic, please retag if I'm wrong.

Event Timeline

RLazarus added a subscriber: RLazarus.

FWIW, this error message comes from Envoy, when it's received a request but can't connect to the backend it's supposed to forward it to (source). My unsubstantiated guess is that it's the app server's Envoy, terminating TLS from the edge and trying to forward to the local Apache, which never answered because MediaWiki workers were saturated.

If that's the case, there's no way we could have served an error page from MediaWiki. In principle I think these error replies are configurable at Envoy, but instead I think we should probably adapt whatever logic exists at the edge to handle this situation. Before we inserted Envoy into the stack, ats-be would have been the one failing to connect to Apache, and (I think?) would have sent a nice error page -- now, instead, ats-be gets this error from Envoy but I propose it should rewrite it to the same nice error page.

This might be helpful: T113114: Make all wiki-facing error pages consistent
I think Apache has a page for 5xx errors; we can probably reuse that.

We already have a mechanism in place at the edge that returns a standard error page whenever the origin server (Envoy in this case) returns a 4xx or 5xx error without a response body. So to achieve what @RLazarus proposes here, we would have to make Envoy return no body, if possible.
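
For illustration only, here is a rough Python sketch of that edge rule (the real logic lives in the cache configuration, not in Python). The helper name `needs_synthetic_error_page` is hypothetical, "no body" is assumed to mean a zero-length body, and the 503 status is an assumption for the example:

```python
def needs_synthetic_error_page(status: int, body: bytes) -> bool:
    """Return True when the edge should swap in the standard error page.

    Hypothetical model of the rule described above: only 4xx/5xx
    responses that arrive with an empty body are replaced.
    """
    is_error = 400 <= status <= 599
    return is_error and len(body) == 0


# Envoy's current plaintext reply carries a body, so the rule does not fire
# and the raw message reaches the user.
envoy_body = (
    b"upstream connect error or disconnect/reset before headers. "
    b"reset reason: overflow"
)
print(needs_synthetic_error_page(503, envoy_body))  # False

# With an empty body, the edge would serve the standard page instead.
print(needs_synthetic_error_page(503, b""))  # True
```

This is why an empty body from Envoy would be enough to trigger the existing behaviour without touching the edge.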

@ema That makes sense, thanks for the pointer. @Joe mentioned on IRC that he might rather serve the error from Envoy -- sounds like that question is worth a little discussion.

Either way, when we get to the point of making the change, it looks like it's a pretty straightforward config setting to either override the body to a blank string or read it from a file. (If the latter, of course we'll need to remember to also override the content-type from text/plain.)

Right, to clarify, I expected the error page to come from Varnish/caches. I think it's fine if it comes from Envoy, but it should look the same as our normal error pages.

If we make Envoy return a blank body, are we going to lose the error message? Ideally that would still be included somewhere.

Indeed. HTTP Status and Reason would be preserved and shown in the Varnish-generated error page though, so if Envoy sets Reason to something more specific than Internal Server Error, then we can reuse the error page logic and template used by Varnish and still get a useful error page. Otherwise we need to make Envoy read and render ./modules/mediawiki/templates/errorpage.html.erb.
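
To make the first option concrete, a rough Python sketch of an error page that carries the status and reason through. This is not the real template: `TEMPLATE`, `render_error_page`, and the markup are hypothetical stand-ins, and the reason string in the usage line is just an example value:

```python
from html import escape
from typing import Dict, Tuple

# Hypothetical stand-in for the real error page template; markup is illustrative.
TEMPLATE = (
    "<html><body><h1>Wikimedia Error</h1>"
    "<p>{status} {reason}</p></body></html>"
)


def render_error_page(status: int, reason: str) -> Tuple[Dict[str, str], str]:
    """Build response headers and an HTML body that surface the status and reason."""
    # The page is HTML, so the content type must not stay text/plain.
    headers = {"Content-Type": "text/html; charset=utf-8"}
    # Escape the reason so an unexpected value cannot inject markup.
    body = TEMPLATE.format(status=status, reason=escape(reason))
    return headers, body


headers, body = render_error_page(503, "Service Unavailable")
print(headers["Content-Type"])
print(body)
```

The more specific the reason Envoy sets, the more useful the rendered line is for debugging.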