Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage
Closed, ResolvedPublic

Description

Split from T287362: General site outage caused by ruwikinews usage of DPL. During the outage, users saw a raw, plaintext error message with the content "upstream connect error or disconnect/reset before headers. reset reason: overflow". Instead, they should have seen the standard "Wikimedia Error" HTML page.

I'm guess-tagging this as Traffic, please retag if I'm wrong.

Event Timeline

RLazarus subscribed.

FWIW, this error message comes from Envoy, when it's received a request but can't connect to the backend it's supposed to forward it to (source). My unsubstantiated guess is that it's the app server's Envoy, terminating TLS from the edge and trying to forward to the local Apache, which never answered because MediaWiki workers were saturated.

If that's the case, there's no way we could have served an error page from MediaWiki. In principle I think these error replies are configurable at Envoy, but instead I think we should probably adapt whatever logic exists at the edge to handle this situation. Before we inserted Envoy into the stack, ats-be would have been the one failing to connect to Apache, and (I think?) would have sent a nice error page -- now, instead, ats-be gets this error from Envoy but I propose it should rewrite it to the same nice error page.

This might be helpful: T113114: Make all wiki-facing error pages consistent
I think Apache has a page for 5xx errors; we can probably reuse that.

We already have a mechanism in place at the edge to return a standard error page whenever the origin server (Envoy in this case) returns a 4xx or 5xx error without a response body. So, to achieve what @RLazarus proposes here, we would have to make Envoy return no body, if possible.
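
To make the condition concrete, here is a minimal Python sketch of the decision described above. It is only an illustration: the real logic lives in the edge caches' configuration, and the `OriginResponse` type and `needs_synthetic_error_page` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class OriginResponse:
    """Hypothetical stand-in for what the edge receives from the origin."""
    status: int
    reason: str
    body: bytes


def needs_synthetic_error_page(resp: OriginResponse) -> bool:
    """Return True when the origin sent a 4xx/5xx error with an empty body.

    This mirrors the rule described in the comment above: only bodiless
    error responses are replaced with the standard error page.
    """
    is_error = 400 <= resp.status <= 599
    return is_error and len(resp.body) == 0


# Envoy's plaintext message counts as a body, so the edge passes it through.
raw = OriginResponse(503, "Service Unavailable",
                     b"upstream connect error or disconnect/reset before headers. "
                     b"reset reason: overflow")
print(needs_synthetic_error_page(raw))

# With an empty body, the edge would substitute the standard error page.
print(needs_synthetic_error_page(OriginResponse(503, "Service Unavailable", b"")))
```

Under this rule, the response seen during the outage is passed through unchanged because it carries a body, which is why making Envoy return no body would be enough to trigger the existing error page.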

@ema That makes sense, thanks for the pointer. @Joe mentioned on IRC that he might rather serve the error from Envoy -- sounds like that question is worth a little discussion.

Either way, when we get to the point of making the change, it looks like it's a pretty straightforward config setting to either override the body to a blank string or read it from a file. (If the latter, of course we'll need to remember to also override the content-type from text/plain.)
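
For illustration, the "read it from a file" variant could look roughly like the sketch below, which builds the fragment as a Python dict and prints it as JSON. This is not the configuration from any merged patch. The field names follow my reading of Envoy's documented `local_reply_config` schema and should be checked against the Envoy docs for the version in use. The file path and the runtime key are hypothetical placeholders.

```python
import json

# Hypothetical location of a static HTML error page on the host.
ERROR_PAGE_PATH = "/etc/envoy/error_page.html"


def local_reply_config(error_page_path: str) -> dict:
    """Build a local_reply_config fragment that serves an HTML file for 5xx replies.

    Assumption: one mapper matching every locally generated status >= 500.
    """
    return {
        "mappers": [
            {
                "filter": {
                    "status_code_filter": {
                        "comparison": {
                            "op": "GE",
                            # runtime_key is a hypothetical name for this sketch.
                            "value": {"default_value": 500,
                                      "runtime_key": "errorpage_min_status"},
                        }
                    }
                },
                # Replace the plaintext message with the contents of a file.
                "body": {"filename": error_page_path},
                # Emit the body as-is, but override the text/plain content type.
                "body_format_override": {
                    "text_format_source": {"inline_string": "%LOCAL_REPLY_BODY%"},
                    "content_type": "text/html; charset=UTF-8",
                },
            }
        ]
    }


print(json.dumps({"local_reply_config": local_reply_config(ERROR_PAGE_PATH)}, indent=2))
```

The `content_type` override is the part that is easy to forget: without it, the HTML file would presumably still be served as `text/plain`.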

Right, to clarify, I expected the error page to come from Varnish/caches. I think it's fine if it comes from envoy, but it should look the same as our normal error pages.

If we make envoy return a blank body, are we going to lose the error message? Ideally that would still be included somewhere.

Indeed. HTTP Status and Reason would be preserved and shown in the Varnish-generated error page though, so if Envoy sets Reason to something more specific than Internal Server Error, then we can reuse the error page logic and template used by Varnish and still get a useful error page. Otherwise we need to make Envoy read and render ./modules/mediawiki/templates/errorpage.html.erb.
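
As a rough sketch of the first option, the snippet below shows how a status and reason taken from the origin response could be carried into a generated page. The template is a hypothetical, heavily simplified stand-in: it is neither the Varnish template nor `errorpage.html.erb`. It also assumes the origin really did set a specific reason.

```python
import html
from string import Template

# Hypothetical, heavily simplified stand-in for the real error page template.
ERROR_PAGE = Template(
    "<!DOCTYPE html>\n"
    "<html><head><title>Wikimedia Error</title></head>\n"
    "<body><h1>Error</h1>\n"
    "<p><code>$status $reason</code></p>\n"
    "</body></html>\n"
)


def render_error_page(status: int, reason: str) -> str:
    """Render the page, escaping the reason since it comes from the origin."""
    return ERROR_PAGE.substitute(status=status, reason=html.escape(reason))


# The more specific the reason, the more useful the rendered page is.
print(render_error_page(503, "Service Unavailable"))
```

If the reason is always a generic string such as Internal Server Error, the page carries no more detail than the status code alone, which is the trade-off raised above.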

RLazarus added a parent task: Restricted Task.Mar 10 2022, 5:10 PM
akosiaris subscribed.

Removing SRE, has already been triaged to a more specific SRE subteam

I see two ways of fixing this:

  1. Make varnish consider the special case of envoy doing circuit breaking
  2. Make envoy return an empty response using the local_reply configuration https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/local_reply

Please note that the mechanism described by @ema has some issues covered by T324956

So your proposal would be to send back the Wikipedia error page directly from Envoy? Or to add separate special handling in VCL for Envoy errors?

Happy to go with the empty-body response from Envoy; just be aware that it will be reported as X-Cache: int until T324956 is fixed.

Change 901679 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] mesh.configuration: add support for custom error pages

https://gerrit.wikimedia.org/r/901679

Change 901767 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] tegola-vector-tiles: update to mesh 1.1

https://gerrit.wikimedia.org/r/901767

Change 901769 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] charts: upgrade to mesh 1.1

https://gerrit.wikimedia.org/r/901769

Change 902058 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] envoyproxy::tls_terminator: allow returning an HTML error page

https://gerrit.wikimedia.org/r/902058

Change 902058 merged by Giuseppe Lavagetto:

[operations/puppet@production] envoyproxy::tls_terminator: allow returning an HTML error page

https://gerrit.wikimedia.org/r/902058

Change 902238 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] appserver: send back a proper error page from envoy

https://gerrit.wikimedia.org/r/902238

Change 902238 merged by Giuseppe Lavagetto:

[operations/puppet@production] appserver: send back a proper error page from envoy

https://gerrit.wikimedia.org/r/902238

Change 902447 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki::tlsproxy::yaml_defs: add error page to envoy

https://gerrit.wikimedia.org/r/902447

Change 901767 abandoned by Giuseppe Lavagetto:

[operations/deployment-charts@master] tegola-vector-tiles: update to mesh 1.1

Reason:

Not needed anymore

https://gerrit.wikimedia.org/r/901767

Change 902447 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki::tlsproxy::yaml_defs: add error page to envoy

https://gerrit.wikimedia.org/r/902447

Change 901679 merged by jenkins-bot:

[operations/deployment-charts@master] mesh.configuration: add support for custom error pages

https://gerrit.wikimedia.org/r/901679

Change 901769 merged by jenkins-bot:

[operations/deployment-charts@master] charts: upgrade to mesh 1.1

https://gerrit.wikimedia.org/r/901769