upstream connect error or disconnect/reset before headers. reset reason: connection failure
Open, Medium, Public, PRODUCTION ERROR

Description

I think this deserves a new issue.

Found it. This turned into:

body: upstream connect error or disconnect/reset before headers. reset reason: connection failure
channel: Math
code: 503
message: Received invalid response from restbase.

https://logstash.wikimedia.org/goto/1bdf3a8631cfcbf2c48cf6f14a04ba78

Screenshot 2023-11-07 at 15.48.09.png (626×1 px, 83 KB)

So I think the error "upstream connect error or disconnect/reset before headers. reset reason: connection failure" is now properly logged. Unfortunately, we still don't know which server the response came from.

I guess that is not in the body. Probably in a response header?
https://logstash.wikimedia.org/goto/1ce4157c2634a2bd491e052b260aae3e

  • I did look at the mathoid stats this morning. It seems as if the mathoid instances are getting very little traffic. I would be surprised if they were overwhelmed by too many requests.
  • I can add more variables to the log, like headers, but I need help accessing it.
  • Is there a pattern observable in which input causes this type of error?
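For the point about logging more variables, here is a minimal sketch of what a richer log entry could contain. It is written in Python for illustration, not in the extension's own code. The function name `build_log_context` and the header allow-list are assumptions. `server` and `x-envoy-upstream-service-time` are headers Envoy commonly sets, but whether they are present on these particular responses, or identify the responding server, is not confirmed here.

```python
from typing import Mapping

# Hypothetical allow-list of response headers worth logging.
# Which of these (if any) identifies the responding server is an assumption.
HEADERS_TO_LOG = ("server", "via", "x-envoy-upstream-service-time", "x-request-id")


def build_log_context(status: int, body: str, headers: Mapping[str, str]) -> dict:
    """Collect status, a truncated body, and selected headers for a log entry."""
    # HTTP header names are case-insensitive, so normalise before matching.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "code": status,
        "body": body[:500],  # keep log entries bounded
        "headers": {name: lowered[name] for name in HEADERS_TO_LOG if name in lowered},
    }
```

Logging only an allow-list keeps cookies and other sensitive headers out of the logs.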

Event Timeline

Physikerwelt created this task.

"upstream connect error or disconnect/reset before headers", returned as plain text, is apparently a generic error that occurs when envoy tries to forward traffic to a service that is unavailable. See also: T287983: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage

If we look in grafana at Envoy (non-k8s), for traffic from restbase to mathoid, I do indeed see 500 errors:
https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=mathoid&from=now-24h&to=now

I see a small sustained rate of 503s. None of this really moves the needle compared to other services, however.
I do notice that a lot of connections from restbase to backends seem to stay open for a very long time, up to an hour. What if those connections are never freed or reused?

The errors are now mostly:

{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/unknown_error","method":"post","detail":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","uri":"/en.wikiversity.org/v1/media/math/check/tex"}

Btw, only some of the errors arrive as raw text (directly from envoy):
upstream connect error or disconnect/reset before headers. reset reason: connection failure
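Two shapes of this error appear in this task: the raw text from envoy, and the same text wrapped in the `detail` field of a HyperSwitch JSON error. To count reset reasons across both shapes, something like the following Python sketch could be used. The function name `extract_reset_reason` is hypothetical, and the regular expression is inferred only from the messages quoted here, not from envoy's documentation.

```python
import json
import re
from typing import Optional

# Pattern inferred from the messages quoted in this task (an assumption).
_RESET_RE = re.compile(
    r"upstream connect error or disconnect/reset before headers\.\s*"
    r"reset reason:\s*(?P<reason>[a-z ]+)"
)


def extract_reset_reason(body: str) -> Optional[str]:
    """Return the envoy reset reason from a raw or JSON-wrapped error body."""
    text = body
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # not JSON, so treat the body as raw text
    if isinstance(parsed, dict) and isinstance(parsed.get("detail"), str):
        text = parsed["detail"]
    match = _RESET_RE.search(text)
    return match.group("reason").strip() if match else None
```

A body that does not contain the envoy message yields `None`, so unrelated errors are not miscounted.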

I am trying to understand if there is any relation between the math input and the error or if the error just occurs randomly.
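One way to approach this is to group logged requests by TeX input and check whether the same input both succeeds and fails. Below is a sketch in Python, assuming the log can be reduced to (input, ok) pairs. The record shape and the function name `split_by_consistency` are hypothetical, not an existing log format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_by_consistency(records: Iterable[Tuple[str, bool]]) -> Dict[str, List[str]]:
    """Group TeX inputs by whether they always fail, always succeed, or do both."""
    outcomes = defaultdict(set)
    for tex, ok in records:
        outcomes[tex].add(ok)

    result: Dict[str, List[str]] = {"always_fail": [], "always_ok": [], "mixed": []}
    for tex, seen in sorted(outcomes.items()):
        if seen == {True}:
            result["always_ok"].append(tex)
        elif seen == {False}:
            result["always_fail"].append(tex)
        else:
            result["mixed"].append(tex)
    return result
```

Inputs that land in `mixed` would point toward a transient cause rather than the input itself. Inputs in `always_fail` would be candidates for an input-related problem, given enough samples.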

I don't think it's the input. If I find a page that had this error (and the error is still there, because quite often it is already gone), then just purging takes care of it.
The only pattern I can sort of discern is that a lot of this seems to come from the editing page and api.php, which indicates JS previewing or VE rendering. That is naturally where most parsing happens, though, and I do see cases where other endpoints trigger it as well, so it is hard to tell.

This is an interesting one however: https://en.wikipedia.org//wiki/User:Jmkim_dot_com/TeX_Samples
Here it flip-flops between parser errors and connection errors whenever I purge. On each purge, one of these parser errors comes back as the 'connection error' instead, but which one changes with every purge.