upstream connect error or disconnect/reset before headers. reset reason: connection failure
Open, Medium, Public, PRODUCTION ERROR

Assigned To

None

Authored By

	Physikerwelt
	Nov 7 2023, 6:45 PM

Description

I think this deserves a new issue.

In T349347#9312756, @TheDJ wrote:

Found it. This turned into:

body: upstream connect error or disconnect/reset before headers. reset reason: connection failure
channel: Math
code: 503
message: Received invalid response from restbase.

https://logstash.wikimedia.org/goto/1bdf3a8631cfcbf2c48cf6f14a04ba78

So I think "upstream connect error or disconnect/reset before headers. reset reason: connection failure" is now properly logged. Unfortunately, we still don't know which server the response is from.

I guess that is not in the body. Probably in a response header?
https://logstash.wikimedia.org/goto/1ce4157c2634a2bd491e052b260aae3e

I did look at the mathoid stats this morning. It seems as if the mathoid instances are getting very little traffic. I would be surprised if they were overwhelmed by too many requests.
I can add more variables to the log, like headers, but I need help accessing it.
Is there a pattern observable in which input causes this type of error?
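
To make the idea of logging headers concrete, here is a minimal sketch in Python of how response headers could be filtered before being added to the log entry. The Math extension itself is PHP, so this only illustrates the logic. The function name `select_headers_for_log` and the choice of which headers to keep are assumptions, not existing code, and it is not confirmed that any of these headers actually identifies the responding server.

```python
def select_headers_for_log(headers, keep=("server", "via"), prefixes=("x-",)):
    """Return the subset of response headers that might identify the hop.

    Hypothetical helper: header names are compared case-insensitively,
    and the defaults for `keep` and `prefixes` are illustrative guesses.
    """
    selected = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered in keep or lowered.startswith(prefixes):
            selected[lowered] = value
    return selected


# Example with made-up header values:
example = select_headers_for_log(
    {"Server": "envoy", "X-Request-Id": "abc", "Content-Type": "text/plain"}
)
```

The result could then be logged next to the existing body, channel, code, and message fields.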

Related Objects

Status	Subtype	Assigned	Task
Resolved	PRODUCTION ERROR	Physikerwelt	T349347 MathRestbaseInterface: PHP Notice: Trying to get property '[property]' of non-object
Open	PRODUCTION ERROR	None	T350717 upstream connect error or disconnect/reset before headers. reset reason: connection failure

Event Timeline

Physikerwelt triaged this task as Medium priority. Nov 7 2023, 6:45 PM

Physikerwelt created this task.

Physikerwelt moved this task from Inbox to Blocked: needs help on the Math board. Nov 7 2023, 8:38 PM

"upstream connect error or disconnect/reset before headers" as plaintext is apparently a generic error that occurs when envoy tries to forward traffic to a service that is unavailable. See also T287983: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage

If we look in Grafana at Envoy (non-k8s) for traffic from restbase to mathoid, I do indeed see 500 errors:
https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=mathoid&from=now-24h&to=now

I see a small sustained rate of 503s. However, none of this really moves the needle compared to other services.
I do notice that a lot of connections from restbase to backends seem to stay open for a very long time, up to an hour. What if those connections are never freed or reused?

The errors are now mostly:

{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/unknown_error","method":"post","detail":"upstream connect error or disconnect/reset before headers. reset reason: connection termination","uri":"/en.wikiversity.org/v1/media/math/check/tex"}

Btw, only some of the errors are the raw text (directly from envoy):
upstream connect error or disconnect/reset before headers. reset reason: connection failure
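
Since the error arrives in two shapes (wrapped in a HyperSwitch JSON error, or as raw text), a small helper can normalise both to the reset reason so they can be counted together. The following Python sketch uses the two payloads quoted above as sample data. The function name `reset_reason` is hypothetical, and the regular expression is an assumption based only on the messages seen in this task.

```python
import json
import re
from collections import Counter

# Assumed pattern: the reason is the lowercase phrase after "reset reason: ".
RESET_REASON = re.compile(r"reset reason: ([a-z ]+)")


def reset_reason(payload):
    """Return the reset reason from a JSON error or raw-text body, else None."""
    text = payload
    try:
        decoded = json.loads(payload)
    except ValueError:
        decoded = None  # not JSON, treat the payload as raw text
    if isinstance(decoded, dict):
        text = str(decoded.get("detail", ""))
    match = RESET_REASON.search(text)
    return match.group(1).strip() if match else None


samples = [
    '{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/unknown_error",'
    '"method":"post","detail":"upstream connect error or disconnect/reset '
    'before headers. reset reason: connection termination",'
    '"uri":"/en.wikiversity.org/v1/media/math/check/tex"}',
    "upstream connect error or disconnect/reset before headers. "
    "reset reason: connection failure",
]
counts = Counter(reset_reason(s) for s in samples)
```

Counting by reason over a larger log export would show how the errors split between "connection termination" and "connection failure".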

I am trying to understand if there is any relation between the math input and the error or if the error just occurs randomly.

In T350717#9315596, @Physikerwelt wrote:

I am trying to understand if there is any relation between the math input and the error or if the error just occurs randomly.

I don't think it's the input. If I find a page that had this error (and the error is still there, because quite often they are gone already), then just purging takes care of it.
The only pattern I can sort of discern... is that a lot of this seems to be from the editing page and api.php... That indicates JS previewing or VE rendering. Now that would be the place where most parsing happens naturally, and I do see cases where other endpoints trigger it as well, so hard to tell...

This is an interesting one however: https://en.wikipedia.org//wiki/User:Jmkim_dot_com/TeX_Samples
Here it flip-flops between parser errors and connection errors whenever I purge. On each purge, one of these parser errors comes back as the 'connection error' instead, but which one changes with each purge.
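
The purge observation suggests a simple check for whether the error depends on the input: if the same TeX input is seen both succeeding and failing, the failure is probably transient rather than caused by the input. A minimal Python sketch of that check follows. The function name `classify_inputs` and the `(tex_input, ok)` record shape are assumptions, and the sample observations are made up for illustration, not taken from the logs.

```python
from collections import defaultdict


def classify_inputs(observations):
    """Group (tex_input, ok) pairs by input and label each input.

    "intermittent" means the same input both succeeded and failed,
    which points away from the input as the cause.
    """
    outcomes = defaultdict(set)
    for tex_input, ok in observations:
        outcomes[tex_input].add(bool(ok))

    labels = {}
    for tex_input, seen in outcomes.items():
        if seen == {True, False}:
            labels[tex_input] = "intermittent"
        elif seen == {False}:
            labels[tex_input] = "always failed"
        else:
            labels[tex_input] = "always ok"
    return labels


# Hypothetical observations, e.g. collected across several purges:
labels = classify_inputs([
    ("a^2", True),
    ("a^2", False),
    ("\\frac{1}{2}", False),
    ("x", True),
])
```

An input labelled "always failed" after only one observation says little, so this would only be meaningful with repeated requests per input.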

brennen moved this task from Backlog to Logs/Train on the User-brennen board. Dec 1 2023, 8:10 PM

brennen unsubscribed.

Krinkle moved this task from Untriaged to Nov 2023 on the Wikimedia-production-error board. Feb 8 2024, 4:22 PM
