upstream connect error or disconnect/reset before headers. reset reason: overflow
Closed, Resolved, Public

Description

Update: 2022-03-10:

Please read:

If you ended up searching and finding this task because of a

upstream connect error or disconnect/reset before headers. reset reason: overflow

error message that you received, please know that this is a symptom of a number of different problems that can appear lower in our tech stack. It is not a cause in itself and doesn't help in pinpointing what is happening. It only tells us that something is probably happening. Chances are that the SRE team is already well aware and is coordinating on IRC in the #wikimedia-operations channel to investigate the actual problem.
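For anyone whose bot or script runs into this message: since it is an intermittent symptom rather than a cause, one possible way to cope is to treat it as transient and retry with exponential backoff. The following is only a minimal sketch under stated assumptions. It assumes the error arrives as a 5xx response whose body contains the message text (this task does not state the status code), and `fetch`, `is_overflow_error`, and `fetch_with_backoff` are hypothetical names made up for illustration, not part of any Wikimedia tooling.

```python
import time
from typing import Callable, Tuple

# The message text is taken from this task. Treating it as a transient,
# retryable condition is an assumption of this sketch.
OVERFLOW_MARKER = "reset reason: overflow"


def is_overflow_error(status: int, body: str) -> bool:
    """Return True if a response looks like the symptom described here.

    Assumes (not confirmed by this task) that it is served as a 5xx.
    """
    return status >= 500 and OVERFLOW_MARKER in body


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() until it stops returning the overflow symptom.

    fetch is a hypothetical caller-supplied function returning
    (status, body). The wait between attempts doubles each time:
    base_delay, 2 * base_delay, 4 * base_delay, and so on. If every
    attempt fails, the last failing response is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch()
        if not is_overflow_error(status, body):
            break
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body
```

The `sleep` parameter is injectable so that the retry logic can be exercised without real waiting. Retrying does nothing about the underlying causes; it only smooths over short incidents on the client side.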

Original task text - 2022-02-10

I'm getting this error repeatedly when trying to reach English Wikipedia pages.

Started on 10 February 2022, 19:29 UTC

upstream connect error or disconnect/reset before headers. reset reason: overflow

Event Timeline

It's intermittent for me, but I can reproduce.

Same for me, but it was going slow for some minutes prior.

Hello everyone, this is a known incident and it is being worked on by Wikimedia Foundation staff. Thank you for your patience.

Ladsgroup added a parent task: Restricted Task. Feb 10 2022, 8:00 PM
Ladsgroup claimed this task.

We think this is resolved now. If you still can't access the wikis, please let us know.

4nn1l2 reopened this task as Open. Edited Feb 11 2022, 6:48 PM
4nn1l2 subscribed.

Not resolved yet.

Saw it again at 18:45, 11 February 2022 (UTC) when trying to save an edit to https://commons.wikimedia.org/wiki/Commons:Administrators/Inactivity_section/Feb-Mar_2022.

I believe that one lasted only five minutes, and Wikipedia has been accessible since then, so I am closing this again. Thank you for understanding.

MZMcBride subscribed.

This issue is still happening.

Thanks for letting us know! We did indeed have this issue again for a few minutes earlier (intermittently between 02:36 and 03:00 UTC) but things are back to normal now. Sorry for the inconvenience, and more permanent solutions are in progress to keep this from happening again.

Happened again between 12:25 and 12:28 UTC but things are back to normal now.

Where is the documentation of this issue? The recurrences make me think it is not resolved.

Hi! This resurfaced during the weekend. It is not a single issue (despite appearances), rather the message "upstream connect error or disconnect/reset before headers. reset reason: overflow" that we see can be a symptom of various causes. It is not resolved yet and has a chance of reappearing over the next few days/weeks while we work on what we hope will be a set of more permanent solutions.

Hi. :-) Thank you for the update and clarification. I encountered this error again just now at https://en.wikipedia.org/wiki/Special:Contributions/OldEagles.

We had an occurrence of this a couple of hours ago; we will post more details soon.

As promised: https://wikitech.wikimedia.org/wiki/Incident_documentation/2022-03-10_MediaWiki_availability

Reading through the incident doc makes it obvious that the error pasted here is a symptom (one that is in fact multiple layers away from the actual causes of the issue) and not the cause. The same error can be emitted for other reasons as well and is not limited to the database layer malfunctioning.

I am going to alter the text of the task to reflect the above and lower priority due to that.

akosiaris lowered the priority of this task from High to Low. Mar 10 2022, 10:44 AM
akosiaris updated the task description.

In general, shouldn't phabricator tickets be one ticket = one cause? This one seems like it may be one ticket = many causes since it's such a generic error message. I've intermittently gotten this error message half a dozen times over the last year and I always assumed it was random-ish ops stuff that was the cause.

Perhaps this should be closed?

It was recently linked at enwiki village pump technical for an incident today, but I would suspect the original cause of this months-old ticket is unrelated.

Hi!

In general, shouldn't phabricator tickets be one ticket = one cause? This one seems like it may be one ticket = many causes since it's such a generic error message. I've intermittently gotten this error message half a dozen times over the last year and I always assumed it was random-ish ops stuff that was the cause.

While there is some good sense in trying to have one task tracking one issue/symptom, that's not always possible. One ticket tracking one cause is rarely doable, as issues/symptoms might (and often do) have more than one cause. This holds true for this task as well: there are multiple causes for this generic error message to appear. You don't need to assume it is random-ish ops stuff, though. When we are able, we publish incident reports under https://wikitech.wikimedia.org/wiki/Incident_status.

Perhaps this should be closed?

The only reason it stays open is so that people can find it easily and don't end up creating new ones (it does appear to be working) when this issue/symptom appears.

It was recently linked at enwiki village pump technical for an incident today, but I would suspect the original cause of this months-old ticket is unrelated.

Your suspicion is probably correct, but we are leaving this open for whenever this symptom reappears, not to track what caused it back then.

I just received one trying to look up https://en.wiktionary.org/wiki/stuff. Not reproducible it seems.

Aklapper added a subscriber: RLazarus.

@RLazarus: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

@akosiaris I would be inclined to close this task. Or would you prefer to leave it open as a pointer for users searching for the error message?

akosiaris claimed this task.

It's 1.5 years later and we haven't had much of this happening lately. Apparently the actions we've taken elsewhere in the infrastructure (requestctl et al.) have stopped the underlying causes that made this error appear to users. I'll resolve this, and hopefully we won't have to reopen it.