
ChieBot: Intermittent connection reset by peer errors
Open, High, Public, BUG REPORT

Description

My tool started getting intermittent "connection reset by peer" errors in the past few days. The tool automatically retries the connection after a 1-minute wait, up to 5 times, and most of the time that is not enough:

11:50:04 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:51:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:52:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:53:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:54:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
After 5 retries: System.Net.WebException: Unable to read data from the transport connection: Connection reset by peer. ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer. ---> System.Net.Sockets.SocketException: Connection reset by peer

This coincided with the recent migration to k8s, but I'm not sure if it is actually related. My tool has been running successfully for over a decade without encountering such problems.
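
For reference, the retry loop amounts to roughly the following (a minimal sketch, not ChieBot's actual code; the WebClient usage and names are assumptions):

using System;
using System.Net;
using System.Threading;

class RetrySketch
{
    // Retry up to 5 times, waiting 1 minute between attempts, as described above.
    static string FetchWithRetry(string url)
    {
        const int maxAttempts = 5;
        TimeSpan wait = TimeSpan.FromMinutes(1);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var client = new WebClient())
                    return client.DownloadString(url);
            }
            catch (WebException ex) when (attempt < maxAttempts)
            {
                Console.WriteLine($"Got '{ex.Message}', waiting for {wait}");
                Thread.Sleep(wait);
            }
        }
    }
}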

Event Timeline

dcaro renamed this task from "Intermittent connection reset by peer errors" to "ChieBot: Intermittent connection reset by peer errors". Jan 30 2024, 9:34 AM

Just stating for the record that connection refused/reset messages come from our edge caching layer, specifically from the TCP stack of the servers there, so this wouldn't be related to a migration to kubernetes (which is still only partial, btw).

dcaro triaged this task as High priority. Jan 30 2024, 4:13 PM
dcaro moved this task from Backlog to Workspace for triaging whenever needed on the Toolforge board.

> migration to kubernetes (which is still only partial, btw).

I was talking about the migration of my tool. It now runs 100% on k8s.

@Joe thanks! Yes, the issue is unrelated to the k8s workers; we were just hitting the per-IP limit of concurrent connections to the CDN.

On that note, @Leloiandudu, have you had any issues lately? We found another tool doing many requests in parallel, which might have affected all the tools running on the same worker node (including yours). Its maintainer is working on fixing that, so it should be solved soon.

If your tool makes requests in parallel, keep the number low: there can only be 500 concurrent connections per node, shared with many other tools, and once that limit is reached the CDN drops all requests for 300s.
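
One way to cap parallelism (a minimal sketch in C#, assuming the tool uses HttpClient; the limit of 10 is an arbitrary example, not a recommended value):

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledRequests
{
    // Cap in-flight requests well below the 500-per-node limit shared across tools.
    static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10);
    static readonly HttpClient Client = new HttpClient();

    static async Task<string> GetAsync(string url)
    {
        await Throttle.WaitAsync();
        try
        {
            return await Client.GetStringAsync(url);
        }
        finally
        {
            Throttle.Release();
        }
    }
}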

Also, a reminder in case you are not already doing it: use a proper User-Agent (https://meta.wikimedia.org/wiki/User-Agent_policy) when making requests to the wikis. That will help you avoid getting rate-limited and lets the wikis know the requests come from a tool (and which one).
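
For example, with HttpClient the header can be set once on the client (a sketch; the tool name, URL and contact address are placeholders, not real values):

using System.Net.Http;

class UserAgentSketch
{
    static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // Identify the tool and a way to contact the operator, per the policy linked above.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent",
            "ExampleBot/1.0 (https://example.org/tool-page; operator@example.org)");
        return client;
    }
}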

If you are still seeing issues, please let me know so I can debug and try to figure out which other tools might be doing many requests in parallel, and help them avoid getting everyone blocked 👍

Still getting these from time to time. Last time: 11 Apr 6:15:48 AM (UTC)