
ChieBot: Intermittent connection reset by peer errors
Open, High, Public, BUG REPORT

Description

My tool started getting intermittent "connection reset by peer" errors in the past few days. The tool automatically retries the connection after a 1-minute wait, up to 5 times, and most of the time that is not enough:

11:50:04 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:51:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:52:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:53:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
11:54:05 PM Got 'Unable to read data from the transport connection: Connection reset by peer.', waiting for 00:01:00
After 5 retries: System.Net.WebException: Unable to read data from the transport connection: Connection reset by peer. ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer. ---> System.Net.Sockets.SocketException: Connection reset by peer

This coincided with the recent migration to k8s, but I'm not sure if it is actually related. My tool has been running successfully for over a decade without encountering such problems.
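
For reference, the retry loop amounts to roughly the following (a minimal sketch, not ChieBot's actual code; the WebClient usage and names are assumptions):

using System;
using System.Net;
using System.Threading;

class RetrySketch
{
    // Retry up to 5 times, waiting 1 minute between attempts, as described above.
    static string FetchWithRetry(string url)
    {
        const int maxAttempts = 5;
        TimeSpan wait = TimeSpan.FromMinutes(1);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using (var client = new WebClient())
                    return client.DownloadString(url);
            }
            catch (WebException ex) when (attempt < maxAttempts)
            {
                Console.WriteLine($"Got '{ex.Message}', waiting for {wait}");
                Thread.Sleep(wait);
            }
        }
    }
}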

Event Timeline

dcaro renamed this task from "Intermittent connection reset by peer errors" to "ChieBot: Intermittent connection reset by peer errors". Jan 30 2024, 9:34 AM

Just stating for the record that connection refused/reset messages come from our edge caching layer, specifically from the TCP stack of the servers there, so this wouldn't be related to a migration to kubernetes (which is still only partial, btw).

dcaro triaged this task as High priority. Jan 30 2024, 4:13 PM
dcaro moved this task from Backlog to Workspace for triaging whenever needed on the Toolforge board.

> migration to kubernetes (which is still only partial, btw).

I was talking about the migration of my tool. It now runs 100% on k8s.

@Joe thanks! Yes, the issue is unrelated to the k8s workers; we were just hitting the per-IP limit of concurrent connections to the CDN.

On that note, @Leloiandudu, have you had any issues lately? We found another tool doing many requests in parallel, which might have affected all the tools running on the same worker node (including yours). Its maintainer is working on fixing that, so it should be solved soon.

If your tool makes requests in parallel, keep the number low: there can only be 500 concurrent connections per node, shared with many other tools, and once that limit is reached the CDN drops all requests for 300s.
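
One way to cap parallelism (a minimal sketch in C#, assuming the tool uses HttpClient; the limit of 10 is an arbitrary example, not a recommended value):

using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class ThrottledRequests
{
    // Cap in-flight requests well below the 500-per-node limit shared across tools.
    static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10);
    static readonly HttpClient Client = new HttpClient();

    static async Task<string> GetAsync(string url)
    {
        await Throttle.WaitAsync();
        try
        {
            return await Client.GetStringAsync(url);
        }
        finally
        {
            Throttle.Release();
        }
    }
}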

Also, a reminder in case you are not already doing it: use a proper User-Agent (https://meta.wikimedia.org/wiki/User-Agent_policy) when making requests to the wikis. That will help you avoid getting rate-limited and lets the wikis know the requests come from a tool (and which one).
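
For example, with HttpClient the header can be set once on the client (a sketch; the tool name, URL and contact address are placeholders, not real values):

using System.Net.Http;

class UserAgentSketch
{
    static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // Identify the tool and a way to contact the operator, per the policy linked above.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "User-Agent",
            "ExampleBot/1.0 (https://example.org/tool-page; operator@example.org)");
        return client;
    }
}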

If you are still seeing issues, please let me know so I can debug and try to figure out which other tools might be doing many requests in parallel, and help them avoid getting everyone blocked 👍

Still getting these from time to time. Last time: 11 Apr 6:15:48 AM (UTC)