
Handling of connection reset by peer errors in DLQ
Open, Low, Public, 8 Estimated Story Points

Description

We need to figure out why we are getting read: connection reset by peer errors in the aws.structured-data.article-update-dead-letter.v1 topic (DLQ) and what to do about them, so that we don't experience data loss.

Acceptance criteria
Errors with connection reset by peer are properly handled, or the issue is fixed in code.

Test Strategy
Testing basically means monitoring the logs and the DLQ topic to make sure that those errors no longer appear and are properly handled.

Notes
Please note that there are many more error messages like this in the aws.structured-data.article-update-error.v1 queue.

Examples
Message example from DLQ:

{
	"name": "photocentres",
	"identifier": 2188094,
	"version": {
		"wikimedia_enterprise.general.schema.Version": {
			"identifier": 31806729,
			"comment": "nouveau système de gestion des anagrammes",
			"tags": [],
			"is_minor_edit": false,
			"is_flagged_stable": false,
			"has_tag_needs_citation": false,
			"scores": null,
			"editor": {
				"wikimedia_enterprise.general.schema.Editor": {
					"identifier": -1,
					"name": "....",
					"edit_count": 889101,
					"groups": [
						"bot",
						"*",
						"user",
						"autoconfirmed"
					],
					"is_bot": true,
					"is_anonymous": false,
					"is_admin": false,
					"is_patroller": false,
					"has_advanced_rights": false,
					"date_started": {
						"long": 1323192544000000
					}
				}
			},
			"diff": null,
			"number_of_characters": 0,
			"sizes": null,
			"event": null
		}
	},
	"previous_version": {
		"wikimedia_enterprise.general.schema.PreviousVersion": {
			"identifier": 30209316,
			"number_of_characters": 0
		}
	},
	"version_identifier": "",
	"url": "",
	"watchers_count": 0,
	"namespace": {
		"wikimedia_enterprise.general.schema.Namespace": {
			"name": "",
			"alternate_name": "",
			"identifier": 0,
			"description": "",
			"event": null
		}
	},
	"in_language": {
		"wikimedia_enterprise.general.schema.Language": {
			"identifier": "fr",
			"name": "",
			"alternate_name": "",
			"direction": "",
			"event": null
		}
	},
	"main_entity": null,
	"additional_entities": [],
	"categories": [],
	"templates": [],
	"redirects": [],
	"is_part_of": {
		"wikimedia_enterprise.general.schema.Project": {
			"name": "",
			"identifier": "frwiktionary",
			"url": "https://fr.wiktionary.org",
			"version": "",
			"date_modified": null,
			"in_language": null,
			"namespace": null,
			"sizes": null,
			"additional_type": "",
			"event": null
		}
	},
	"article_body": null,
	"license": [],
	"visibility": null,
	"event": {
		"wikimedia_enterprise.general.schema.Event": {
			"identifier": "37903a00-75d3-4b64-bd30-7a505c7c6e01",
			"type": "update",
			"date_created": {
				"long": 1679017304678565
			},
			"fail_count": 2,
			"fail_reason": "Post \"https://fr.wiktionary.org/w/api.php\": read tcp 30.0.65.86:43818->208.80.154.224:443: read: connection reset by peer"
		}
	}
}

Event Timeline

Felixejofre set the point value for this task to 8. Mar 24 2023, 3:56 PM
Protsack.stephan lowered the priority of this task from High to Medium. Apr 4 2023, 12:23 PM
Tim.abdullin changed the task status from Open to In Progress. Apr 5 2023, 11:26 AM
Tim.abdullin claimed this task.

I wonder if there's something serviceops can help us with. Post \"https://fr.wiktionary.org/w/api.php\": read tcp 30.0.65.86:43818->208.80.154.224:443: read: connection reset by peer looks like the connection was interrupted on the server side. Maybe there's too much load from us, or the connection time is too long in our HTTP client?

Hi, I have a couple of unknown terms here, like what a DLQ is, but a couple of notes:

  • If you ended up creating too much load, you'd be rate limited or even blocked. HTTP status codes would be either 403 or 429.
  • Connection reset by peer just means a TCP RST packet was sent on the connection for some reason. It can come from the upstream server or from something else along the way.
  • Our CDN (to which your code connects) allows very long-lived connections (given we support websockets), so, unless you connect for weeks on end, that probably isn't the reason for these RSTs.
  • On our side, I see no incident on the 17th of March 2023.
  • Given that your code apparently connects from the US Department of Defense address space, there's a higher than normal chance that there is some box along the way terminating the connection, for reasons I can only speculate about.

If you can provide us with some more information, maybe we can help a bit more in figuring this out, e.g. number of these errors per timespan, endpoints seeing this, etc.

A couple of solutions would be:

  • Catch the error and retry
  • Make sure that keep-alives are sent.
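
For illustration, a minimal Go sketch of those two suggestions (an assumption about how it could look, not the actual WME code; the package, function names, and timeouts are hypothetical): the dialer sends periodic TCP keep-alive probes, and a wrapper retries a request when the connection is reset by the peer.

package httpretry

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"syscall"
	"time"
)

// NewClient returns an http.Client whose dialer sends periodic TCP keep-alive
// probes, so long-lived idle connections are less likely to be silently
// dropped by a middlebox along the way.
func NewClient() *http.Client {
	dialer := &net.Dialer{
		Timeout:   10 * time.Second,
		KeepAlive: 30 * time.Second, // TCP keep-alive probe interval
	}
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			DialContext:     dialer.DialContext,
			IdleConnTimeout: 90 * time.Second,
		},
	}
}

// Do retries when the connection is reset by the peer. A fresh *http.Request
// is built on every attempt so that a POST body can be re-sent safely.
func Do(client *http.Client, newReq func() (*http.Request, error), attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := newReq()
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		// On Linux a reset surfaces as syscall.ECONNRESET wrapped inside a *url.Error.
		if !errors.Is(err, syscall.ECONNRESET) {
			return nil, err // some other failure, don't retry blindly
		}
		time.Sleep(time.Duration(i+1) * time.Second) // simple pause between attempts
	}
	return nil, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}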

Thanks @akosiaris for the quick reply and all the help.

So, you are consistently seeing an endpoint like the one above failing with connection reset by peer? I can look into the logs a bit more; could you tell me what your User-Agent is and the specific timestamps of those errors?

  • We are working on a solution where we are going to slowly increase the backoff time on each retry and increase the number of retries, hoping this will solve the problem.

Thanks for doing that. In case you weren't aware already, https://en.wikipedia.org/wiki/Exponential_backoff is the best solution for this.
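
As a rough illustration of that, a small Go sketch of exponential backoff with jitter (generic, not the actual WME retry code; base and max are hypothetical tuning knobs):

package httpretry

import (
	"math/rand"
	"time"
)

// backoff returns the wait before retry attempt n (0-based): the base delay
// doubles on each attempt, is capped at max, and random jitter is added so
// that many clients don't retry in lockstep.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base << uint(attempt) // base * 2^attempt
	if d <= 0 || d > max {     // d <= 0 guards against shift overflow
		d = max
	}
	// Pick a duration uniformly in [d/2, d].
	return d/2 + time.Duration(rand.Int63n(int64(d/2)+1))
}

With base = 1s and max = 60s, successive attempts would wait roughly 0.5-1s, 1-2s, 2-4s and so on, up to the cap.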

  • I have one additional question: you said that in case we create too much load, status codes will be 403 or 429. While we do gracefully handle 429 errors, we don't do that for 403s; in case of a 403 we just retry.
    • Our handling of 429 errors relies on the Retry-After header being present, and we back off for that particular time period. Can we do the same thing for 403 errors, or is there any other way you would recommend handling them?

Retry-After for 429s is ok, thanks for honoring that. A 403 will come with a clear message as to why, asking that a human conduct a follow-up action. Usually it's just shooting us an email and telling us what issue you face. If you see a ton of 403s, it means that you need to stop what you are doing and come talk to us; we probably were forced to block you manually. For a few 403s, it's more probable that you are trying to access some resource (e.g. a revision) that isn't available anymore (for reasons...). In those cases, it won't cause harm to retry a couple of times, but unlike a 429 it is almost surely not a transient error.
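
For completeness, a small sketch of the Retry-After handling described above (an assumption about how it might look in Go, not the actual WME handler; the function name and fallback are hypothetical):

package httpretry

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter returns how long to wait before retrying a 429 response. The
// Retry-After header may carry either a number of seconds or an HTTP date;
// if it is missing or unparsable, fall back to a caller-supplied default.
func retryAfter(resp *http.Response, fallback time.Duration) time.Duration {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return fallback
	}
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second
	}
	if t, err := http.ParseTime(v); err == nil {
		return time.Until(t)
	}
	return fallback
}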

  • Our user agent is WME/2.0 (https://enterprise.wikimedia.com/; wme_mgmt@wikimedia.org); unfortunately, because of the short TTL on our DLQ, right now I can't point you to the precise timestamps. The good thing is that for the last 4-5 days I haven't seen those messages anymore.
  • Thank you for mentioning the exponential backoff algorithm, it's really helpful.
  • Thanks for the thorough description of the 403 error flows. By a ton of 403s, you mean we get this response basically on every call, correct?

I'm going to move this ticket into the background for a little bit, as I don't see this happening anymore. I will monitor the DLQ and re-visit if any more errors come around, and will make sure to post the actual errors and timestamps here.

Protsack.stephan changed the task status from In Progress to Open. Apr 11 2023, 5:19 PM
JArguello-WMF lowered the priority of this task from Medium to Low. Aug 29 2023, 3:20 PM