purged is not resilient to kafka main nodes going down
Open, Medium, Public

Assigned To

None

Authored By

	elukey
	Nov 15 2020, 10:32 AM

Description

Hello Traffic team,

Today rack C7 in codfw went down, taking kafka-main2003 with it. Some purged instances did not handle this well and alerted on Kafka consumer lag. The logs showed this error:

Nov 15 09:42:45 cp2028 purged[30904]: %4|1605433365.426|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2003.codfw.wmnet:9093/bootstrap]: ssl://kafka-main2003.codfw.wmnet:9093/2003: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests

purged seemed stuck in this state, and I had to manually restart it on the affected instances to make it recover.
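To make it easier to spot which instances are affected, the REQTMOUT line above can be parsed mechanically. The following is only a rough sketch: the regex is derived from the single log line quoted in this task, and the function name `parse_reqtmout` is hypothetical, not something that exists in purged.

```python
import re
from typing import Optional

# Pattern derived from the one librdkafka log line quoted in this task.
# Other librdkafka versions may format the message differently.
REQTMOUT_RE = re.compile(
    r"\|REQTMOUT\|(?P<client>[^|]+)\|\s*"
    r"\[thrd:(?P<thread>[^\]]+)\]:\s*"
    r"(?P<broker>\S+):\s*Timed out\s+"
    r"(?P<in_flight>\d+) in-flight,\s*"
    r"(?P<retry_queued>\d+) retry-queued,\s*"
    r"(?P<out_queue>\d+) out-queue,\s*"
    r"(?P<partially_sent>\d+) partially-sent requests"
)

COUNTERS = ("in_flight", "retry_queued", "out_queue", "partially_sent")


def parse_reqtmout(line: str) -> Optional[dict]:
    """Return broker and request counters for a REQTMOUT line, else None."""
    match = REQTMOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the four request counters from strings to integers.
    for name in COUNTERS:
        fields[name] = int(fields[name])
    return fields
```

Feeding it the journal output of a cache host would give a list of brokers that purged is timing out against, which could then drive an alert or a targeted restart.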

I have seen this issue with varnishkafka some time ago, namely kafka-jumbo100x hosts going hard down and leaving librdkafka's TCP connections stuck. I haven't seen it in a while though, and I can't recall if we added a specific setting to bypass it.
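For reference, these are some librdkafka properties that relate to dead or stuck broker connections. This is a sketch to start the discussion, not a known fix: the property names are real librdkafka settings, but the values are purely illustrative, nothing here has been tested against this failure, and I have not checked what purged or varnishkafka currently set. The helper `render_properties` is hypothetical.

```python
# Candidate librdkafka properties to review. Values are illustrative
# assumptions, not recommendations and not what purged uses today.
CANDIDATE_SETTINGS = {
    # Enable TCP keepalives (SO_KEEPALIVE) on broker sockets.
    "socket.keepalive.enable": True,
    # Timeout for network requests, in milliseconds.
    "socket.timeout.ms": 10000,
    # Initial and maximum backoff before reconnecting to a broker.
    "reconnect.backoff.ms": 100,
    "reconnect.backoff.max.ms": 10000,
}


def render_properties(settings: dict) -> str:
    """Render settings as sorted key=value lines, with lowercase booleans."""
    lines = []
    for key in sorted(settings):
        value = settings[key]
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines)
```

Whether any of these would have prevented the stuck state seen today is an open question; the dictionary is only meant as a checklist to compare against the running configuration.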

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		ayounsi	T267865 Switch on rack C7 in codfw is down
		Open		None	T267867 purged is not resilient to kafka main nodes going down

Event Timeline

elukey created this task. Nov 15 2020, 10:32 AM

Restricted Application added a subscriber: Aklapper. Nov 15 2020, 10:32 AM

elukey updated the task description. Nov 15 2020, 10:33 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:03:38Z] <cdanis> T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:10:34Z] <cdanis> restart some purgeds in ulsfo as well T267865 T267867

RhinosF1 subscribed. Nov 15 2020, 10:23 PM

jijiki triaged this task as Medium priority. Nov 16 2020, 9:00 AM

Vgutierrez moved this task from Backlog to Caching on the Traffic board. Nov 16 2020, 9:36 AM

Vgutierrez added subscribers: ema, Vgutierrez.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BBlack moved this task from Backlog to Minor TODO on the Traffic-Icebox board. Apr 13 2022, 7:54 PM

BBlack moved this task from Minor TODO to Complicated on the Traffic-Icebox board. Dec 7 2022, 6:45 PM
