Hello Traffic team,
Today rack C7 in codfw went down, taking kafka-main2003 with it. Some purged instances did not handle this well and alerted on Kafka consumer lag. The error in their logs was:
Nov 15 09:42:45 cp2028 purged[30904]: %4|1605433365.426|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2003.codfw.wmnet:9093/bootstrap]: ssl://kafka-main2003.codfw.wmnet:9093/2003: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests
purged seemed stuck in this state, and I had to manually restart it on the affected instances to make it recover.
I saw this issue with varnishkafka some time ago: kafka-jumbo100x hosts going hard down left librdkafka's TCP connections stuck. I haven't seen it in a while though, and I can't recall whether we added a specific setting to work around it.
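
As a starting point for discussion, here is a minimal sketch of the kind of librdkafka settings that might be relevant. `socket.keepalive.enable` and `socket.timeout.ms` are real librdkafka configuration properties, but the values below are illustrative assumptions, not what varnishkafka or purged currently use. I have not verified that either one fixes the stuck-connection behavior. The helper name `render_properties` is hypothetical and only prints the settings in `key=value` form.

```python
# Candidate librdkafka properties to evaluate (values are assumptions,
# chosen only for illustration; they are not taken from production config).
CANDIDATE_PROPERTIES = {
    # Enable TCP keepalives (SO_KEEPALIVE) on broker sockets, so a broker
    # that went hard down may be noticed at the TCP level.
    "socket.keepalive.enable": "true",
    # Timeout for network requests, in milliseconds.
    "socket.timeout.ms": "30000",
}


def render_properties(props):
    """Render a property mapping as sorted key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


if __name__ == "__main__":
    print(render_properties(CANDIDATE_PROPERTIES))
```

If one of these turns out to help, it would still need to be passed to librdkafka through whatever configuration mechanism purged exposes, which I have not checked.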