
purged is not resilient to kafka main nodes going down
Open, Medium · Public

Description

Hello Traffic team,

Today rack C7 in codfw went down, taking kafka-main2003 with it. Some purged instances did not handle this well and alerted on Kafka consumer lag. The logs showed this error:

Nov 15 09:42:45 cp2028 purged[30904]: %4|1605433365.426|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2003.codfw.wmnet:9093/bootstrap]: ssl://kafka-main2003.codfw.wmnet:9093/2003: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests

purged seemed stuck in this state, and I had to manually restart it on the affected instances to make it recover.
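To find affected hosts without reading journals by hand, something like the following could flag the timeout lines. This is only a sketch: the field layout is inferred from the single log line quoted above, and the names `RdkafkaLog`, `parse_rdkafka_line` and `is_request_timeout` are made up for illustration, they are not part of purged or librdkafka.

```python
import re
from typing import NamedTuple, Optional


class RdkafkaLog(NamedTuple):
    """Fields of one librdkafka log entry (hypothetical helper type)."""
    level: int
    epoch: float
    code: str
    client: str
    broker: str
    message: str


# Assumed layout, taken from the one line quoted in this task:
#   %<level>|<epoch>|<code>|<client>| [thrd:<broker>]: <message>
_LINE = re.compile(
    r"%(?P<level>\d)\|(?P<epoch>\d+\.\d+)\|(?P<code>[A-Z_]+)\|(?P<client>[^|]+)\|"
    r"\s*\[thrd:(?P<broker>[^\]]+)\]:\s*(?P<message>.*)"
)


def parse_rdkafka_line(line: str) -> Optional[RdkafkaLog]:
    """Return the parsed entry, or None if the line does not match the layout."""
    m = _LINE.search(line)
    if m is None:
        return None
    return RdkafkaLog(
        level=int(m.group("level")),
        epoch=float(m.group("epoch")),
        code=m.group("code"),
        client=m.group("client"),
        broker=m.group("broker"),
        message=m.group("message"),
    )


def is_request_timeout(line: str) -> bool:
    """True when the line is a REQTMOUT entry like the one seen on cp2028."""
    entry = parse_rdkafka_line(line)
    return entry is not None and entry.code == "REQTMOUT"
```

Feeding it journal output line by line and counting REQTMOUT entries per broker would show which purged instances are still trying to talk to the dead node.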

I saw this issue with varnishkafka some time ago: kafka-jumbo100x hosts went hard down and left librdkafka's TCP connections stuck. I haven't seen it in a while though, and I can't recall if we added a specific setting to bypass it.
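For reference while digging, here is a sketch of the librdkafka properties that usually come up for dead TCP connections. The property names are from librdkafka's configuration reference, but the values are illustrative guesses, and nothing in this task confirms that any of them is what was set for varnishkafka or that they would fix purged. `CANDIDATE_SETTINGS` and `render_properties` are hypothetical names.

```python
# Candidate librdkafka properties to review. Values are illustrative only
# and have NOT been tested against purged or varnishkafka.
CANDIDATE_SETTINGS = {
    # Ask the kernel to send TCP keepalives (SO_KEEPALIVE) on broker sockets.
    "socket.keepalive.enable": "true",
    # Timeout for network requests, in milliseconds.
    "socket.timeout.ms": "10000",
    # Initial and maximum backoff before reconnecting to a broker.
    "reconnect.backoff.ms": "100",
    "reconnect.backoff.max.ms": "10000",
}


def render_properties(settings: dict) -> str:
    """Render settings as sorted key=value lines for easy diffing."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


if __name__ == "__main__":
    print(render_properties(CANDIDATE_SETTINGS))
```

Printing the rendered list makes it easy to compare against whatever the purged and varnishkafka configs currently set.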

Event Timeline

elukey created this task. · Sun, Nov 15, 10:32 AM
Restricted Application added a subscriber: Aklapper. · Sun, Nov 15, 10:32 AM
elukey updated the task description. · Sun, Nov 15, 10:33 AM

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:03:38Z] <cdanis> T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:10:34Z] <cdanis> restart some purgeds in ulsfo as well T267865 T267867

jijiki triaged this task as Medium priority. · Mon, Nov 16, 9:00 AM
Vgutierrez moved this task from Triage to Caching on the Traffic board. · Mon, Nov 16, 9:36 AM
Vgutierrez added subscribers: ema, Vgutierrez.