purged is not resilient to kafka main nodes going down
Open, Medium, Public

Description

Hello Traffic team,

Today rack C7 in codfw went down, and kafka-main2003 with it. Some purged instances didn't like this and alerted on Kafka consumer lag. In the logs, the error was:

Nov 15 09:42:45 cp2028 purged[30904]: %4|1605433365.426|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2003.codfw.wmnet:9093/bootstrap]: ssl://kafka-main2003.codfw.wmnet:9093/2003: Timed out 0 in-flight, 1 retry-queued, 0 out-queue, 0 partially-sent requests

purged seemed stuck in this state, and I had to restart it manually on the affected instances to make it recover.
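One option would be for purged to detect this state itself instead of waiting for a manual restart. A minimal sketch of the idea in Go (assuming the confluent-kafka-go/librdkafka binding; the broker address, group id, topic name, and the 5-minute threshold below are hypothetical stand-ins, not purged's real configuration): if the consume loop sees nothing but poll timeouts for too long, it tears the consumer down and rebuilds it.

```go
package main

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// newConsumer builds a fresh consumer. Broker, group, and topic are
// hypothetical placeholders, not purged's actual configuration.
func newConsumer() (*kafka.Consumer, error) {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "kafka-main2001.codfw.wmnet:9093",
		"security.protocol": "ssl",
		"group.id":          "purged-example",
	})
	if err != nil {
		return nil, err
	}
	return c, c.SubscribeTopics([]string{"resource-purge"}, nil)
}

func main() {
	c, err := newConsumer()
	if err != nil {
		log.Fatal(err)
	}

	const stuckAfter = 5 * time.Minute // hypothetical threshold
	lastProgress := time.Now()

	for {
		msg, err := c.ReadMessage(10 * time.Second)
		if err == nil {
			lastProgress = time.Now()
			log.Printf("purge: %s", msg.Value) // real purged would act on the message
			continue
		}
		// ReadMessage returns a kafka.Error with code ErrTimedOut when
		// no message arrives within the poll timeout.
		if kerr, ok := err.(kafka.Error); ok && kerr.Code() == kafka.ErrTimedOut {
			if time.Since(lastProgress) > stuckAfter {
				log.Printf("no progress for %s, rebuilding consumer", stuckAfter)
				c.Close()
				if c, err = newConsumer(); err != nil {
					log.Fatal(err)
				}
				lastProgress = time.Now()
			}
			continue
		}
		log.Printf("kafka error: %v", err)
	}
}
```

This is roughly what the manual `systemctl restart purged` achieves, just automated and scoped to the Kafka connection rather than the whole process.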

I saw a similar issue with varnishkafka some time ago, namely kafka-jumbo100x hosts going hard down and leaving librdkafka's TCP connections stuck. I haven't seen it in a while, though, and I can't recall whether we added a specific setting to work around it.
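If the setting we're half-remembering is a librdkafka one, the usual candidates are the socket keepalive/timeout knobs. A hedged sketch follows: the property names are real librdkafka configuration options, but the values, and the assumption that they avoid the stuck-TCP-connection state, are untested; broker and group id are hypothetical.

```go
package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "kafka-main2001.codfw.wmnet:9093", // hypothetical
		"security.protocol": "ssl",
		"group.id":          "purged-example", // hypothetical
		// TCP keepalive: lets the kernel eventually notice a peer that
		// disappeared without sending FIN/RST (librdkafka default: false).
		"socket.keepalive.enable": true,
		// Time out in-flight requests sooner than the 60000 ms default.
		"socket.timeout.ms": 10000,
		// Cap the reconnect backoff (default 10000 ms) so we re-probe
		// the broker quickly once it is reachable again.
		"reconnect.backoff.max.ms": 5000,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()
}
```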

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:03:38Z] <cdanis> T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'

Mentioned in SAL (#wikimedia-operations) [2020-11-15T22:10:34Z] <cdanis> restart some purgeds in ulsfo as well T267865 T267867

jijiki triaged this task as Medium priority. Nov 16 2020, 9:00 AM
BBlack subscribed.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!