
purged issues while kafka brokers are restarted
Closed, ResolvedPublic

Description

Hi folks!

I am moving kafka-main eqiad brokers to PKI, and today I've restarted kafka-main1002 and kafka-main1003's brokers. I noticed some purged alerts for some cp nodes, all of them with a similar output (this is from cp5032):

Apr 05 08:06:16 cp5032 purged[2085976]: %6|1680681976.892|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1002.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1002.eqiad.wmnet:9093/1002: Disconnected (after 422825216ms in state UP)
Apr 05 08:06:17 cp5032 purged[2085976]: %3|1680681977.129|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1002.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1002.eqiad.wmnet:9093/1002: Connect to ipv6#[2620:0:861:102:10:64:16:37]:9093 failed: Connection refused (after 236ms in state CONNECT)
Apr 05 08:06:17 cp5032 purged[2085976]: %3|1680681977.372|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1002.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1002.eqiad.wmnet:9093/1002: Connect to ipv4#10.64.16.37:9093 failed: Connection refused (after 242ms in state CONNECT)
Apr 05 08:06:17 cp5032 purged[2085976]: %3|1680681977.618|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1002.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1002.eqiad.wmnet:9093/1002: Connect to ipv6#[2620:0:861:102:10:64:16:37]:9093 failed: Connection refused (after 234ms in state CONNECT)
Apr 05 08:06:17 cp5032 purged[2085976]: %3|1680681977.995|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1002.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1002.eqiad.wmnet:9093/1002: Connect to ipv4#10.64.16.37:9093 failed: Connection refused (after 242ms in state CONNECT)
Apr 05 08:06:19 cp5032 purged[2085976]: %3|1680681979.048|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1002.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1002.eqiad.wmnet:9093/1002: Connect to ipv6#[2620:0:861:102:10:64:16:37]:9093 failed: Connection refused (after 229ms in state CONNECT)
Apr 05 09:35:11 cp5032 purged[2085976]: %6|1680687311.705|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1003.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1003.eqiad.wmnet:9093/1003: Disconnected (after 72679977ms in state UP)
Apr 05 09:35:11 cp5032 purged[2085976]: %6|1680687311.768|FAIL|purged#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: kafka-main1003.eqiad.wmnet:9093: Disconnected (after 5327389ms in state UP)
Apr 05 09:35:11 cp5032 purged[2085976]: %3|1680687311.934|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1003.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1003.eqiad.wmnet:9093/1003: Connect to ipv4#10.64.32.90:9093 failed: Connection refused (after 229ms in state CONNECT)
Apr 05 09:35:11 cp5032 purged[2085976]: %3|1680687311.997|FAIL|purged#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: kafka-main1003.eqiad.wmnet:9093: Connect to ipv4#10.64.32.90:9093 failed: Connection refused (after 229ms in state CONNECT)
Apr 05 09:35:12 cp5032 purged[2085976]: %3|1680687312.193|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1003.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1003.eqiad.wmnet:9093/1003: Connect to ipv6#[2620:0:861:103:10:64:32:90]:9093 failed: Connection refused (after 258ms in state CONNECT)
Apr 05 09:35:12 cp5032 purged[2085976]: %3|1680687312.225|FAIL|purged#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: kafka-main1003.eqiad.wmnet:9093: Connect to ipv6#[2620:0:861:103:10:64:32:90]:9093 failed: Connection refused (after 228ms in state CONNECT)
Apr 05 09:35:12 cp5032 purged[2085976]: %3|1680687312.471|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1003.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1003.eqiad.wmnet:9093/1003: Connect to ipv4#10.64.32.90:9093 failed: Connection refused (after 242ms in state CONNECT)
Apr 05 09:35:12 cp5032 purged[2085976]: %3|1680687312.475|FAIL|purged#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: kafka-main1003.eqiad.wmnet:9093: Connect to ipv4#10.64.32.90:9093 failed: Connection refused (after 250ms in state CONNECT)
Apr 05 09:35:12 cp5032 purged[2085976]: %3|1680687312.913|FAIL|purged#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: kafka-main1001.eqiad.wmnet:9093: Connect to ipv6#[2620:0:861:103:10:64:32:90]:9093 failed: Connection refused (after 250ms in state CONNECT)
Apr 05 09:35:12 cp5032 purged[2085976]: %3|1680687312.939|FAIL|purged#consumer-1| [thrd:ssl://kafka-main1003.eqiad.wmnet:9093/bootstrap]: ssl://kafka-main1003.eqiad.wmnet:9093/1003: Connect to ipv6#[2620:0:861:103:10:64:32:90]:9093 failed: Connection refused (after 237ms in state CONNECT)
Apr 05 09:35:19 cp5032 purged[2085976]: %4|1680687319.299|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10070 ms without a successful response from the group coordinator (broker 1001, last error was Broker: Not coordinator): revoking assignment and rejoining group
Apr 05 09:40:31 cp5032 purged[2085976]: %4|1680687631.774|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10205 ms without a successful response from the group coordinator (broker 1003, last error was Local: Broker transport failure): revoking assignment and rejoining group

After a restart it all worked fine, but I am wondering if the purged code should be more resilient to consumer group changes (or maybe there is another problem).

Details

Related Changes in Gerrit:
Related Changes in GitLab:
  • Handle kafka errors following kafka best practices (repos/sre/purged!7, author fabfur, branch retry-kafka-errors → main)
  • Basic retry mechanism for specific kafka errors (repos/sre/purged!1, author brett, branch change-970332-retry-for-kafka-errors → main)

Event Timeline

I see stable tcp connections to kafka-main1002 and 1003 after the restart on cp5032, so I am inclined to say that it is not related to the PKI cert rollout:

elukey@cp5032:~$ sudo netstat -tunap | egrep '10.64.32.90|10.64.16.37'
tcp        0    182 10.132.0.16:21201       10.64.32.90:9093        ESTABLISHED 1716611/purged      
tcp        0      0 10.132.0.16:51181       10.64.32.90:9093        ESTABLISHED 1716611/purged      
tcp        0    182 10.132.0.16:59087       10.64.16.37:9093        ESTABLISHED 1716611/purged

Also I don't see any more errors in the purged logs.

Aklapper renamed this task from purged issues whule kafka brokers are restarted to purged issues while kafka brokers are restarted.Apr 5 2023, 1:04 PM

From a brief glance, those look like normal consumer reassignment messages. Probably shouldn't be alerts.

@Ottomata I thought so, yes, but I got alerted because consumer lag was building up on some nodes, and until I restarted purged the lag kept increasing. It seemed as if purged got stuck in some weird limbo and wasn't consuming anymore (but only on a few nodes, around 5; the majority kept working fine). I added all the relevant logs that I found; maybe there is a corner case in our code or something similar that we are not taking into account?

Interesting. Each purged instance is in a distinct consumer group, yes?

What kafka client is it using?

(Just curious, neither answer will clue me in as to why they stopped working :) )

purged uses github.com/confluentinc/confluent-kafka-go/kafka

We had two servers (cp1089 and cp3069) having purged issues over the weekend, after losing connection to the kafka cluster and logging:

Oct 28 05:19:11 cp1089 purged[609384]: %4|1698470351.007|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10000 ms without a successful response from the group coordinator (broker 1003, last error was Broker: Not coordinator): revoking assignment and rejoining group

cp1089 stopped processing messages until purged was restarted. purged@cp3069 didn't produce any kind of error but stopped processing messages at roughly the same time according to grafana (https://grafana.wikimedia.org/goto/uaXREi4Iz?orgId=1), and it didn't recover until purged was restarted.

Did something happen on your side during the weekend?

The purged kafka client could benefit from some kind of watchdog: for example, attempting to reconnect to the kafka cluster if event lag reaches 3600s (or any other value for that threshold), or simply exiting and letting systemd restart it.

For completeness, the hosts impacted by the same purged error this weekend were:

  • cp1078
  • cp1089
  • cp6005
  • cp3069

We need to work on the purged Kafka consumer. I've already spotted the issue in our codebase.

Change 970332 had a related patch set uploaded (by Fabfur; author: Fabfur):

[operations/software/purged@master] Basic retry mechanism for specific kafka errors

https://gerrit.wikimedia.org/r/970332

Change 970332 abandoned by Hashar:

[operations/software/purged@master] Basic retry mechanism for specific kafka errors

Reason:

Project has been migrated to Gitlab (T347623): https://gitlab.wikimedia.org/repos/sre/purged

https://gerrit.wikimedia.org/r/970332

BCornwall changed the task status from Open to Stalled.Jul 23 2024, 9:39 PM
BCornwall moved this task from Actively Servicing to Backlog on the Traffic board.

This has been triggered again on cp2038 and cp2041:

vgutierrez@cumin1002:~$ sudo -i cumin 'cp[2038,2041].codfw.wmnet' 'journalctl -u purged.service --since=-18h |grep "timed out"' 
2 hosts will be targeted:
cp[2038,2041].codfw.wmnet
OK to proceed on 2 hosts? Enter the number of affected hosts to confirm or "q" to quit: 2
===== NODE GROUP =====                                                                                                                                                           
(1) cp2041.codfw.wmnet                                                                                                                                                           
----- OUTPUT of 'journalctl -u pu...grep "timed out"' -----                                                                                                                      
Sep 05 17:01:19 cp2041 purged[3565730]: %4|1725555679.879|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10216 ms without a successful response from the group coordinator (broker 2003, last error was Broker: Not coordinator): revoking assignment and rejoining group                                
===== NODE GROUP =====                                                                                                                                                           
(1) cp2038.codfw.wmnet                                                                                                                                                           
----- OUTPUT of 'journalctl -u pu...grep "timed out"' -----                                                                                                                      
Sep 05 17:01:21 cp2038 purged[4019621]: %4|1725555681.014|SESSTMOUT|purged#consumer-1| [thrd:main]: Consumer group session timed out (in join-state steady) after 10498 ms without a successful response from the group coordinator (broker 2001, last error was Broker: Not coordinator): revoking assignment and rejoining group                                
================

Mentioned in SAL (#wikimedia-operations) [2024-09-06T04:24:56Z] <vgutierrez> restarting purged in cp2038 && cp2041 - T334078

Vgutierrez changed the task status from Stalled to In Progress.Sep 6 2024, 10:56 AM
Vgutierrez claimed this task.
Vgutierrez moved this task from Backlog to Actively Servicing on the Traffic board.

Mentioned in SAL (#wikimedia-operations) [2024-09-17T07:26:32Z] <vgutierrez> testing purged 0.23 in cp4038 - T334078

Mentioned in SAL (#wikimedia-operations) [2024-09-18T12:44:00Z] <vgutierrez> uploaded purged 0.23 to bullseye-wikimedia (apt.wm.o) - T334078

Mentioned in SAL (#wikimedia-operations) [2024-09-18T12:46:11Z] <vgutierrez> rolling upgrade to purged 0.23 in A:cp-ulsfo - T334078

@elukey I think I've found the root cause of this mess; this is the patch:

diff --git a/kafka.go b/kafka.go
index 4e495ce..b878db8 100644
--- a/kafka.go
+++ b/kafka.go
@@ -187,7 +187,7 @@ func (k *KafkaReader) manageEvent(event kafka.Event, c chan string) bool {
                if err != nil {
                        log.Printf("Unable to update promrdkafka metrics: %v\n", err)
                }
-       case *kafka.Error:
+       case kafka.Error:
                kafkaErrors.With(prometheus.Labels{typeLabel: e.Code().String()}).Inc()

Mentioned in SAL (#wikimedia-operations) [2024-09-19T09:44:51Z] <vgutierrez> testing purged 0.24 in cp4038 - T334078

Mentioned in SAL (#wikimedia-operations) [2024-09-19T16:48:08Z] <vgutierrez> updated to purged 0.24 in codfw, ulsfo and eqsin - T334078

Mentioned in SAL (#wikimedia-operations) [2024-09-20T07:18:05Z] <vgutierrez> rolling upgrade of purged on magru, drmrs, esams and eqiad - T334078

purged 0.24, shipping the patch mentioned above, has been deployed everywhere. Now we should expect the following behavior: purged will log and gather metrics for kafka errors. If the received error is considered fatal (only the "all brokers down" error is flagged as fatal at the moment), purged will exit and systemd should restart it.
If a non-fatal error triggers the same behavior as described in this task, we should add that error code to the list of fatal errors.

Next steps:

  • use a recent version of golang-github-confluentinc-confluent-kafka-go-dev that includes IsFatal(), instead of tracking the fatal errors ourselves in the purged code

purged 0.24 survived a "Consumer group session timed out (in join-state steady) after 10458 ms without a successful response from the group coordinator (broker 2001, last error was Broker: Not coordinator): revoking assignment and rejoining group" error this morning. I'll close this task as it seems the bug has been solved.