Page MenuHomePhabricator

The EventLoggingKafkaLag alert indicates that the kafka consumer lag for event logging is over its threshold
Closed, ResolvedPublic

Description

We have an alert firing at the moment:

The number of events shows as lagging for each partition of the eventlogging_processor_client_side_00 topic has just jumped from around 36K to around 93K.

(EventLoggingKafkaLag) firing: Kafka consumer lag for event logging over threshold for past 15 min.
Runbook: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration
Graphs: https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&prometheus=ops&var-cluster=jumbo-eqiad&var-topic=All&var-consumer_group=eventlogging_processor_client_side_00
Alert: https://alerts.wikimedia.org/?q=alertname%3DEventLoggingKafkaLag

image.png (988×1 px, 151 KB)

Event Timeline

I'm going to restart this burrow-jumbo-eqiad service on kafkamon1003.

btullis@kafkamon1003:~$ systemctl status burrow-jumbo-eqiad.service 
● burrow-jumbo-eqiad.service - Burrow - Kafka jumbo-eqiad Consumer Offset Monitor
     Loaded: loaded (/lib/systemd/system/burrow-jumbo-eqiad.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2023-07-11 08:22:52 UTC; 24min ago
    Process: 2167715 ExecStart=/usr/bin/Burrow --config-dir /etc/burrow/jumbo-eqiad (code=exited, status=2)
   Main PID: 2167715 (code=exited, status=2)
        CPU: 5.165s

Jul 11 08:22:53 kafkamon1003 Burrow[2167715]:         github.com/Shopify/sarama/consumer.go:445 +0x36e
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]: github.com/Shopify/sarama.withRecover(0xc00092c040)
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]:         github.com/Shopify/sarama/utils.go:45 +0x49
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]: created by github.com/Shopify/sarama.(*consumer).ConsumePartition
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]:         github.com/Shopify/sarama/consumer.go:157 +0x305
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]: goroutine 374 [select]:
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]: github.com/linkedin/Burrow/core/internal/consumer.(*KafkaClient).partitionConsumer(0xc000248120, 0xb96d80, 0xc002c4c140)
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]:         github.com/linkedin/Burrow/core/internal/consumer/kafka_client.go:169 +0x1be
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]: created by github.com/linkedin/Burrow/core/internal/consumer.(*KafkaClient).startKafkaConsumer
Jul 11 08:22:53 kafkamon1003 Burrow[2167715]:         github.com/linkedin/Burrow/core/internal/consumer/kafka_client.go:226 +0x927

Mentioned in SAL (#wikimedia-analytics) [2023-07-11T08:54:09Z] <btullis> systemctl start burrow-jumbo-eqiad.service on kafkamon1003 for T341551

OK, the restart of that services failed:
Lot sof errors, but this one stood out.

Jul 11 08:53:47 kafkamon1003 Burrow[2170959]: fatal error: runtime: out of memory

Host rebooted by btullis@cumin1001 with reason: Rebooting to troubleshoot errors restarting burrow-kafka-jumbo

BTullis triaged this task as High priority.Jul 11 2023, 9:18 AM
BTullis moved this task from Incoming to In Progress on the Data-Platform-SRE board.

A fresh boot didn't fix the problem with the service starting. When burrow starts to monitor kafka-jumbo it runs out of memory after 7 seconds and quits.

Jul 11 08:59:50 kafkamon1003 systemd[1]: Started Burrow - Kafka jumbo-eqiad Consumer Offset Monitor.
Jul 11 08:59:51 kafkamon1003 Burrow[353]: Reading configuration from /etc/burrow/jumbo-eqiad
Jul 11 08:59:51 kafkamon1003 Burrow[353]: Cannot write PID file: open /var/run/burrow/burrow-jumbo-eqiad.pid: no such file or directoryBurrow failed at July 11, 2023 at 8:59am (UTC)
Jul 11 08:59:51 kafkamon1003 systemd[1]: burrow-jumbo-eqiad.service: Main process exited, code=exited, status=1/FAILURE
Jul 11 08:59:51 kafkamon1003 systemd[1]: burrow-jumbo-eqiad.service: Failed with result 'exit-code'.
Jul 11 09:01:40 kafkamon1003 systemd[1]: Started Burrow - Kafka jumbo-eqiad Consumer Offset Monitor.
Jul 11 09:01:40 kafkamon1003 Burrow[2621]: Reading configuration from /etc/burrow/jumbo-eqiad
Jul 11 09:01:47 kafkamon1003 Burrow[2621]: fatal error: runtime: out of memory

Change 937091 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] burrow: add LimitNOFILE=8192 to systemd's units

https://gerrit.wikimedia.org/r/937091

Change 937091 merged by Elukey:

[operations/puppet@production] burrow: add LimitNOFILE=8192 to systemd's units

https://gerrit.wikimedia.org/r/937091

Change 937144 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] burrow: use start-latest=true for the kafka consumer

https://gerrit.wikimedia.org/r/937144

Change 937144 merged by Elukey:

[operations/puppet@production] burrow: use start-latest=true for the kafka consumer

https://gerrit.wikimedia.org/r/937144

To keep archives happy - I think that burrow tried to pull a huge commit log for the eventlogging consumer group, that ended up saturating the memory. Since burrow was configured to start from earliest, then it was stumbling over the issue during every startup. For burrow we can definitely start from latest.