
Ensure no dropped messages in eventlogging producers when stopping broker
Closed, ResolvedPublic13 Estimated Story Points

Description

Now that we've switched to kafka-python, we no longer see eventlogging processes die during broker restarts. That's good! But it seems that kafka-python drops a small number of messages when a broker restarts.

I tested increasing both retries and retry_backoff_ms for the Kafka producer, but I still saw a few retrying (0 attempts left). Error: <class 'kafka.errors.NotLeaderForPartitionError'> messages during a broker restart. kafka-python doesn't seem to log the final failed produce itself, but it does increment some internal metrics.

  1. kafka-python should log something here! Submit a fix upstream to do so.
  2. See if we can reproduce this outside of production. We might have to pipe quite a few messages through a processor to see this.
  3. If we can reproduce, find a solution. The fix could just be more producer tuning, but this also could be a bug in kafka-python.
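For reference, the producer tuning mentioned above looks roughly like this with kafka-python (a minimal sketch; the broker addresses, topic, and acks setting are illustrative placeholders, not our actual eventlogging config):

```python
# Sketch of the kafka-python producer tuning described above.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1001:9092', 'kafka1002:9092'],  # placeholder brokers
    retries=6,               # retry a failed produce request up to 6 times
    retry_backoff_ms=200,    # wait 200 ms between retry attempts
    acks='all',              # illustrative; wait for all in-sync replicas to ack
)
producer.send('eventlogging-valid-mixed', b'{"event": "..."}')  # placeholder topic
producer.flush()
```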

Event Timeline

Ottomata triaged this task as Medium priority.Aug 9 2016, 3:11 PM
Ottomata added a subscriber: elukey.

Am testing high-volume producing for eventbus in labs. Created this upstream issue: https://github.com/dpkp/kafka-python/issues/801

kafka-python 1.3.1 was released yesterday. It fixes a bug I found when running the sync producer in labs and stopping a broker: I no longer lose messages in sync mode during broker stops. However, the situation described above still happens with the production analytics eventlogging async producer, even with retries=6 and retry_backoff_ms=200.

I can't reproduce this in labs, but a quick look at logs on eventlog1001 makes it clear that it is still happening in production. It seems that produce requests already in flight exhaust their configured retries sending to the disconnected (and no longer leader) broker; subsequent produce requests succeed fine because the leadership metadata has been refreshed by then.

Not totally sure how to fix this since I can't reproduce it. I'd like to suggest trying out a newer version of confluent-kafka-python for the analytics eventlogging async Kafka producer. I've been trying it in labs, and I see a small performance boost from using confluent. I generally trust librdkafka, so hopefully using it for async produce won't fail us during broker restarts.
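For comparison, async produce with confluent-kafka-python looks roughly like this (a minimal sketch; the broker addresses, topic, and retry settings are illustrative placeholders, not the actual eventlogging wiring):

```python
# Sketch of an async produce via confluent-kafka-python (librdkafka).
from confluent_kafka import Producer

def on_delivery(err, msg):
    # librdkafka invokes this from poll()/flush(); final failures surface here
    if err is not None:
        print('Delivery failed: {}'.format(err))

producer = Producer({
    'bootstrap.servers': 'kafka1001:9092,kafka1002:9092',  # placeholder brokers
    'message.send.max.retries': 6,  # illustrative, mirrors the kafka-python tuning
    'retry.backoff.ms': 200,
})
producer.produce('eventlogging-valid-mixed', b'{"event": "..."}',  # placeholder topic
                 callback=on_delivery)
producer.poll(0)   # serve delivery callbacks
producer.flush()   # block until outstanding messages are delivered
```

One nice property here: unlike kafka-python as noted above, the delivery callback gives us an explicit hook to log the final failed produce.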

Thoughts?

Change 324745 had a related patch set uploaded (by Ottomata):
Manually specify Kafka api_version in kafka_clusters config

https://gerrit.wikimedia.org/r/324745

Change 324745 merged by Ottomata:
Manually specify Kafka api_version in kafka_clusters config

https://gerrit.wikimedia.org/r/324745

Change 324790 had a related patch set uploaded (by Ottomata):
Set api_version for eventlogging kafka consumers as well as producers

https://gerrit.wikimedia.org/r/324790

Change 324790 merged by Ottomata:
Set api_version for eventlogging kafka consumers as well as producers

https://gerrit.wikimedia.org/r/324790

Ottomata renamed this task from Ensure no dropped messages in analytics eventlogging processors when stopping broker to Ensure no dropped messages in eventlogging producers when stopping broker.Dec 1 2016, 8:30 PM

I added api_version as a param for the analytics async kafka-python producers, hoping to avoid possible dropped messages due to https://issues.apache.org/jira/browse/KAFKA-3547. I didn't expect this to actually help the analytics async case, and it didn't. Will try it for the eventbus 'sync' use case and see how it goes; the dropped-message problems are different in these two cases.
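For the curious, pinning the API version in kafka-python is a one-parameter change (a sketch; the (0, 9) value and broker address are illustrative):

```python
# Sketch: pinning api_version so kafka-python skips its broker version probe
# on connect, which is what trips KAFKA-3547 during restarts.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka1001:9092'],  # placeholder broker
    api_version=(0, 9),                    # illustrative; match the broker version
)
```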

Change 324801 had a related patch set uploaded (by Ottomata):
Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer

https://gerrit.wikimedia.org/r/324801

Change 324801 merged by Ottomata:
Revert min.insync.replicas to 1, set api_version for eventbus Kafka producer

https://gerrit.wikimedia.org/r/324801

I just applied the api_version change for the eventbus kafka-python producer. Good news: it eliminated the ArrayIndexOutOfBoundsException caused by https://issues.apache.org/jira/browse/KAFKA-3547. Bad news: it did nothing to help our dropped-produce-request problem.

I'm currently seeing "KafkaTimeoutError: Timeout after waiting for 2.0 secs." for several eventbus produce requests when I restart main Kafka brokers.

We could try increasing sync_timeout; maybe 2.0 seconds is just too short.

Or we could try https://gerrit.wikimedia.org/r/#/c/302372/ and async. Grasping at straws here :/

Mentioned in SAL (#wikimedia-operations) [2016-12-07T18:10:21Z] <ottomata> will be bouncing some main-eqiad kafka brokers to try to troubleshoot T142430

Change 325813 had a related patch set uploaded (by Ottomata):
Increase sync_timeout=10.0 for eventbus service kafka producer

https://gerrit.wikimedia.org/r/325813

Change 325813 merged by Ottomata:
Increase sync_timeout=10.0 for eventbus service kafka producer

https://gerrit.wikimedia.org/r/325813

Ah! Increasing sync_timeout to 10.0 worked! The default of 2.0 was too short. I also tried 5.0 and still saw the KafkaTimeoutErrors. With all services at 10.0, I saw 0 dropped messages during a broker restart. I believe this gives the kafka-python client enough time to refresh leadership metadata and actually retry against the new leader.
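For context, a synchronous produce in kafka-python amounts to waiting on the send future, so sync_timeout presumably maps onto the .get() timeout (a sketch; the topic and broker are placeholders, and eventlogging's actual wrapper differs):

```python
# Sketch of a synchronous produce with kafka-python: send() returns a future,
# and .get(timeout=...) blocks until the broker acks or the timeout expires.
from kafka import KafkaProducer
from kafka.errors import KafkaTimeoutError

producer = KafkaProducer(bootstrap_servers=['kafka1001:9092'])  # placeholder

try:
    future = producer.send('eventbus-topic', b'{"event": "..."}')  # placeholder topic
    metadata = future.get(timeout=10.0)  # was 2.0; too short to ride out failover
except KafkaTimeoutError:
    # With timeout=2.0, leader re-election during a broker restart can outlast
    # the wait, and the message is counted as dropped.
    raise
```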

Phew, this is looking WAY better.

I've just been trying out the confluent kafka producer for async analytics purposes. Looking pretty good too! I'll merge a few puppet changes (once gerrit comes back up), and if all goes well we may be able to close this ticket...

Change 325837 had a related patch set uploaded (by Ottomata):
Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor

https://gerrit.wikimedia.org/r/325837

Change 325837 merged by Ottomata:
Add hiera lookup for kafka_producer_scheme so we can try out kafka-confluent producer for analytics eventlogging processor

https://gerrit.wikimedia.org/r/325837

Change 325845 had a related patch set uploaded (by Ottomata):
Set kafka_producer_scheme to kafka-confluent for analytics eventlogging processor

https://gerrit.wikimedia.org/r/325845

Change 325845 merged by Ottomata:
Set kafka_producer_scheme to kafka-confluent for analytics eventlogging processor

https://gerrit.wikimedia.org/r/325845

Great. I feel good about this! I think the async kafka-confluent producer is going to serve us better for analytics, and sync kafka-python with a bigger sync_timeout is good for eventbus.

Ottomata set the point value for this task to 13.Dec 7 2016, 10:04 PM
Ottomata moved this task from In Progress to Done on the Analytics-Kanban board.

Change 325846 had a related patch set uploaded (by Ottomata):
Add python-confluent-kafka to eventlogging::dependencies

https://gerrit.wikimedia.org/r/325846

Change 325846 merged by Ottomata:
Add python-confluent-kafka to eventlogging::dependencies

https://gerrit.wikimedia.org/r/325846

Change 326148 had a related patch set uploaded (by Volans):
Revert "Add python-confluent-kafka to eventlogging::dependencies"

https://gerrit.wikimedia.org/r/326148

Change 326148 merged by Volans:
Revert "Add python-confluent-kafka to eventlogging::dependencies"

https://gerrit.wikimedia.org/r/326148

Please see T152771: we had to revert this last change to unbreak Puppet on stat1002.