Now that we've switched to kafka-python, we no longer see eventlogging processes die during broker restarts. That's good! But it seems that kafka-python does drop a few number of messages when a broker restarts.
I tested increasing both retries and retry_backoff_ms for the Kafka producer, but I still saw a few retrying (0 attempts left). Error: <class 'kafka.errors.NotLeaderForPartitionError'> during a broker restart. kafka-python doesn't seem to log the final failed produce itself, but it does increase some internal metrics.
- kafka-python should log something here! Submit a fix upstream to do so.
- See if we can reproduce this outside of production. We might have to pipe quite a few messages through a processor to see this.
- If we can reproduce, find a solution. The fix could just be more producer tuning, but this also could be a bug in kafka-python.