We are embarking on a project to create more event streams (e.g. MW content). While designing these streams, we are also revisiting our [[ https://docs.google.com/presentation/d/1xxnFcxFJQGfbxnlmgwzCGyS8ULG0clP2QHcZ2OeeL1c/edit#slide=id.p | Multi-DC Kafka (main) and stream processing patterns ]].
I've recently learned that Kafka 'stretch' clusters are feasible in situations where cross-DC latency is relatively low (< 100ms). From a brief conversation with @BBlack and @ayounsi, it sounds like our eqiad <-> codfw latency is on average around ~30ms, though there are no guarantees on this.
A Kafka Stretch cluster is a single Kafka Cluster spanning multiple DCs. Using a single Kafka cluster instead of 2 clusters with MirrorMaker in between greatly simplifies the architecture of Multi DC streaming and event driven applications. However, it may result in increased latency for Kafka clients, as well as some tricky broker failover scenarios if we are not careful.
This task is about evaluating the creation of a new Kafka Stretch cluster in eqiad and codfw, likely with the intention of it eventually replacing the Kafka main clusters. Since the current DC-ops hardware request deadline is **this Friday May 13**, if we do want to do this, we should decide before then so we can request hardware for it.
As I do research, I will update this task description with documentation about how a Kafka Stretch cluster will work.
---
Best docs I've found so far:
- https://www.confluent.io/blog/multi-region-data-replication/
  (We won't be able to use the 'Observer' feature, as that is in Confluent Platform, not Apache Kafka. Follower fetching is in Apache Kafka, though.)
- http://mbukowicz.github.io/kafka/2020/08/31/kafka-in-multiple-datacenters.html
- https://downloads.ctfassets.net/oxjq45e8ilak/6icgmmPFCHtsw7KzUy9cKj/309ef5db510976b9fe018f3b5b5380a7/Two_and_The_Half_Datacenters_Foundations_of_Multi_DC_Kafka_Devoops.pdf
Notes:
- Zookeeper should be run in more than 2 DCs, e.g. '2.5' DCs. Kafka brokers are stretched between 2 DCs, but Zookeeper is 'stretched' between 3 so there is a tiebreaker ZK instance. Perhaps we'd run at least one ZK node in ulsfo?
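  A '2.5 DC' ensemble might look like the following `zoo.cfg` fragment. The hostnames here are hypothetical, just to illustrate the shape: 2 voters per main DC plus a tiebreaker in ulsfo, so quorum (3 of 5) survives the loss of either main DC.
  ```
  # Hypothetical 5-node ensemble across eqiad, codfw, and ulsfo.
  # Quorum is 3, so either main DC can be lost as long as ulsfo is up.
  server.1=zk1001.eqiad.wmnet:2888:3888
  server.2=zk1002.eqiad.wmnet:2888:3888
  server.3=zk2001.codfw.wmnet:2888:3888
  server.4=zk2002.codfw.wmnet:2888:3888
  server.5=zk4001.ulsfo.wmnet:2888:3888
  ```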
- With follower fetching, consumers can read from partition replicas in their own DC.
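  Follower fetching (KIP-392) is enabled via the broker's replica selector plus a matching rack label on each broker and consumer. A sketch with a hypothetical `eqiad` rack value:
  ```
  # broker config
  broker.rack=eqiad
  replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

  # consumer config: consumers labeled with the same rack will fetch
  # from an in-sync replica in their own DC when one exists
  client.rack=eqiad
  ```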
- IIUC, producers must always produce to the leader of a partition, which lives on only one broker at a time, and therefore in only one DC at a time. This will introduce latency to some produce requests, especially if `acks=all`. With `acks=all`, all in-sync replicas must ACK the message before the Kafka producer client receives its ACK. If partition X has replicas on B1 (eqiad), B2 (codfw), and B3 (codfw), and I produce to partition X from codfw, my message will go: Producer (codfw) -> B1 (eqiad) -> B2 and B3 (codfw), and the ACKs travel all the way back again, crossing between DCs twice before the producer receives its ACK.
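  A toy model of that produce path, just to make the round trips concrete. The ~30ms cross-DC figure and the replica layout come from this task; the function and its latency numbers are illustrative, not measurements.
  ```python
  # Rough lower bound on acks=all produce latency in a stretch cluster:
  # producer -> leader, then the leader waits for its slowest in-sync
  # follower, then the ACK travels back along the same path.
  CROSS_DC_RTT_MS = 30  # approximate eqiad <-> codfw round trip (from this task)
  LOCAL_RTT_MS = 1      # same-DC round trip, assumed tiny

  def produce_latency_ms(producer_dc: str, leader_dc: str, follower_dcs: list[str]) -> int:
      producer_leg = CROSS_DC_RTT_MS if producer_dc != leader_dc else LOCAL_RTT_MS
      # the leader must wait for every follower; the slowest leg dominates
      follower_leg = max(
          CROSS_DC_RTT_MS if dc != leader_dc else LOCAL_RTT_MS for dc in follower_dcs
      )
      return producer_leg + follower_leg

  # The example above: leader B1 in eqiad, followers B2/B3 in codfw,
  # producer in codfw -- the DC boundary is crossed twice.
  print(produce_latency_ms("codfw", "eqiad", ["codfw", "codfw"]))  # → 60
  # Best case: producer, leader, and followers all co-located.
  print(produce_latency_ms("eqiad", "eqiad", ["eqiad", "eqiad"]))  # → 2
  ```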
- The leadership of a partition can change at any time, but will change most often (for us) during regular broker restarts. This means that regular rolling broker restarts will shuffle the cross-DC network connections. As each broker is restarted, producers to partition leaders on that broker will re-connect to whatever brokers are promoted to leaders for those partitions.
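  Kafka can undo this shuffling on its own: with automatic preferred leader election enabled, leadership drifts back to the first-listed (preferred) replica once a restarted broker is back in sync. A broker config sketch (these are real broker settings; the values shown are the Apache Kafka defaults):
  ```
  # periodically move leadership back to each partition's preferred replica
  auto.leader.rebalance.enable=true
  leader.imbalance.check.interval.seconds=300
  ```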
- Replica placement matters. The default 'rack' aware placement algorithm can be found here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment#KIP36Rackawarereplicaassignment-ProposedChanges
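  A simplified sketch of the core idea behind KIP-36 (not the exact algorithm in that KIP): order the brokers so consecutive entries alternate racks, then assign each partition's replicas from consecutive positions in that list, so replicas naturally spread across racks/DCs. The broker IDs and rack names below are hypothetical.
  ```python
  # Illustrative rack-aware replica placement, in the spirit of KIP-36.
  from collections import defaultdict
  from itertools import cycle, islice

  def rack_alternating_broker_list(brokers: dict[int, str]) -> list[int]:
      """Order brokers so consecutive entries sit on different racks."""
      by_rack = defaultdict(list)
      for broker_id, rack in sorted(brokers.items()):
          by_rack[rack].append(broker_id)
      iters = [iter(ids) for ids in by_rack.values()]
      ordered = []
      while iters:
          remaining = []
          for it in iters:  # take one broker from each rack per round
              broker = next(it, None)
              if broker is not None:
                  ordered.append(broker)
                  remaining.append(it)
          iters = remaining
      return ordered

  def assign_replicas(brokers: dict[int, str], num_partitions: int, rf: int) -> dict[int, list[int]]:
      """Assign each partition rf replicas from consecutive positions in the
      rack-alternating list, rotating the starting position per partition."""
      order = rack_alternating_broker_list(brokers)
      return {
          p: list(islice(cycle(order), p % len(order), p % len(order) + rf))
          for p in range(num_partitions)
      }

  # 4 hypothetical brokers, two per DC: every partition spans both DCs.
  brokers = {1: "eqiad", 2: "eqiad", 3: "codfw", 4: "codfw"}
  print(assign_replicas(brokers, 4, 3))
  ```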