We currently (or will soon) have 3 distinct Kafka clusters:
- analytics-eqiad - The is the large Kafka cluster in the Analytics VLAN.
- main-eqiad - kafka100, used for 'main' (AKA production) Kafka messages in eqiad
- main-codfw- kafka200 used for 'main' (AKA production) Kafka messages in codfw. (As of 2015-01 this is blocked.)
Messages in the 2 main Kafka clusters will be mirrored to the analytics-eqiad Kafka cluster for ingestion into Hadoop and whatever other analytics things we got going. This will be done using Kafka MirrorMaker.
Over in T114191, @aaron and @ori describe wanting to consume in eqiad from main-codfw, and vice versa, for cache invalidation. In T117933, the Services team is working on change propagation for RESTBase (which I think is essentially cache invalidation). I believe they also want RESTBase clusters in both eqiad and codfw to consume the same messages. I.e. codfw RESTbase will need change propagation events from both eqiad and codfw main Kafka clusters (Services folks, correct me if I am wrong).
There are two ways to accomplish this:
- Cross Datacenter Consumption
- Kafka MirrorMaker
As suggested in T114191, consumers can consume messages from Kafka cross datacenters. This may work well enough for our use cases, but it is not the recommended way to do things.
For analytics-eqiad, using MirrorMaker to consume from both of the main Kafka clusters is easy. Producers will either produce main messages to a local main cluster, OR they will be analytics producers and produce directly to the analytics-eqiad cluster (e.g. varnishkafka). MirrorMaker will not consume from analytics-eqiad.
But since the 2 main Kafka clusters are functionally the same, in order to use MirrorMaker between them we have to ensure that they don't mirror topics in a feedback loop. If we want to mirror messages between main-eqiad and main-codfw so that consumers local to each DC only have to consume from their local main cluster, we have to make sure that producers in each DC produce to a topic that only receives messages for that DC. E.g. topic main-eqiad.mediawiki.page_edit would be produced to by producers in eqiad, and topic main-codfw.mediawiki.page_edit would be produced to by producers in codfw. MirrorMaker could then mirror main-eqiad.mediawiki.page_edit to main-codfw cluster, and main-codfw.mediawiki.page_edit to main-eqiad cluster. Using the same topic name in each cluster would lead to a feedback loop, where the MirrorMaker in each cluster would re-produce the mirrored message back to the original cluster.
https://engineering.linkedin.com/kafka/running-kafka-scale is a good reference for how LinkedIn does mirroring. In short, they have local and aggregate clusters in each DC. Producers in each DC produce to their local cluster. MirrorMaker in each DC consumes from all local clusters in each DC and produces to the DCs aggregate cluster. In this setup, the main local clusters don't have all messages from all DCs, but each DC's aggregate cluster does, which allows all Kafka consumption in each DC to be done locally from the DC's aggregate cluster, instead of reaching out across DCs.
We don't have plans to set up more Kafka clusters, so we need to figure out if we are ok with having consuming cross-DC, or if we should go the route of naming topics after DCs so we can mirror messages between main Kafka clusters.