
Set up cross DC topic mirroring for Kafka logging clusters
Open, Medium, Public

Description

https://wikitech.wikimedia.org/wiki/Kafka/Administration#MirrorMaker

Our multi DC kafka setup works like this:

  • Producers prefix topics with their datacenter name, e.g. eqiad.mediawiki.client.error
  • Kafka MirrorMaker in the local DC consumes from the remote DC all topics prefixed with the remote DC's name.

In this way, eqiad topics -> codfw, and codfw topics -> eqiad, and all datacenter prefixed topics are available in each local Kafka cluster. This allows consumers that need all messages from a stream to consume from a single local Kafka cluster.
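
For illustration only (not our actual config; the broker hostname is a placeholder), a consumer that wants the whole mediawiki.client.error stream can point at its local cluster and subscribe with a regex covering both DC prefixes, e.g. with kafka-python:

```python
from kafka import KafkaConsumer

# Placeholder broker address for the local cluster; the regex matches both
# DC-prefixed copies of the mediawiki.client.error stream.
consumer = KafkaConsumer(
    bootstrap_servers=["kafka-local.example.wmnet:9092"],  # placeholder hostname
    group_id="client-error-demo",
    auto_offset_reset="latest",
)
consumer.subscribe(pattern=r"^(eqiad|codfw)\.mediawiki\.client\.error$")

for record in consumer:
    print(record.topic, record.value)
```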

Kafka logging clusters were not set up this way, because they were expected to have only a single consumer: logstash. Logstash is configured (correct?) to consume from the Kafka logging clusters in each DC and produce to the local ELK stack.

We should set up proper cross DC mirroring for Kafka logging so that consumers don't have to consume from multiple remote Kafka clusters.

We can set up MirrorMaker to do this without altering the existing logstash consumer setup. Later (if desired), logstash can be reconfigured to consume only from the local DC.
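
Conceptually, MirrorMaker is just a consumer on the remote cluster paired with a producer on the local one that re-emits records under the same topic names. A rough sketch of the eqiad -> codfw direction (broker addresses are placeholders; in practice we'd run actual MirrorMaker, not this):

```python
from kafka import KafkaConsumer, KafkaProducer

# Consume every eqiad.* topic from the remote (eqiad) cluster...
remote = KafkaConsumer(
    bootstrap_servers=["kafka-logging-eqiad.example:9092"],  # placeholder
    group_id="mirrormaker-logging-sketch",
)
remote.subscribe(pattern=r"^eqiad\..*")

# ...and re-produce each record, topic name unchanged, into the local (codfw) cluster.
local = KafkaProducer(bootstrap_servers=["kafka-logging-codfw.example:9092"])  # placeholder

for record in remote:
    local.send(record.topic, value=record.value, key=record.key)
```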

Event Timeline

Note: Recent versions of Kafka have a new robust version of MirrorMaker, MirrorMaker 2. One day we should upgrade our Kafka clusters and use that, but I don't think we should do that for logging quite yet.

crusnov triaged this task as Medium priority. Mar 10 2021, 12:27 AM

Basically, a non-aggregate Kafka cluster (Kafka jumbo is 'aggregate') is the source of stream data. Here, a 'stream' refers to multiple topics, in our case every DC-prefixed topic*. E.g. eqiad.mediawiki.client.error and codfw.mediawiki.client.error are each part of the same stream.

Consumers should not have to be DC aware. If we had discovery DNS for Kafka clusters, consumers would not even have to be broker aware; they'd just choose a logical Kafka cluster name like kafka-logging.discovery.wmnet.
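
For example (kafka-logging.discovery.wmnet does not exist today, and the port is assumed), a consumer would then only need the logical cluster name:

```python
from kafka import KafkaConsumer

# Hypothetical: bootstrap via a discovery DNS name instead of per-DC broker hostnames.
consumer = KafkaConsumer(
    "eqiad.mediawiki.client.error",
    "codfw.mediawiki.client.error",
    bootstrap_servers=["kafka-logging.discovery.wmnet:9092"],  # assumed record and port
    group_id="dc-unaware-consumer",
)
for record in consumer:
    print(record.topic, record.value)
```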

Ideally, producers wouldn't have to be DC aware either, but we can't support that with our current mirroring setup and MirrorMaker 1. MirrorMaker 2 makes this possible with automatic topic prefixing.
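
For example (hypothetical future setup, assuming MirrorMaker 2's default replication policy, which prefixes mirrored topics with the source cluster alias), a producer would just write to an unprefixed topic on its local cluster:

```python
from kafka import KafkaProducer

# Hypothetical MM2 world: the producer is not DC aware and writes to an
# unprefixed topic; the mirrored copy on the remote cluster would show up
# as e.g. logging-eqiad.mediawiki.client.error.
producer = KafkaProducer(bootstrap_servers=["localhost:9092"])  # local cluster placeholder
producer.send("mediawiki.client.error", value=b'{"message": "example"}')
producer.flush()
```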

Ultimately, Kafka should be the 'data backbone' of the org. To do that, source Kafka clusters should always be multi DC. The Kafka logging setup is not currently multi DC.

*Prefixing by DC is not actually the correct thing to do. It would be more correct to prefix by the Kafka cluster name, e.g. 'logging-eqiad'.

In T276972, @Ottomata wrote:

Our multi DC kafka setup works like this:

  • Producers prefix topics with their datacenter name, e.g. eqiad.mediawiki.client.error
  • Kafka MirrorMaker in the local DC consumes from the remote DC all topics prefixed with the remote DC's name.

In this way, eqiad topics -> codfw, and codfw topics -> eqiad, and all datacenter prefixed topics are available in each local Kafka cluster. This allows consumers that need all messages from a stream to consume from a single local Kafka cluster.
Kafka logging clusters were not set up this way, because they were expected to have only a single consumer: logstash. Logstash is configured (correct?) to consume from the Kafka logging clusters in each DC and produce to the local ELK stack.

Correct, logstash consumes from the clusters in both DCs. At the moment, however, all producers are pointed at eqiad, so consuming only from eqiad will effectively DTRT.

Also, a bit of context: the Kafka logging clusters are set up the way they are not because Logstash was expected to be the only consumer, but because they are meant to spool logs. Producers (i.e. rsyslog) are routed to the "closest" active DC in normal situations to do active/active. The routing bit isn't implemented as of today (hence why all producers point to eqiad), but that's the plan.

We should set up proper cross DC mirroring for Kafka logging so that consumers don't have to consume from multiple remote Kafka clusters.
We can set up MirrorMaker to do this without altering the existing logstash consumer setup. Later (if desired), logstash can be reconfigured to consume only from the local DC.

To make sure I'm understanding correctly: the topics that would get mirrored eqiad <-> codfw are only the topics already prefixed with a DC name, and not all topics in kafka-logging?

Basically, a non-aggregate Kafka cluster (Kafka jumbo is 'aggregate') is the source of stream data. Here, a 'stream' refers to multiple topics, in our case every DC-prefixed topic*. E.g. eqiad.mediawiki.client.error and codfw.mediawiki.client.error are each part of the same stream.

Consumers should not have to be DC aware. If we had discovery DNS for Kafka clusters, consumers would not even have to be broker aware; they'd just choose a logical Kafka cluster name like kafka-logging.discovery.wmnet.

Ideally, producers wouldn't have to be DC aware either, but we can't support that with our current mirroring setup and MirrorMaker 1. MirrorMaker 2 makes this possible with automatic topic prefixing.

Ultimately, Kafka should be the 'data backbone' of the org. To do that, source Kafka clusters should always be multi DC. The Kafka logging setup is not currently multi DC.

*Prefixing by DC is not actually the correct thing to do. It would be more correct to prefix by the Kafka cluster name, e.g. 'logging-eqiad'.

I think we'll have to agree to disagree on some of these points (kafka-logging isn't multi DC, consumers/producers shouldn't be DC aware); prefixing topics with DC (or Kafka cluster) names seems like an anti-pattern to me. Specifically, the awareness of what needs to be consumed/produced lives in the topic prefix, instead of the consumer being able to consume from more than one Kafka cluster.

There are of course use cases where consumers should read only from their local DC Kafka; metrics from topics is one such case, where we'd deploy consumers in codfw/eqiad to calculate metrics only from their local Kafka and aggregate in dashboards.
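
As a sketch of that alternative (hostnames and the topic name are placeholders), a consumer that needs the full stream would open one connection per cluster and merge the results itself, rather than relying on mirrored, prefixed topics:

```python
from kafka import KafkaConsumer

# One consumer per cluster; the application merges the two streams.
def cluster_consumer(bootstrap):  # bootstrap hosts below are placeholders
    return KafkaConsumer(
        "mediawiki.client.error",  # example topic name
        bootstrap_servers=[bootstrap],
        group_id="multi-cluster-demo",
    )

eqiad = cluster_consumer("kafka-logging-eqiad.example:9092")
codfw = cluster_consumer("kafka-logging-codfw.example:9092")

# poll() returns {TopicPartition: [records]}; drain both clusters in turn.
while True:
    for consumer in (eqiad, codfw):
        for records in consumer.poll(timeout_ms=500).values():
            for record in records:
                print(record.topic, record.value)
```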

lmata moved this task from Inbox to Radar on the observability board.
lmata moved this task from Radar to Backlog on the SRE board.