[Epic] Set up multi DC Kafka stretch cluster
Closed, Declined (Public)


In T314160: Q1:rack/setup/install kafka-stretch200[12], we received and racked hardware for a multi DC Kafka stretch cluster.

T307944: Evaluate Kafka Stretch cluster potential, and if possible, request hardware ASAP captures much of the info needed to set this up. This task will track the remaining work and serve as its parent task.

Discovery-Search may want to use this for T317045: [Epic] Re-architect the Search Update Pipeline, as it will be a way to produce large events in a multi DC Kafka cluster without messing with Kafka main.

There will be decisions to make along the way, e.g.

Event Timeline

A very cool name could be "Octopus" (tentacles spreading into multiple DCs at once).

Gehel renamed this task from Set up multi DC Kafka stretch cluster to [Epic] Set up multi DC Kafka stretch cluster. Oct 18 2023, 8:49 AM
Gehel triaged this task as Medium priority.
Gehel added a project: Epic.
Gehel moved this task from Incoming to Epics on the Data-Platform-SRE board.
Ottomata added subscribers: brouberol, Gehel.

After a discussion with @Gehel and @dcausse, there isn't a lot of interest in using Kafka stretch to enable active/passive double compute streaming. The goal was to have computed output be consistent, but the benefits don't outweigh the effort required to make it work (including doing manual failovers of streaming apps), at least for now.

The 4 servers (2 in each DC) can be repurposed for something else. CC @BTullis @dcausse @brouberol @bking

I think that we should:

  • repurpose kafka-stretch100[1-2] to add them to the analytics Hadoop cluster in eqiad (unless anyone has any better ideas for these).
  • repurpose kafka-stretch200[1-2] to create a dse-k8s Kubernetes cluster starting with two worker nodes in codfw.

@Gehel - what do you think?

I think @dcausse was hoping for a new Multi DC Kafka cluster that was not kafka main. One on which he could do fancier things (like topic compaction) without having to risk a MW outage (taking down MW Job queue, for instance).