Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task)
Open, Medium, Public, 0 Estimated Story Points

Description

  • Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - T220389
  • Audit existing Kafka main producers/consumers and document their configuration and use cases - T220390
  • Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) - T220391
  • Upgrade and expand Kafka main cluster - T217359

Event Timeline

herron triaged this task as Medium priority. Apr 8 2019, 2:02 PM
herron created this task.
Restricted Application added a subscriber: Aklapper. Apr 8 2019, 2:02 PM
elukey added a subscriber: elukey. Apr 8 2019, 2:38 PM

One thing that we didn't discuss for this goal is Zookeeper. At the moment multiple things are using conf100[4-6] hosts:

  • Hadoop Yarn (leader election + rmstore)
  • Hadoop HDFS (leader election)
  • Kafka Main eqiad
  • Kafka Jumbo
  • Kafka logging
  • Kafka Burrow (consumer lag metrics)

Zookeeper also holds the Kafka ACLs for topics, only used by Kafka Jumbo afaik at the moment. The best thing to do in my opinion is to request new conf10XX-like hosts only for analytics and move the Hadoop use case away. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.

One thing that we didn't discuss for this goal is Zookeeper.

For the purposes of this quarter goal it will be out of scope, but still something we should absolutely plan for.

Zookeeper also holds the Kafka ACLs for topics, only used by Kafka Jumbo afaik at the moment. The best thing to do in my opinion is to request new conf10XX-like hosts only for analytics and move the Hadoop use case away. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.

Sounds like a reasonable approach to me, and I think the timing lines up well with planning for FY19/20 needs. A linked but separate tracking task would be preferable IMHO.

herron moved this task from Backlog to Working on on the User-herron board. May 9 2019, 8:05 PM
herron updated the task description. Nov 1 2019, 1:48 PM
herron updated the task description. Nov 1 2019, 2:03 PM
elukey added a comment. Nov 5 2019, 4:12 PM

One thing that we didn't discuss for this goal is Zookeeper. At the moment multiple things are using conf100[4-6] hosts:

  • Hadoop Yarn (leader election + rmstore)
  • Hadoop HDFS (leader election)
  • Kafka Main eqiad
  • Kafka Jumbo
  • Kafka logging
  • Kafka Burrow (consumer lag metrics)

Zookeeper also holds the Kafka ACLs for topics, only used by Kafka Jumbo afaik at the moment. The best thing to do in my opinion is to request new conf10XX-like hosts only for analytics and move the Hadoop use case away. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.

To keep archives happy: this happened in T217057, and the Zookeeper clusters conf100[4-6] and conf200[1-3] now manage only Kafka-related configs :)