
Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task)
Closed, Resolved | Public | 0 Estimated Story Points

Description

  • Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - T220389
  • Audit existing Kafka main producers/consumers and document their configuration and use cases - T220390
  • Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) - T220391
  • Upgrade and expand Kafka main cluster - T217359

Event Timeline

herron triaged this task as Medium priority. Apr 8 2019, 2:02 PM
herron created this task.

One thing that we didn't discuss for this goal is Zookeeper. At the moment several services are using the conf100[4-6] hosts:

  • Hadoop Yarn (leader election + rmstore)
  • Hadoop HDFS (leader election)
  • Kafka Main eqiad
  • Kafka Jumbo
  • Kafka logging
  • Kafka Burrow (consumer lag metrics)

Zookeeper also holds the Kafka topic ACLs, which afaik are only used by Kafka Jumbo at the moment. The best thing to do in my opinion is to request new conf10XX-like hosts dedicated to analytics and move the Hadoop use case onto them. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.

One thing that we didn't discuss for this goal is Zookeeper.

For the purposes of this quarter's goal it will be out of scope, but it is still something we should absolutely plan for.

Zookeeper also holds the Kafka topic ACLs, which afaik are only used by Kafka Jumbo at the moment. The best thing to do in my opinion is to request new conf10XX-like hosts dedicated to analytics and move the Hadoop use case onto them. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.

Sounds like a reasonable approach to me, and I think the timing lines up well with planning for FY19/20 needs. A linked but separate tracking task would be preferable IMHO.

One thing that we didn't discuss for this goal is Zookeeper. At the moment several services are using the conf100[4-6] hosts:

  • Hadoop Yarn (leader election + rmstore)
  • Hadoop HDFS (leader election)
  • Kafka Main eqiad
  • Kafka Jumbo
  • Kafka logging
  • Kafka Burrow (consumer lag metrics)

Zookeeper also holds the Kafka topic ACLs, which afaik are only used by Kafka Jumbo at the moment. The best thing to do in my opinion is to request new conf10XX-like hosts dedicated to analytics and move the Hadoop use case onto them. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.

To keep the archives happy: this happened in T217057. The Zookeeper clusters conf100[4-6] and conf200[1-3] now manage only Kafka-related configs :)

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

elukey claimed this task.