- Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - T220389
- Audit existing Kafka main producers/consumers and document their configuration and use cases - T220390
- Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) - T220391
- Upgrade and expand Kafka main cluster - T217359
Description
Status | Assigned | Task
---|---|---
Resolved | elukey | T220387 Transition Kafka main ownership from Analytics Engineering to SRE - (2018-2019 Q4 SRE Goal Tracking Task)
Duplicate | None | T220391 Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.)
Duplicate | None | T220389 Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years
Declined | None | T220390 Audit existing Kafka main producers/consumers and document their configuration and use cases
Resolved | herron | T217359 Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019.
Unknown Object (Task) | |
Unknown Object (Task) | |
Resolved | herron | T226274 (Need By: June 30) rack/setup/install kafka-main100[1-5]
Resolved | None | T225005 Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345]
Resolved | elukey | T288825 Rebalance kafka partitions in main-{eqiad,codfw} clusters
Event Timeline
One thing that we didn't discuss for this goal is Zookeeper. At the moment, several services use the conf100[4-6] hosts:
- Hadoop Yarn (leader election + rmstore)
- Hadoop HDFS (leader election)
- Kafka Main eqiad
- Kafka Jumbo
- Kafka logging
- Kafka Burrow (consumer lag metrics)
Zookeeper also holds the Kafka ACLs for topics, which are only used by Kafka Jumbo at the moment, afaik. The best thing to do, in my opinion, is to request new conf10XX-like hosts dedicated to analytics and move the Hadoop use case onto them. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.
For the purposes of this quarter goal it will be out of scope, but still something we should absolutely plan for.
> Zookeeper also holds the Kafka ACLs for topics, which are only used by Kafka Jumbo at the moment, afaik. The best thing to do, in my opinion, is to request new conf10XX-like hosts dedicated to analytics and move the Hadoop use case onto them. We can make it optional to avoid blocking this task, but if everybody agrees I'll add a subtask and start working on it.
Sounds like a reasonable approach to me, and I think the timing lines up well with planning for FY19/20 needs. A linked but separate tracking task would be preferable IMHO.
To keep archives happy: this happened in T217057. The Zookeeper clusters conf100[4-6] and conf200[1-3] now manage only Kafka-related configs :)
Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)