
Provision new Kafka cluster(s) with security features
Closed, Resolved · Public · 0 Estimated Story Points

Description

The 'analytics-eqiad' Kafka cluster hardware is due to be refreshed. We also want to enable Kafka security features, move this large, beefy Kafka cluster out of the 'Analytics Cluster' / Analytics VLAN, and make it a fully productionized Kafka cluster available for use by production services. We may eventually merge the existing smaller main-* Kafka clusters into these new ones, but we don't have plans to do that within the next year.

This is the parent ticket for the provisioning, upgrade and migration plan for these new clusters.


Event Timeline

Talked with Luca today, and with the Analytics team, about goals planning. We have a timeline for this.

We need to replace the existing Kafka brokers, since they are out of warranty. We would also like to rename/move the 'analytics' Kafka cluster: it is no longer (and shouldn't be) just for 'analytics' purposes. This beefy cluster acts more like a general-purpose aggregate cluster, usable for both analytics and other services. We don't have the budget this FY to provision new Kafka brokers, and we don't have the bandwidth (at least in Q3) to upgrade Kafka, enable security features, and make sure that all existing clients continue to work.

Given those constraints, we plan to wait until Q1 next FY to provision a new 'aggregate' Kafka cluster outside the Analytics VLAN, using the latest Kafka with security features enabled. We can then mirror topics from the 'analytics' cluster to the new one, test and move Kafka clients to the new cluster one by one, and eventually decommission the 'analytics' Kafka cluster.

So, yes, we will do this! But it will wait until next FY (unless maybe there is ops budget to build a new Kafka cluster in Q4 this year... :) ).

Ottomata renamed this task from Kafka Security Features to Provision new Kafka clusters in eqiad and codfw with security features.Mar 28 2017, 5:58 PM
Ottomata updated the task description. (Show Details)

I just heard that there is some budget to start provisioning these clusters sooner rather than later! :)

Ottomata added a subtask: Unknown Object (Task).Mar 28 2017, 6:12 PM
Ottomata added a subtask: Unknown Object (Task).

@elukey

When provisioning the main-* clusters, we decided to use RAID-10 instead of JBOD. This was certainly the right choice for the main-* clusters' usage profile, but I'm not sure it is for this beefier cluster. I thought I had read that RAID was recommended over JBOD, but I just read: http://docs.confluent.io/2.0.1/kafka/deployment.html#disks

Our recommendation is to configure your Kafka server with multiple log directories, each directory mounted on a separate drive.

So, either the recommendation has changed, or I am just remembering incorrectly.

Pros of RAID: a disk failure doesn't take the broker offline.
Cons of RAID: less storage capacity and slower writes; a disk failure means lots of IO during the array rebuild.

Pros of JBOD: more storage capacity.
Cons of JBOD: a disk failure takes the broker offline; after the disk is replaced, all brokers participate in network IO to re-replicate partitions onto it.
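For illustration, the broker-side difference is just the log.dirs setting in server.properties: JBOD means one log directory per physical disk, RAID-10 means a single directory on the array. A trivial sketch (the mount points are hypothetical, not our actual partman layout):

```python
# Illustrative only: the log.dirs line a JBOD broker would use, assuming
# hypothetical mount points /srv/kafka/{a..l}/data, one per physical disk.
jbod_mounts = ["/srv/kafka/%s/data" % d for d in "abcdefghijkl"]
print("log.dirs=" + ",".join(jbod_mounts))

# Under RAID-10 the broker instead sees a single logical volume:
print("log.dirs=/srv/kafka/data")
```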

Ohhh I dunno. It sure would be nice not to lose a whole broker for a single disk failure...

@Ottomata the ideal scenario would be a comparative performance test on production hardware, but that would be really painful and hard to set up (if not, please tell me!). It is indeed bad to alarm and page people every time a disk breaks on a broker, but we could revise our alerting to be a bit gentler and page only in the more severe cases.

Anyhow, in my opinion the risk of slowing down writes is too high to try the RAID-10 route, so I'd stick with the current configuration if possible, assuming the comparative test described above turns out not to be feasible.

RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.Apr 12 2017, 12:54 AM
Milimetric triaged this task as Medium priority.May 8 2017, 2:06 PM
Ottomata renamed this task from Provision new Kafka clusters in eqiad and codfw with security features to Provision new Kafka cluster(s) with security features.May 17 2017, 12:57 PM
Ottomata closed subtask Unknown Object (Task) as Declined.May 17 2017, 1:50 PM

We recently submitted two hardware orders to provision beefy aggregate clusters in both eqiad (T161636) and codfw (T161637). After some more in-depth discussions with Luca and Joseph, we agreed that provisioning a new beefy codfw cluster isn't the best idea. We're still going to replace the 'analytics-eqiad' Kafka cluster with the hardware ordered in T161636 as planned. But instead of provisioning a new cluster in codfw, we would like to handle future production / cross-DC Kafka use cases with the main Kafka clusters that already exist in both primary DCs. The main Kafka clusters may need to be expanded to handle future use cases, such as T157088. I'll try to summarize the reasons for this decision here.

We handle cross-DC replication in Kafka by prefixing topics with datacenter names, e.g. eqiad.mediawiki.revision-create. This allows us to whitelist topics for cross-DC replication. Messages from eqiad producers go to eqiad.-prefixed topics in the main-eqiad Kafka cluster, and vice versa for codfw. MirrorMaker is configured to replicate eqiad.-prefixed topics from main-eqiad to main-codfw, and vice versa.
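As a concrete sketch of the prefixing convention (using the third-party kafka-python client purely for illustration; the broker hostname and payload here are hypothetical):

```python
from kafka import KafkaProducer

# A producer running in eqiad writes to the eqiad.-prefixed topic on main-eqiad.
producer = KafkaProducer(bootstrap_servers="kafka1001.eqiad.wmnet:9092")  # hypothetical broker
producer.send("eqiad.mediawiki.revision-create", b'{"comment": "example payload"}')
producer.flush()

# MirrorMaker is then whitelisted to copy eqiad.* topics from main-eqiad to
# main-codfw, and codfw.* topics in the opposite direction.
```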

This ticket is about refreshing the analytics-eqiad Kafka cluster, but it is also about promoting this system to something beyond Analytics. It is used by more than just Analytics (Discovery, Ops, etc.), and we would like to encourage more use of this cluster. For the remainder of this ticket, I'll refer to the analytics-eqiad replacement cluster as aggregate-eqiad (final name still TBD). The main Kafka clusters are on weaker hardware than aggregate-eqiad, and are also in a more critical path: if they stop working, change-prop as well as other services may stop working. The main Kafka clusters are intended for production uses and for uses where messages must be available in both primary DCs.

Our original plan with this ticket was to provision beefy, not-critical-production Kafka clusters in both DCs so we could encourage more use of cross-DC stream data. All topics would be datacenter-prefixed, and we would configure MirrorMaker-based cross-DC replication in the same way we did for the main clusters. We had also talked about potentially merging the use of the main Kafka clusters into these new beefy clusters, if/when it made sense to do so.

However, this doesn't really fit the use case of an aggregate cluster. We do need some topics replicated cross-DC, but not all. Webrequest is huge compared to our other topics. Currently, caching hosts in all DCs produce webrequest messages directly to the analytics-eqiad Kafka cluster. What would we do with webrequest if we were to spin up aggregate clusters in both primary DCs? Where would webrequest log messages from ulsfo go? Would they go to codfw because it is closer? Would they then be prefixed with codfw., even though they originated in ulsfo? They'd have to be for our eqiad<->codfw replication setup to work. But what happens when we switch DCs? Also, what would the webrequests be used for in codfw? The Analytics Cluster does not exist there, nor is one planned. That'd be a lot of extra traffic between DCs for no good reason.

The new plan keeps the basic layout we have now. New production use cases that need cross-DC replication should use the main Kafka clusters.

Now we are left with a big budgeting question! Since we won't be provisioning the codfw aggregate cluster, we have some money that will disappear at the end of June. Should we use it to expand the main Kafka clusters? I'd like to get some feedback from folks working on T157088, so I'll poke over there about it. If we are going to use this budget to expand the main Kafka clusters, we need to get a quote and place an order ASAP, likely by the end of next week (the 26th).

Luca and I have started an etherpad to plan work here: https://etherpad.wikimedia.org/p/analytics-ops-kafka

These will be converted to phab tasks as we go.

Requirements for Kafka TLS encryption and authentication:

  • All Kafka messages must be encrypted over the wire, broker to broker and broker to client.
  • Access to certain private-data topics (like webrequest) will be restricted via Kafka ACLs.
  • Clients will authenticate themselves (for ACL authorization) using TLS certs.

The above means that we need the following:

  • An automated way to generate CA-signed keypairs for brokers and specific clients, in both Java keystore and .pem formats.
  • An automated way to distribute keypairs and the CA cert to brokers and clients.

T166167 is about getting us that, but it might also be possible to use the puppetmaster CA to generate and distribute client keypairs.
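To make the client side concrete, here is a hedged sketch of what a TLS-authenticated consumer could look like once a client keypair and the CA cert are distributed in .pem form. The paths, hostname, and port are hypothetical, and the third-party kafka-python client stands in for whatever library each service actually uses:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "webrequest_text",  # example of an ACL-restricted private-data topic
    bootstrap_servers="kafka-jumbo1001.eqiad.wmnet:9093",  # hypothetical TLS listener
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/ssl/ca.crt.pem",        # CA cert distributed to clients
    ssl_certfile="/etc/kafka/ssl/client.crt.pem",  # client's CA-signed certificate
    ssl_keyfile="/etc/kafka/ssl/client.key.pem",   # client's private key
)
for message in consumer:
    print(message.topic, message.partition, message.offset)
```

With SSL client authentication, the broker derives the client's principal from the certificate's distinguished name, which is what the Kafka ACLs would then be applied against.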

It would also be nice to have automated management of Kafka ACLs, but we'll get to that later.

faidon closed subtask Unknown Object (Task) as Resolved.Jun 16 2017, 12:58 AM
Nuria set the point value for this task to 0.Jul 31 2017, 3:41 PM

Hm, there is a 0.11.0.1 RC out, with some bugfixes for 0.11.0.0: http://home.apache.org/~damianguy/kafka-0.11.0.1-rc0/RELEASE_NOTES.html

I think it would be wise to wait for this before we start porting clients to it. For now we can go ahead and install 0.11 on kafka-jumbo* to poke around. We'll wipe it and start anew with 0.11.0.1 when it is out.

Change 391649 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Update kafka-jumbo to kafka 0.11.0.1

https://gerrit.wikimedia.org/r/391649

Change 391649 merged by Ottomata:
[operations/puppet@production] Update kafka-jumbo to kafka 0.11.0.1

https://gerrit.wikimedia.org/r/391649

Nuria moved this task from Dashiki to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.