
Provision new Kafka cluster(s) with security features
Closed, Resolved · Public · 0 Estimated Story Points

Description

The 'analytics-eqiad' Kafka cluster hardware is due to be refreshed. We also want to enable Kafka security features, move this large, beefy Kafka cluster out of the 'Analytics Cluster' / Analytics VLAN, and make it a fully productionized Kafka cluster available for use by production services. We may eventually merge the existing smaller main-* Kafka clusters into these new ones, but we don't have plans to do that within the next year.

This is the parent ticket for the provisioning, upgrade and migration plan for these new clusters.


Event Timeline

Talked with Luca today, and with the Analytics team, about goals planning. We have a timeline for this.

We need to replace the existing Kafka brokers, since they are out of warranty. We would also like to rename/move the 'analytics' Kafka cluster: it is no longer (and shouldn't be) just for 'analytics' purposes. This beefy cluster acts more like a general-purpose aggregate cluster, usable for both analytics and other services. We don't have the budget this FY to provision new Kafka brokers, and we don't have the bandwidth (at least in Q3) to upgrade Kafka, enable security features, and make sure that all existing clients continue to work.

Given those constraints, we plan to wait until Q1 next FY to provision a new 'aggregate' Kafka cluster outside the Analytics VLAN, using the latest Kafka with security features enabled. We can then mirror topics from the 'analytics' cluster to the new one, test and move Kafka clients to the new cluster one by one, and eventually decommission the 'analytics' Kafka cluster.

So, yes, we will do this! But it will wait until next FY (unless maybe there is ops budget to build a new Kafka cluster in Q4 this year... :) ).

Ottomata renamed this task from Kafka Security Features to Provision new Kafka clusters in eqiad and codfw with security features.Mar 28 2017, 5:58 PM
Ottomata updated the task description. (Show Details)

I just heard that there is some budget to start provisioning these clusters sooner rather than later! :)

Ottomata added a subtask: Unknown Object (Task).Mar 28 2017, 6:12 PM
Ottomata added a subtask: Unknown Object (Task).

@elukey

When provisioning the main-* clusters, we decided to use RAID-10 instead of JBOD. This was certainly the right choice for the main-* clusters' usage profile, but I'm not sure it is for this beefier cluster. I thought I had read that RAID was recommended over JBOD, but I just read: http://docs.confluent.io/2.0.1/kafka/deployment.html#disks

Our recommendation is to configure your Kafka server with multiple log directories, each directory mounted on a separate drive.

So, either the recommendation has changed, or I am just remembering incorrectly.

Pros of RAID: a disk failure doesn't take the broker offline.
Cons of RAID: less storage capacity and slower writes; a disk failure means lots of IO during the array rebuild.

Pros of JBOD: more storage capacity.
Cons of JBOD: a disk failure takes the broker offline; after the disk is replaced, all brokers participate in network IO to re-replicate partitions onto it.
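For illustration, the broker-side difference is just the log.dirs setting in server.properties: JBOD means one log directory per physical disk, RAID-10 means a single directory on the array. A trivial sketch (the mount points are hypothetical, not our actual partman layout):

```python
# Illustrative only: the log.dirs line a JBOD broker would use, assuming
# hypothetical mount points /srv/kafka/{a..l}/data, one per physical disk.
jbod_mounts = ["/srv/kafka/%s/data" % d for d in "abcdefghijkl"]
print("log.dirs=" + ",".join(jbod_mounts))

# Under RAID-10 the broker instead sees a single logical volume:
print("log.dirs=/srv/kafka/data")
```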

Ohhh I dunno. It sure would be nice not to lose a whole broker for a single disk failure...

@Ottomata the ideal scenario would be a comparative performance test on production hardware, but that would be really painful and hard to set up (if not, please tell me!). It is indeed bad to alarm and page people every time a disk breaks on a broker, but we could revise our alerting to be a bit gentler and page only in the more severe cases.

Anyhow, in my opinion the risk of slowing down writes is too high to try the RAID-10 route, so I'd stick with the current configuration if possible, assuming the comparative test described above turns out not to be feasible.

RobH changed the status of subtask Unknown Object (Task) from Open to Stalled.Apr 12 2017, 12:54 AM
Milimetric triaged this task as Medium priority.May 8 2017, 2:06 PM
Ottomata renamed this task from Provision new Kafka clusters in eqiad and codfw with security features to Provision new Kafka cluster(s) with security features.May 17 2017, 12:57 PM
Ottomata closed subtask Unknown Object (Task) as Declined.May 17 2017, 1:50 PM

We recently submitted two hardware orders to provision beefy aggregate clusters in both eqiad (T161636) and codfw (T161637). After some more in-depth discussions with Luca and Joseph, we agreed that provisioning a new beefy codfw cluster isn't the best idea. We're still going to replace the 'analytics-eqiad' Kafka cluster with the hardware ordered in T161636 as planned. But instead of provisioning a new cluster in codfw, we would like to handle future production / cross-DC Kafka use cases with the main Kafka clusters that already exist in both primary DCs. The main Kafka clusters may need to be expanded to handle future use cases, such as T157088. I'll try to summarize the reasons for this decision here.

We handle cross-DC replication in Kafka by prefixing topics with datacenter names, e.g. eqiad.mediawiki.revision-create. This allows us to whitelist topics for cross-DC replication. Messages from eqiad producers go to eqiad.-prefixed topics in the main-eqiad Kafka cluster, and vice versa for codfw. MirrorMaker is configured to replicate eqiad.-prefixed topics from main-eqiad to main-codfw, and vice versa.
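As a concrete sketch of the prefixing convention (using the third-party kafka-python client purely for illustration; the broker hostname and payload here are hypothetical):

```python
from kafka import KafkaProducer

# A producer running in eqiad writes to the eqiad.-prefixed topic on main-eqiad.
producer = KafkaProducer(bootstrap_servers="kafka1001.eqiad.wmnet:9092")  # hypothetical broker
producer.send("eqiad.mediawiki.revision-create", b'{"comment": "example payload"}')
producer.flush()

# MirrorMaker is then whitelisted to copy eqiad.* topics from main-eqiad to
# main-codfw, and codfw.* topics in the opposite direction.
```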

This ticket is about refreshing the analytics-eqiad Kafka cluster, but it is also about promoting this system to something beyond Analytics. It is used by more than just Analytics (Discovery, Ops, etc.), and we would like to encourage more use of this cluster. For the remainder of this ticket, I'll refer to the analytics-eqiad replacement cluster as aggregate-eqiad (final name still TBD). The main Kafka clusters are on weaker hardware than aggregate-eqiad, and are also in a more critical path: if they stop working, change-prop as well as other services may stop working. The main Kafka clusters are intended for production uses and for uses where messages must be available in both primary DCs.

Our original plan with this ticket was to provision beefy, not-critical-production Kafka clusters in both DCs so we could encourage more use of cross-DC stream data. All topics would be datacenter-prefixed, and we would configure MirrorMaker-based cross-DC replication in the same way we did for the main clusters. We had also talked about potentially merging the use of the main Kafka clusters into these new beefy clusters, if/when it made sense to do so.

However, this doesn't really fit the use case of an aggregate cluster. We do need some topics replicated cross-DC, but not all. Webrequest is huge compared to our other topics. Currently, caching hosts in all DCs produce webrequest messages directly to the analytics-eqiad Kafka cluster. What would we do with webrequest if we were to spin up aggregate clusters in both primary DCs? Where would webrequest log messages from ulsfo go? Would they go to codfw because it is closer? Would they then be prefixed with codfw., even though they originated in ulsfo? They'd have to be for our eqiad<->codfw replication setup to work. But what happens when we switch DCs? Also, what would the webrequests be used for in codfw? The Analytics Cluster does not exist there, nor is one planned. That'd be a lot of extra traffic between DCs for no good reason.

The new plan keeps the basic layout we have now. New production use cases that need cross-DC replication should use the main Kafka clusters.

Now we are left with a big budgeting question! Since we won't be provisioning the codfw aggregate cluster, we have some money that will disappear at the end of June. Should we use it to expand the main Kafka clusters? I'd like to get some feedback from folks working on T157088, so I'll poke over there about it. If we are going to use this budget to expand the main Kafka clusters, we need to get a quote and place an order ASAP, likely by the end of next week (the 26th).

Luca and I have started an etherpad to plan work here: https://etherpad.wikimedia.org/p/analytics-ops-kafka

These will be converted to phab tasks as we go.

Requirements for Kafka TLS encryption and authentication:

  • All Kafka messages must be encrypted over the wire, broker to broker and broker to client.
  • Access to certain private-data topics (like webrequest) will be restricted via Kafka ACLs.
  • Clients will authenticate themselves (for ACL authorization) using TLS certs.

The above means that we need the following:

  • An automated way to generate CA-signed keypairs for brokers and specific clients, in both Java keystore and .pem formats.
  • An automated way to distribute keypairs and the CA cert to brokers and clients.

T166167 is about getting us that, but it might also be possible to use the puppetmaster CA to generate and distribute client keypairs.
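To make the client side concrete, here is a hedged sketch of what a TLS-authenticated consumer could look like once a client keypair and the CA cert are distributed in .pem form. The paths, hostname, and port are hypothetical, and the third-party kafka-python client stands in for whatever library each service actually uses:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "webrequest_text",  # example of an ACL-restricted private-data topic
    bootstrap_servers="kafka-jumbo1001.eqiad.wmnet:9093",  # hypothetical TLS listener
    security_protocol="SSL",
    ssl_cafile="/etc/kafka/ssl/ca.crt.pem",        # CA cert distributed to clients
    ssl_certfile="/etc/kafka/ssl/client.crt.pem",  # client's CA-signed certificate
    ssl_keyfile="/etc/kafka/ssl/client.key.pem",   # client's private key
)
for message in consumer:
    print(message.topic, message.partition, message.offset)
```

With SSL client authentication, the broker derives the client's principal from the certificate's distinguished name, which is what the Kafka ACLs would then be applied against.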

It would also be nice to have automated management of Kafka ACLs, but we'll get to that later.

faidon closed subtask Unknown Object (Task) as Resolved.Jun 16 2017, 12:58 AM
Nuria set the point value for this task to 0.Jul 31 2017, 3:41 PM

Hm, there is a 0.11.0.1 RC out, with some bugfixes for 0.11.0.0: http://home.apache.org/~damianguy/kafka-0.11.0.1-rc0/RELEASE_NOTES.html

I think it would be wise to wait for this before we start porting clients to it. For now we can go ahead and install 0.11 on kafka-jumbo* to poke around. We'll wipe it and start anew with 0.11.0.1 when it is out.

Change 391649 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Update kafka-jumbo to kafka 0.11.0.1

https://gerrit.wikimedia.org/r/391649

Change 391649 merged by Ottomata:
[operations/puppet@production] Update kafka-jumbo to kafka 0.11.0.1

https://gerrit.wikimedia.org/r/391649

Nuria moved this task from Dashiki to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.