
Replace Camus with Kafka Connect for event data imports
Open, HighPublic

Description

Camus imports data from Kafka and writes it to HDFS. It is old and unsupported, and does not use a modern Kafka client. Unfortunately, the Kafka Connect HDFS plugin is written by Confluent and was recently placed under their Confluent Community License. We will either have to get approval to use the HDFS connector under this license, or fork the connector from before the license change.

Event Timeline

Ottomata created this task. May 17 2019, 3:49 PM
Milimetric triaged this task as High priority. May 20 2019, 3:36 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.

It looks like the latest kafka-connect-hdfs release not under the CCL is 5.0.3, released on March 24th this year.

@Nuria @faidon o/

It is time to start the discussion about the Confluent Community License. This is complicated, and I know we have 'policies'(?) about using only FLOSS/OSI-OSD licenses. The core Kafka Connect framework is Apache licensed, but some of the plugins, e.g. kafka-connect-hdfs, are from Confluent and are licensed under the CCL.

I'd really like to use kafka-connect-hdfs to replace Camus. I'd much rather use up-to-date releases than fork from a pre-CCL version and maintain it from there. Faidon and I talked about this a bit at All-Hands this year. I understand our intentions of using open source only, but I also understand why Confluent has chosen to use this new license. For them it is the best they can do as a for-profit company: they want to make free and open software without allowing others to use that software to compete directly with their revenue model. More info in this CCL FAQ and this blog post.

Since hosting a streaming SaaS is quite far outside of WMF's mission, I don't think the CCL conflicts with our intentions of using open source, even if it technically conflicts with FLOSS/OSI-OSD. I think Faidon differs. :)

Let's discuss! Who else should we add to this conversation?

Unfortunately, I think this is one of the matters that we cannot fully discuss in a public task. I'll start a private email thread; if anyone reading this is interested to be part of this, ping me off-list and I can loop you in :)

Ok!

Other questions too. Assuming we will use Kafka Connect (and hopefully kafka-connect-hdfs) in some capacity, we'll have to figure out where and how to run it. It seems the easiest way to run a distributed cluster is to do so via Kubernetes. For this task, we could alternatively run this in YARN, but that would take a lot of manual work (and use of Slider?) to do. I'd of course prefer if we did something more standard like Kubernetes. But, this would mean that we run a Kubernetes service that can read from Kafka and write to HDFS & Hive.

A distributed Kafka Connect cluster running in k8s would accept actual Connect jobs via its REST API. For this task, we'd have a single job reading from Kafka topics and writing to HDFS directories & Hive tables. We might want to have more Connect jobs in the future, but those are outside the scope of this task. Those could be run in the same Kafka Connect cluster or a totally separate one.
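To make that concrete: a Connect job is just a JSON connector config POSTed to the cluster's REST API. A rough sketch of what ours might look like with the HDFS sink; the connector name, topic list, HDFS URL, metastore URI, and tuning values here are placeholders for illustration, not actual values:

```json
{
  "name": "hdfs-event-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "3",
    "topics": "eventlogging_ExampleSchemaA,eventlogging_ExampleSchemaB",
    "hdfs.url": "hdfs://analytics-hadoop/wmf/data/raw/event",
    "flush.size": "10000",
    "rotate.interval.ms": "600000",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "partition.duration.ms": "3600000",
    "locale": "en",
    "timezone": "UTC",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://hive-metastore.example.org:9083",
    "hive.database": "event"
  }
}
```

Submitting this to any worker's REST endpoint (POST /connectors) would have the cluster distribute the sink tasks across workers, with Hive table creation/partition registration handled by the connector's Hive integration.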

See also https://docs.confluent.io/current/connect/concepts.html#connect-workers and https://docs.confluent.io/current/connect/userguide.html#distributed-worker-configuration
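For reference, the distributed workers themselves are configured with a shared properties file per that userguide; something along these lines, where broker hostnames, topic names, and paths are placeholders (all workers sharing the same group.id and storage topics form one cluster):

```
bootstrap.servers=kafka-broker1.example.org:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
rest.port=8083
plugin.path=/usr/share/java
```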

@akosiaris thoughts about running this in k8s? Our timeline is beginning to use this in Q2.