
Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc.
Closed, Resolved · Public

Description

Gobblin
We reviewed Gobblin some time ago: T111409. Notes here: https://etherpad.wikimedia.org/p/gobblin-sprint

The project is now incubating at Apache:

It doesn't seem abandoned, but it's not super active either (compared to other projects like Airflow, etc.). Camus shows some signs of age, and we are still unsure whether the HDFS connector will be usable in the near future, so I think we need to start evaluating Camus replacements.

Marmaray

Released by Uber. Like Gobblin, but built on Spark instead of MapReduce. Uses Uber's Hudi for HDFS upserts. It can also export from HDFS. Not very active, and documentation is sparse.

Kafka Connect + Kafka Connect HDFS

Kafka Connect is a generic Kafka source & sink framework. Kafka Connect HDFS is an HDFS + Hive sink. Its license was changed from Apache to the non-FLOSS Confluent Community License 1.5 years ago. Supports Hive schema evolution.
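For a rough sense of how it would be configured, here's a hedged sketch of an HDFS sink connector config expressed as a Scala map (connector.class and the config keys come from the kafka-connect-hdfs docs; all values, hostnames, and the topic name are made-up assumptions, not our actual settings):

```
val hdfsSinkConfig: Map[String, String] = Map(
  "connector.class"      -> "io.confluent.connect.hdfs.HdfsSinkConnector",
  "topics"               -> "eventlogging_ExampleSchema",    // hypothetical topic
  "hdfs.url"             -> "hdfs://analytics-hadoop",       // hypothetical HDFS URI
  "flush.size"           -> "10000",
  "format.class"         -> "io.confluent.connect.hdfs.parquet.ParquetFormat",
  "hive.integration"     -> "true",
  "hive.metastore.uris"  -> "thrift://hive-metastore:9083",  // hypothetical
  "schema.compatibility" -> "BACKWARD"                       // what drives Hive schema evolution
)
// In practice this would be JSON POSTed to the Connect REST API (or a .properties
// file for connect-standalone) rather than Scala, but the keys are the same.
```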

Event Timeline

+1, we should consider Gobblin again. I was looking at it the other day a bit too. I'm not giving up on Kafka Connect yet, but I'm not hopeful.

Hm a very quick read of that looks pretty good!

Ottomata renamed this task from "Evaluate (again) Gobblin as possible replacement for Camus" to "Evaluate possible replacements for Camus: Gobblin, Marmaryan, etc.". Nov 18 2019, 4:40 PM
JAllemandou renamed this task from "Evaluate possible replacements for Camus: Gobblin, Marmaryan, etc." to "Evaluate possible replacements for Camus: Gobblin, Marmaray, etc.". Nov 18 2019, 4:41 PM

We should def consider these things as we think about refactoring sanitization:

At Uber, all Kafka data is stored in append-only format with date-level partitions. The data for any specific user can span over multiple date partitions and will often have many Kafka records per partition. Scanning and updating all these partitions to correct, update, or delete user data can become very resource-intensive if the underlying storage doesn’t include built-in indexing and update support. The Parquet data stores used by Hadoop don’t support indexing, and we simply can’t update Parquet files in place. To facilitate indexing and update support, Marmaray instead uses Hadoop Updates and Incremental (Hudi), an open source library also developed at Uber that manages storage of large analytical datasets to store the raw data in Hive.

At a high level, data producers scan the table using Hive, identify records to be deleted, and publish them to a Kafka cluster with user-specific information removed or masked. Marmaray’s Kafka ingestion pipeline in turn reads them from the Kafka cluster, which has both new and updated (to-be-deleted) records. Marmaray then ingests pure new records using Hudi’s bulk insert feature, keeping ingestion latencies low, and processes updated records using Hudi’s upsert feature to replace older Kafka records with newer modifications.
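To make the Hudi part concrete, here's a minimal Spark sketch of the two write modes that flow relies on (this is plain Hudi usage, not Marmaray's actual code; the paths, field names, and table name are all assumptions):

```
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-ingest-sketch").getOrCreate()

    // Hypothetical raw Kafka dumps already landed as JSON; field names are assumptions.
    val newRecords     = spark.read.json("/wmf/data/raw/events/new")
    val updatedRecords = spark.read.json("/wmf/data/raw/events/updates")

    val hudiOptions = Map(
      "hoodie.table.name"                           -> "events",
      "hoodie.datasource.write.recordkey.field"     -> "uuid",      // assumed unique key
      "hoodie.datasource.write.precombine.field"    -> "ts",        // newest record wins
      "hoodie.datasource.write.partitionpath.field" -> "datestamp"
    )

    // New records: cheap bulk insert keeps ingestion latency low.
    newRecords.write.format("org.apache.hudi")
      .options(hudiOptions)
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode(SaveMode.Append)
      .save("/wmf/data/hudi/events")

    // Updated (to-be-deleted/masked) records: upsert replaces the older versions in place.
    updatedRecords.write.format("org.apache.hudi")
      .options(hudiOptions)
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save("/wmf/data/hudi/events")

    spark.stop()
  }
}
```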

I just looked through it a bit, and I can't seem to find any documentation on how to use Marmaray. The code isn't much help either. All I can find are the README and Joseph's linked blog post.

I do like the architecture of it more than Gobblin, but it might not be so easy...

Note for whoever will test this - we need to make sure that the new tool works with either TLS or SASL auth when pulling data from Kafka.
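To illustrate what that means for whichever Kafka client the tool embeds, here's a hedged sketch of the consumer-side settings (these are standard Kafka client configs; the broker hostname, paths, topic, and credentials are made-up placeholders):

```
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object SecureConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker.example.wmnet:9093") // placeholder broker
    props.put("group.id", "camus-replacement-test")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // TLS variant:
    props.put("security.protocol", "SSL")
    props.put("ssl.truststore.location", "/etc/kafka/truststore.jks") // placeholder path
    props.put("ssl.truststore.password", "changeit")

    // ...or SASL over TLS (e.g. SCRAM) instead:
    // props.put("security.protocol", "SASL_SSL")
    // props.put("sasl.mechanism", "SCRAM-SHA-512")
    // props.put("sasl.jaas.config",
    //   "org.apache.kafka.common.security.scram.ScramLoginModule required " +
    //     "username=\"ingest\" password=\"secret\";")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("test_topic")) // placeholder topic
    // Whatever replaces Camus needs to be able to pass settings like these
    // through to its Kafka client.
  }
}
```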

Ottomata added a project: Analytics-Kanban.

Why do all of these projects feel the need to implement their own version of JSONSchema?! (눈_눈)
https://gobblin.readthedocs.io/en/latest/user-guide/Source-schema-and-Converters/#schema-specification

I haven't tried Gobblin or Marmaray yet, but here are some quick thoughts.

Gobblin runs jobs in MapReduce, much like Camus. It is a huge Java project, which has its pros and cons, and it hasn't had a release in 2.5 years. Based on their pull requests, they do seem to be slightly active recently, so perhaps they are re-investing in it and will make a release soon. They did just refactor some of their Kafka classes, which makes them more extensible (and easier to upgrade). The latest Kafka client they support is 0.9.0.1, which does have TLS support but is many years old. The Gobblin documentation is extensive, though not easy to digest quickly.

Marmaray is like Gobblin but based on Spark and Hudi. I think Marmaray would be a better fit for us, especially since so much of our ingestion pipeline is based on Spark. However, I don't see a lot of recent activity on GitHub; the most recent patch merged to master was over a year ago. The documentation consists of only a blog post and the README, and there are no examples of how to solve our use cases. The one example they do have is a built-in HDFS -> Cassandra pipeline, which is pretty interesting given that we do that as well.

Kafka Connect HDFS's latest non-CCL version is from about a year ago. Since then they've had several releases, but it is hard to say what exactly is in them. Confluent does have a changelog, but I think they only update it when they roll a new version of the Confluent Platform.


I still want to spike and try all three of these options, but I'll say that I'm currently leaning towards forking an old Kafka Connect HDFS, and here's why. I think it is better, but more relevantly, I think Gobblin and Marmaray will go the way of Camus. Kafka Connect is more than just Kafka Connect HDFS: both Marmaray and Gobblin include complex job flow and execution management architectures, and so does Kafka Connect, but that part of Kafka Connect is very active and part of the official open source Apache Kafka release. Kafka Connect HDFS itself is a single plugin offered by Confluent. I still need more evidence, but at the moment I think that using Kafka Connect with a forked Kafka Connect HDFS and maintaining that ourselves will be less of a headache than adopting a less active, not-as-good, and more complex project maintained by a single upstream company (OK, Gobblin is in the Apache Incubator, but look at the Incubator status page: no updates since "2017-02-23 Project enters incubation").

Am I biased? ...yes :p I will try to do a fair comparison, but I will likely need some shoves to anti-bias myself (cough cough Luca help me) :)

Ottomata renamed this task from "Evaluate possible replacements for Camus: Gobblin, Marmaray, etc." to "Evaluate possible replacements for Camus: Gobblin, Marmaray, Kafka Connect HDFS, etc.". May 14 2020, 1:49 PM
Ottomata updated the task description.

A con against Kafka Connect: it does not run natively in Yarn. We'd have to run it in k8s, figure out how to run it in Yarn, or provision some Ganeti instances. In Hadoop 3, this could be done with the new Yarn Services API, which is kind of a built-in Slider... but it also works with Docker containers! Pretty cool!