Evaluate possible replacements for Camus: Gobblin, Marmaray, etc.
Open, HighPublic

Description

Some time ago Gobblin was reviewed by us: T111409

Now the project is incubating at Apache:

It doesn't seem abandoned, but it's not super active either (compared to other projects like Airflow, etc.). Camus shows some signs of age, and we are still unsure whether the HDFS connector will be usable in the near future, so I think we need to start evaluating Camus replacements.

Event Timeline

elukey created this task.Fri, Nov 15, 10:04 AM
Restricted Application added a subscriber: Aklapper.Fri, Nov 15, 10:04 AM

+1, we should consider Gobblin again. I was looking at it a bit the other day too. I'm not giving up on Kafka Connect yet, but I'm not hopeful.

Another brand-new tool we could have a look at: https://github.com/uber/marmaray

Hm, a very quick read of that looks pretty good!

Ottomata renamed this task from Evaluate (again) Gobblin as possible replacement for Camus to Evaluate possible replacements for Camus: Gobblin, Marmaryan, etc..Mon, Nov 18, 4:40 PM
JAllemandou renamed this task from Evaluate possible replacements for Camus: Gobblin, Marmaryan, etc. to Evaluate possible replacements for Camus: Gobblin, Marmaray, etc..Mon, Nov 18, 4:41 PM
Ottomata triaged this task as High priority.Mon, Nov 18, 4:41 PM
Ottomata moved this task from Incoming to Modern Event Platform on the Analytics board.
Ottomata added a project: Event-Platform.

We should def consider these things as we think about refactoring sanitization:

At Uber, all Kafka data is stored in append-only format with date-level partitions. The data for any specific user can span over multiple date partitions and will often have many Kafka records per partition. Scanning and updating all these partitions to correct, update, or delete user data can become very resource-intensive if the underlying storage doesn’t include built-in indexing and update support. The Parquet data stores used by Hadoop don’t support indexing, and we simply can’t update Parquet files in place. To facilitate indexing and update support, Marmaray instead uses Hadoop Updates and Incremental (Hudi), an open source library also developed at Uber that manages storage of large analytical datasets to store the raw data in Hive.
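For illustration, here's a minimal sketch (Scala, via Hudi's Spark DataSource API) of the kind of keyed update that plain Parquet can't do in place. The paths, table name, and field names are all made up, and the option names follow recent Apache Hudi releases, so treat this as a rough shape rather than a working recipe:

```
// Minimal sketch (untested): upserting already-masked records into a Hudi
// dataset with Spark. Paths, table name, and key fields are hypothetical.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-upsert-sketch").getOrCreate()

// Records whose user-specific fields were masked upstream.
val maskedUpdates = spark.read.parquet("/tmp/masked_records")

maskedUpdates.write.format("hudi")
  .option("hoodie.table.name", "events")
  // Hudi indexes on the record key, which is what makes targeted updates
  // possible where plain Parquet would need a full partition rewrite.
  .option("hoodie.datasource.write.recordkey.field", "event_id")
  .option("hoodie.datasource.write.partitionpath.field", "date")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/data/hudi/events")
```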

At a high level, data producers scan the table using Hive, identify records to be deleted, and publish them to a Kafka cluster with user-specific information removed or masked. Marmaray's Kafka ingestion pipeline in turn reads them from the Kafka cluster, which has both new and updated (to-be-deleted) records. Marmaray then ingests pure new records using Hudi's bulk insert feature, keeping ingestion latencies low, and processes updated records using Hudi's upsert feature to replace older Kafka records with newer modifications.
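If I'm reading that right, the routing could be sketched roughly like this (same hypothetical table as above; the `is_update` marker is made up, standing in for whatever producers actually use to flag to-be-replaced records):

```
// Hypothetical sketch: send pure-new records through Hudi's bulk_insert
// (no index lookup, so ingestion stays cheap) and updated records through
// upsert (index lookup finds and replaces the older versions).
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeHudi(df: DataFrame, operation: String): Unit =
  df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .option("hoodie.datasource.write.operation", operation)
    .mode(SaveMode.Append)
    .save("/data/hudi/events")

// `decoded` would be a batch of Kafka records already parsed into columns;
// `is_update` is the made-up flag set by the data producers.
def ingest(decoded: DataFrame): Unit = {
  writeHudi(decoded.filter("NOT is_update"), "bulk_insert")
  writeHudi(decoded.filter("is_update"), "upsert")
}
```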

I just looked through it a bit, and I can't seem to find any documentation on how to use Marmaray. The code isn't much help either. All I can find are the README and Joseph's linked blog post.

I do like the architecture of it more than Gobblin, but it might not be so easy...