
Orchestrate gobblin ingestion task with Airflow and config store.
Open · Needs Triage · Public · 8 Estimated Story Points

Description

Gobblin is the system used at Wikimedia to ingest data from Kafka topics into HDFS. Currently Gobblin ingestion is triggered by a systemd timer.

We should move the orchestration logic into an Airflow (dynamic) DAG, and its configuration into the dataset config store.
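A minimal sketch of the "dynamic DAG" idea: one ingestion job is generated per stream declared in the dataset config store, so adding a stream only needs a config change, not a code deployment. All names here (config layout, field names, job id scheme) are hypothetical, not the actual repo layout; in a real DAG file the loop would instantiate one Airflow operator per stream instead of returning dicts.

```python
# Hypothetical per-stream entries as they might appear in the config store.
STREAMS_CONFIG = {
    "mediawiki.page-create": {"sink_table": "event.mediawiki_page_create"},
    "mediawiki.revision-create": {"sink_table": "event.mediawiki_revision_create"},
}


def build_ingestion_jobs(config):
    """Expand the config into one Gobblin job spec per stream.

    In an Airflow DAG file, this loop would create one task/operator
    per stream; here it returns plain dicts to keep the sketch
    self-contained.
    """
    jobs = []
    for stream, opts in config.items():
        jobs.append({
            "job_id": "gobblin_ingest_" + stream.replace(".", "_").replace("-", "_"),
            "kafka_topic": stream,
            "hdfs_sink": opts["sink_table"],
        })
    return jobs


for job in build_ingestion_jobs(STREAMS_CONFIG):
    print(job["job_id"])
```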

Part of the work was spiked in T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?. The following work is required:

Gobblin is a critical piece of infra, and an API boundary between Event Platform and the Data Lake. This phab will result in a number of design decisions.
We should

  • Provide a Design Document that illustrates how this system works, and how it will be integrated with the refinery refactoring effort.
  • Provide a decision record documenting the "whys".

References

Event Timeline

Ahoelzl set the point value for this task to 8. (Apr 4 2024, 12:32 AM)
gmodena renamed this task from [NEEDS GROOMING] Orchestrate gobblin ingestion task with Airflow to Orchestrate gobblin ingestion task with Airflow and config store. (May 15 2024, 1:23 PM)
gmodena updated the task description.

Gobblin uses ESC to discover streams to ingest. Given that we will not be removing support for automated stream -> Hive ingestion, we need to make sure that Gobblin still ingests new streams automatically, without extra commit and deployment steps.
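The automatic-discovery behaviour referred to above could be sketched roughly as follows: Gobblin asks EventStreamConfig (ESC) for all streams and ingests any that opt in, so no per-stream commit is needed. The `ingest` flag and the shape of the config map are stand-ins for whatever the real ESC response uses, not its actual schema.

```python
def streams_to_ingest(esc_streams):
    """Return the names of streams flagged for ingestion.

    `esc_streams` mimics an ESC API response: stream name -> settings
    dict. The `ingest` flag is a hypothetical stand-in for the real
    opt-in setting.
    """
    return sorted(
        name for name, settings in esc_streams.items()
        if settings.get("ingest", False)
    )


example = {
    "mediawiki.page-create": {"ingest": True},
    "test.internal-only": {"ingest": False},
}
print(streams_to_ingest(example))  # ['mediawiki.page-create']
```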

Provide a schema for Gobblin config. As a starting point, consider the outcome of this spike work https://phabricator.wikimedia.org/T361017#9754921
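As a strawman for what such a schema might cover, here is a hedged sketch of a per-job config expressed as a Python dataclass. Every field name and default here is an assumption for illustration, not the schema agreed in the spike.

```python
from dataclasses import dataclass, field


@dataclass
class GobblinJobConfig:
    """Hypothetical shape of a single Gobblin ingestion job config."""
    stream: str                  # Event Platform stream / Kafka topic name
    destination_table: str      # target Hive table in the Data Lake
    # Typical hourly time-partitioning; purely illustrative defaults.
    partition_keys: list = field(
        default_factory=lambda: ["year", "month", "day", "hour"]
    )
    schedule: str = "@hourly"   # ingestion cadence


cfg = GobblinJobConfig(
    stream="mediawiki.page-create",
    destination_table="event.mediawiki_page_create",
)
print(cfg.schedule)
```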

In your prototype, I see that the stream to ingest is explicitly configured per Gobblin operator. IIUC, this will mean hundreds of manually managed Gobblin jobs in the Datasets Config repo.

As proposed, this would remove support for automated stream -> Hive ingestion.

Can we close this? I don't think this is relevant anymore.