Gobblin is the system used at Wikimedia to ingest data from Kafka topics into HDFS. Currently Gobblin ingestion is triggered by a systemd timer.
We should move the orchestration logic into a dynamic Airflow DAG, and its configuration into the dataset config store.
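To illustrate the dynamic-DAG idea, the sketch below derives one ingestion DAG definition per dataset entry from a config-store-like mapping. The key names (`schedule`, `topics`, `hdfs_path`) and the `gobblin_ingest_` DAG-id prefix are hypothetical placeholders, not the actual config store layout:

```python
# Sketch: derive per-dataset Airflow DAG parameters from config store entries.
# All config keys and naming conventions here are illustrative assumptions.

def dag_params_for(datasets: dict) -> list[dict]:
    """Return one DAG definition (as a plain dict) per Gobblin job entry.

    In a real dynamic DAG file, each returned dict would be fed into an
    Airflow DAG constructor at parse time.
    """
    params = []
    for name, cfg in datasets.items():
        params.append({
            "dag_id": f"gobblin_ingest_{name}",          # hypothetical naming scheme
            "schedule": cfg.get("schedule", "@hourly"),  # default cadence, assumed
            "topics": cfg["topics"],
            "hdfs_path": cfg["hdfs_path"],
        })
    return params

# Example config-store-shaped input (contents invented for illustration).
example = {
    "webrequest": {
        "topics": ["webrequest_text"],
        "hdfs_path": "/wmf/data/raw/webrequest",
    },
}
print(dag_params_for(example))
```

Generating DAGs this way keeps the orchestration schedule in the dataset config store rather than hard-coded in Airflow, which is the point of the migration.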
Part of the work was spiked in T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?. The following work is required:
- Create an Airflow operator to launch the Gobblin MapReduce job via skein.
- Refactor Gobblin's config and align it with T365005: Evaluate ESC and explore an alternative design. For example, remove potentially unnecessary filter conditions (and the matching consumer blocks in ESC).
- Provide a schema for Gobblin config. As a starting point, consider the outcome of the spike work in https://phabricator.wikimedia.org/T361017#9754921.
- Move Gobblin pull configs from refinery to config store value files.
- Remove the systemd timer config from Puppet.
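For the operator work above, one minimal approach is to have the Airflow task assemble the Gobblin launch command from the pull file and hand it to a skein application. The helper below only builds the command; the script name `gobblin-mapreduce.sh` and its `--conf`/`--sysconfig` flags are assumptions to be verified against the deployed Gobblin version:

```python
# Sketch: build the launch command a skein-backed Airflow task would run.
# Script name and CLI flags are illustrative; check the deployed Gobblin CLI.
import shlex
from typing import Optional

def gobblin_launch_cmd(pull_file: str, sysconfig: Optional[str] = None) -> list[str]:
    """Assemble the Gobblin MapReduce launch command for a given pull file."""
    cmd = ["gobblin-mapreduce.sh", "--conf", pull_file]
    if sysconfig:
        # Optional system-wide Gobblin properties file (hypothetical usage).
        cmd += ["--sysconfig", sysconfig]
    return cmd

# The resulting string would become the command of a skein Service in the
# operator's ApplicationSpec (path below is an invented example).
print(shlex.join(gobblin_launch_cmd("/etc/gobblin/webrequest.pull")))
```

Keeping command assembly in a small pure function like this makes the operator easy to unit-test without a YARN cluster.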
Gobblin is a critical piece of infrastructure, and an API boundary between Event Platform and the Data Lake. This phab will result in a number of design decisions.
We should:
- Provide a design document that illustrates how this system works and how it will be integrated with the refinery refactoring effort.
- Provide a decision record documenting the "whys".
References