
Implement Cassandra Data Loader in Airflow
Closed, Resolved (Public)

Description

User Story
As a pipeline developer, I need the capability to load generated datasets to Cassandra so that data from the latest run can be persisted.

To load data into Cassandra, client teams will need to persist data to Hive/Parquet and schedule the updated HiveToCassandra Spark job authored by data analytics.

The bulk of this story should be documenting the process and providing config boilerplate to help integration,
so that we have a generic guideline for this use case.

Success Criteria
  • A data pipeline processes and outputs data to Cassandra tables
  • Boilerplate has been added to our DAG template to configure a Java/Scala Spark job
  • Documentation and design doc have been updated

[] Nice to have: Can the loading component be reusable for other data pipelines in Airflow? (Depends on approach - TBD.) We agreed to use a BashOperator for now.
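
As a rough illustration of the BashOperator approach, the sketch below builds a spark-submit command for a Java/Scala job and hands it to a BashOperator. It is not the actual boilerplate from the DAG template. The jar path, main class, job arguments, table names, Cassandra host, DAG id, and task id are all hypothetical placeholders, and the DAG part assumes an Airflow 2.x installation.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit_command(
    jar_path: str,
    main_class: str,
    app_args: List[str],
    spark_conf: Optional[Dict[str, str]] = None,
    master: str = "yarn",
) -> str:
    """Return a shell-safe spark-submit command for a Java/Scala Spark job."""
    cmd = ["spark-submit", "--master", master, "--class", main_class]
    # Sort the conf keys so the generated command is deterministic.
    for key, value in sorted((spark_conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(jar_path)
    cmd += app_args
    # shlex.join quotes every token, so spaces and shell metacharacters are safe.
    return shlex.join(cmd)


# All values below are placeholders, not the real job configuration.
EXAMPLE_COMMAND = build_spark_submit_command(
    jar_path="/path/to/hive-to-cassandra.jar",
    main_class="org.example.HiveToCassandra",
    app_args=[
        "--hive-table", "analytics.example_dataset",
        "--cassandra-table", "example_keyspace.example_table",
    ],
    spark_conf={"spark.cassandra.connection.host": "cassandra.example.org"},
)


def make_dag():
    """Build a one-task DAG. Airflow is imported lazily so the helper above
    can be used and tested without Airflow installed."""
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_hive_to_cassandra",  # hypothetical DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,  # Airflow 2.x argument; newer releases use `schedule`
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="load_to_cassandra",  # hypothetical task id
            bash_command=EXAMPLE_COMMAND,
        )
    return dag
```

Keeping the command builder separate from the DAG definition means the same helper could be reused by other pipelines, which is one possible answer to the reusability question above.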

Notes:

Event Timeline

gmodena added a subscriber: mfossati.
gmodena added a subscriber: Cparle.

Resolving: this was a POC to prove that data can be loaded into Cassandra. This work has since been built on in the image suggestions data pipeline job.