User Story
As a pipeline developer, I need the capability to load generated datasets into Cassandra so that data from the latest run can be persisted.
To load data into Cassandra, client teams will need to persist data to Hive/Parquet and schedule the updated HiveToCassandra Spark job authored by Data Analytics.
The bulk of this story should be documenting the process and providing config boilerplate to ease integration, so that we have a generic guideline for this use case.
Success Criteria
- A data pipeline processes and outputs data to Cassandra tables
- Boilerplate has been added to our DAG template to configure a Java/Scala Spark job
- Documentation and design doc have been updated
- [ ] Nice to have: can the loading component be re-used by other data pipelines in Airflow? (Depends on approach; we agreed to use a BashOperator for now.)
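
As a starting point for the DAG template boilerplate, the BashOperator approach could look roughly like the sketch below, which wraps a `spark-submit` invocation of the HiveToCassandra job. All specifics here are illustrative assumptions, not agreed values: the DAG id, schedule, jar path, main class, and job arguments would come from the actual job and the designs referenced in the notes.

```python
# Sketch of DAG boilerplate for scheduling the HiveToCassandra Spark job
# via a BashOperator, per the approach agreed above.
# NOTE: DAG id, jar path, class name, and job arguments are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_hive_to_cassandra",  # placeholder DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_to_cassandra = BashOperator(
        task_id="hive_to_cassandra",
        bash_command=(
            "spark-submit "
            "--class org.example.HiveToCassandra "  # placeholder main class
            "--master yarn "
            "/path/to/hive-to-cassandra.jar "  # placeholder jar path
            # Placeholder job arguments; the real job's CLI may differ.
            "--source-table example_db.example_table "
            "--target-keyspace example_keyspace "
            "--target-table example_table"
        ),
    )
```

Using a BashOperator keeps the Spark job decoupled from Airflow: the DAG only shells out to `spark-submit`, so the same jar can be run manually or from other schedulers without changes.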
Notes:
- Depends on approach in: https://phabricator.wikimedia.org/T295483
- When ready, this will need the draft schema from https://phabricator.wikimedia.org/T295405 implemented, based on the designs in https://phabricator.wikimedia.org/T293808