Maniphest T296758

Implement Cassandra Data Loader in Airflow
Closed, Resolved, Public
Assigned To

Authored By

	lbowmaker
	Nov 30 2021, 4:38 PM

Tags

Referenced Files

None

Subscribers

Description

User Story

As a pipeline developer, I need the capability to load generated datasets into Cassandra so that data from the latest run is persisted.

To load data into Cassandra, client teams will need to persist data to Hive/Parquet and schedule the updated HiveToCassandra Spark job authored by data analytics.

The bulk of this story should be documenting the process and providing config boilerplate to help integration,
so that we have a generic guideline for this use case.

Success Criteria

A data pipeline processes and outputs data to Cassandra tables
Boilerplate has been added to our DAG template to configure a Java/Scala Spark job
Documentation and design doc have been updated

~~[] Nice to have: Can loading component be re-usable for other data pipelines in Airflow? (depends on approach - TBD)~~ We agreed to use a BashOperator for now.
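
As a rough illustration of the BashOperator approach, the sketch below builds a spark-submit command for a Java/Scala job and wraps it in an Airflow task. It is not the actual DAG template: the jar path, main class, job arguments, DAG id and task id are all hypothetical placeholders, and it assumes Airflow 2.x with spark-submit available on the worker.

```python
import shlex
from datetime import datetime

# Hypothetical job settings: none of these values come from the real
# HiveToCassandra job; replace them with the ones your pipeline uses.
SPARK_JOB = {
    "jar": "/path/to/hive-to-cassandra-job.jar",
    "main_class": "org.example.HiveToCassandra",
    "master": "yarn",
    "conf": {"spark.executor.memory": "2g"},
    # Application arguments are job-specific; these names are made up.
    "job_args": {"hive-table": "my_db.my_table", "cassandra-table": "my_table"},
}


def build_spark_submit(jar, main_class, master="yarn", conf=None, job_args=None):
    """Return a shell-quoted spark-submit command line as a single string."""
    cmd = ["spark-submit", "--master", master, "--class", main_class]
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(jar)
    # Everything after the jar is passed to the application itself.
    for key, value in (job_args or {}).items():
        cmd += [f"--{key}", str(value)]
    return shlex.join(cmd)


def make_dag():
    """Build an example DAG with one BashOperator task (assumes Airflow 2.x)."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="cassandra_loader_example",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        catchup=False,
    ) as dag:
        BashOperator(
            task_id="hive_to_cassandra",  # hypothetical task id
            bash_command=build_spark_submit(**SPARK_JOB),
        )
    return dag
```

Keeping the command construction in a plain function means the generated spark-submit line can be checked without an Airflow installation, and only `make_dag()` needs the scheduler environment.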

Notes:

Depends on approach in: https://phabricator.wikimedia.org/T295483
When ready, this will need an implementation of the draft schema: https://phabricator.wikimedia.org/T295405 based on the designs in: https://phabricator.wikimedia.org/T293808

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		lbowmaker	T293807 Data Persistence for Image Suggestions
		Resolved		gmodena	T296758 Implement Cassandra Data Loader in Airflow

Event Timeline

lbowmaker created this task. Nov 30 2021, 4:38 PM

lbowmaker moved this task from Product Roadmap to Backlog on the Generated Data Platform board. Feb 1 2022, 8:17 PM

lbowmaker edited projects, added Generated Data Platform; removed Generated Data Platform (Product Roadmap).

gmodena moved this task from Backlog to Investigate 🔍 on the Generated Data Platform board. Mar 15 2022, 2:16 PM

gmodena mentioned this in T281517: 📊[PLACEHOLDER] We should implement a data loader for Cassandra. Mar 16 2022, 11:56 AM

gmodena claimed this task. Mar 16 2022, 12:08 PM

gmodena updated the task description.

gmodena updated the task description. Mar 16 2022, 12:22 PM

gmodena added a subscriber: mfossati.

gmodena added a subscriber: Cparle.

gmodena moved this task from Investigate 🔍 to Ready/Groomed 📚 on the Generated Data Platform board. Mar 16 2022, 3:59 PM

gmodena moved this task from Ready/Groomed 📚 to Work in Progress ⚙️ on the Generated Data Platform board.

gmodena updated the task description. Mar 17 2022, 9:28 AM

gmodena updated the task description. Mar 17 2022, 9:30 AM

Work in progress at:

gmodena moved this task from Work in Progress ⚙️ to QA/Review ❓ on the Generated Data Platform board. Mar 28 2022, 7:01 PM

lbowmaker moved this task from QA/Review ❓ to Done 🎊 on the Generated Data Platform board. Jun 9 2022, 11:11 AM

Resolving: this was a POC to prove that data can be loaded into Cassandra. The work has since been built on in the image suggestions data pipeline job.

lbowmaker closed this task as Resolved. Jun 9 2022, 1:04 PM