
[SPIKE] Investigate and Decide on solution for Airflow > Cassandra for Image Suggestions Dataset
Closed, Resolved · Public · Spike

Description

User Story
As a platform engineer I need to investigate and decide on a process for loading the Image Suggestions dataset to Cassandra
Success Criteria
  • A decision on a solution for loading the Image Suggestions dataset into Cassandra, so that Structured Data can work on a subsequent ticket to load the data from their data pipeline.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Nov 10 2021, 3:58 PM

Subscribed DE - please add any comments/questions/suggestions.

It took some time and several discussions to get there!
Here is the idea we agreed on: we want to aim for a separation of "job execution logic" from "scheduling logic". This means we don't want Airflow to interact with the execution logic beyond passing parameters, and the execution logic should live in a separate repository from the Airflow DAGs.
While we have ideas for facilitating this approach even further, the fastest way to get the Cassandra-loading Airflow operator released is to make it a bash operator that launches a Spark job with a set of parameters (see the sketch after the list below). This implies updating the Spark Cassandra loading job so that:

  • It can read the HQL query defining the data to load from a file stored on HDFS, passed as a parameter.
  • It applies Hive-style variable substitution to the HQL query before running it, with the variables passed as parameters.
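
As a rough illustration of the intended split, here is a minimal sketch of the Airflow side, where the DAG does nothing but hand parameters to spark-submit via a BashOperator. The job artifact, HDFS path, and flag names are hypothetical placeholders, not the actual interface:

```python
# Sketch only: Airflow passes parameters; all execution logic lives in the Spark job.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="image_suggestions_cassandra_load",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    load = BashOperator(
        task_id="load_to_cassandra",
        bash_command=(
            "spark-submit --class org.example.CassandraLoader loader.jar "
            # HQL file on HDFS defining the data to load (hypothetical path/flag):
            "--hql hdfs:///path/to/image_suggestions.hql "
            # Hive-style variables substituted into the query before it runs:
            "--hivevar snapshot={{ ds }} --hivevar keyspace=image_suggestions"
        ),
    )
```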

More details about the change needed to the spark job will be documented in T281517.
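
As a sketch of what the variable-substitution change could look like inside the job (the token syntax follows Hive's ${hivevar:name} convention; the function and query here are illustrative, not the actual code):

```python
import re

def substitute_hivevars(hql: str, variables: dict) -> str:
    """Replace ${hivevar:name} tokens in an HQL query (illustrative sketch)."""
    return re.sub(
        r"\$\{hivevar:(\w+)\}",
        lambda m: str(variables[m.group(1)]),
        hql,
    )

hql = "SELECT * FROM wmf.image_suggestions WHERE snapshot = '${hivevar:snapshot}'"
print(substitute_hivevars(hql, {"snapshot": "2022-03"}))
# SELECT * FROM wmf.image_suggestions WHERE snapshot = '2022-03'
```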

Notes from meeting with Joseph, Cormac, Gabriele on March 7th:

  • A solution to load data into Cassandra via a Spark job exists and is ready to use/test
  • There could be an issue with loading sets - untested, but it should work
  • Setting a TTL on load is not currently possible; we need to look at other ways to do that (set it at the table level?)
  • The short-term approach is to load Image Suggestions into Cassandra this way; longer term, create something more generic/reusable that is more tightly integrated with Airflow
  • The timeuuid in the Image Suggestions schema requires an extra step in the Spark job: calling a Python Cassandra package utility to generate it and store it as the correct type

Next steps:

  • See if we can deploy the Image Suggestions schema (as is, or as a test version) and allow the Structured Data team to test the load functionality

lbowmaker renamed this task from [SPIKE] Investigate and Decide on solution for Airflow > Cassandra for Monthly Image Recs Dataset to [SPIKE] Investigate and Decide on solution for Airflow > Cassandra for Image Suggestions Dataset. · Mar 7 2022, 2:08 PM
lbowmaker updated the task description.

> Notes from meeting with Joseph, Cormac, Gabriele on March 7th:
>
>   • A solution to load data into Cassandra via a Spark job exists and is ready to use/test
>   • There could be an issue with loading sets - untested, but it should work
>   • Setting a TTL on load is not currently possible; we need to look at other ways to do that (set it at the table level?)

It is possible to set a default for the table.
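For example, a minimal sketch via the Python driver (keyspace and table names are hypothetical; default_time_to_live is the standard CQL table option, in seconds):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host.example"])  # hypothetical contact point
session = cluster.connect()
# Rows written without an explicit TTL will expire after 30 days.
session.execute(
    "ALTER TABLE image_suggestions.suggestions "
    "WITH default_time_to_live = 2592000"
)
```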

>   • The short-term approach is to load Image Suggestions into Cassandra this way; longer term, create something more generic/reusable that is more tightly integrated with Airflow
>   • The timeuuid in the Image Suggestions schema requires an extra step in the Spark job: calling a Python Cassandra package utility to generate it and store it as the correct type

I assumed there would be some way to pass in parameters (and since we'd want to use the exact same UUID for every record belonging to an import, that would be ideal). If so, there are lots of ways to create a type 1 UUID for the current date & time (e.g. from a shell via uuid -v1).
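
For reference, a minimal Python sketch of generating one type 1 UUID up front and reusing it for the whole import; uuid is from the standard library, and uuid_from_time comes from the DataStax cassandra-driver package:

```python
import uuid
from datetime import datetime, timezone

from cassandra.util import uuid_from_time  # DataStax cassandra-driver

# Generate one type 1 (time-based) UUID for the import and reuse it for
# every record, so all rows in the batch share the same timeuuid value.
import_id = uuid.uuid1()  # stdlib: current host time + MAC

# Or, pinned to an explicit timestamp via the driver helper:
import_id = uuid_from_time(datetime.now(timezone.utc))
```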

lbowmaker updated the task description.