
[SPIKE] Investigate and Decide on solution for Airflow > Cassandra for Image Suggestions Dataset
Closed, Resolved · Public · Spike

Description

User Story
As a platform engineer I need to investigate and decide on a process for loading the Image Suggestions dataset to Cassandra
Success Criteria
  • A decision on a solution for loading the Image Suggestions dataset into Cassandra, so that Structured Data can work on a subsequent ticket to load the data from their data pipeline.

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". · Nov 10 2021, 3:58 PM

Subscribed DE - please add any comments/questions/suggestions.

It took some time and several discussions to get there!
Here is the idea we agreed on: we want to aim for a separation of "job execution logic" from "scheduling logic". This means we don't want Airflow to interact with the execution logic beyond passing parameters, and the execution logic should live in a separate repository from the Airflow DAGs.
While we have ideas for facilitating this approach even further, the fastest way to get the Cassandra-loading Airflow operator released is to make it a bash operator that launches a Spark job with a set of parameters (see the sketch after the list below). This implies updating the Spark Cassandra loading job so that:

  • It can read the HQL query defining the data to load from a file stored on HDFS, passed as a parameter.
  • It applies Hive-style variable substitution to the HQL query before running it, with the variables passed as parameters.
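
As a rough illustration of the intended split, here is a minimal sketch of the Airflow side, where the DAG does nothing but hand parameters to spark-submit via a BashOperator. The job artifact, HDFS path, and flag names are hypothetical placeholders, not the actual interface:

```python
# Sketch only: Airflow passes parameters; all execution logic lives in the Spark job.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="image_suggestions_cassandra_load",
    start_date=datetime(2022, 3, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    load = BashOperator(
        task_id="load_to_cassandra",
        bash_command=(
            "spark-submit --class org.example.CassandraLoader loader.jar "
            # HQL file on HDFS defining the data to load (hypothetical path/flag):
            "--hql hdfs:///path/to/image_suggestions.hql "
            # Hive-style variables substituted into the query before it runs:
            "--hivevar snapshot={{ ds }} --hivevar keyspace=image_suggestions"
        ),
    )
```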

More details about the change needed to the spark job will be documented in T281517.
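
As a sketch of what the variable-substitution change could look like inside the job (the token syntax follows Hive's ${hivevar:name} convention; the function and query here are illustrative, not the actual code):

```python
import re

def substitute_hivevars(hql: str, variables: dict) -> str:
    """Replace ${hivevar:name} tokens in an HQL query (illustrative sketch)."""
    return re.sub(
        r"\$\{hivevar:(\w+)\}",
        lambda m: str(variables[m.group(1)]),
        hql,
    )

hql = "SELECT * FROM wmf.image_suggestions WHERE snapshot = '${hivevar:snapshot}'"
print(substitute_hivevars(hql, {"snapshot": "2022-03"}))
# SELECT * FROM wmf.image_suggestions WHERE snapshot = '2022-03'
```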

Notes from meeting with Joseph, Cormac, Gabriele on March 7th:

  • A solution to load data into Cassandra via a Spark job exists and is ready to use/test
  • There could be an issue with loading sets - untested, but it should work
  • Setting a TTL on load is not currently possible; we need to look at other ways to do that (set it at the table level?)
  • The short-term approach is to load Image Suggestions into Cassandra this way; longer term, create something more generic/reusable that is more tightly integrated with Airflow
  • The timeuuid in the Image Suggestions schema requires an extra step in the Spark job: calling a Python Cassandra package utility to generate it and store it as the correct type

Next steps:

  • See if we can deploy the Image Suggestions schema (as is, or as a test version) and allow the Structured Data team to test the load functionality

lbowmaker renamed this task from [SPIKE] Investigate and Decide on solution for Airflow > Cassandra for Monthly Image Recs Dataset to [SPIKE] Investigate and Decide on solution for Airflow > Cassandra for Image Suggestions Dataset. · Mar 7 2022, 2:08 PM
lbowmaker updated the task description.

> Notes from meeting with Joseph, Cormac, Gabriele on March 7th:
>
>   • A solution to load data into Cassandra via a Spark job exists and is ready to use/test
>   • There could be an issue with loading sets - untested, but it should work
>   • Setting a TTL on load is not currently possible; we need to look at other ways to do that (set it at the table level?)

It is possible to set a default for the table.
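For example, a minimal sketch via the Python driver (keyspace and table names are hypothetical; default_time_to_live is the standard CQL table option, in seconds):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host.example"])  # hypothetical contact point
session = cluster.connect()
# Rows written without an explicit TTL will expire after 30 days.
session.execute(
    "ALTER TABLE image_suggestions.suggestions "
    "WITH default_time_to_live = 2592000"
)
```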

>   • The short-term approach is to load Image Suggestions into Cassandra this way; longer term, create something more generic/reusable that is more tightly integrated with Airflow
>   • The timeuuid in the Image Suggestions schema requires an extra step in the Spark job: calling a Python Cassandra package utility to generate it and store it as the correct type

I assumed there would be some way to pass in parameters (and since we'd want to use the exact same UUID for every record belonging to an import, that would be ideal). If so, there are lots of ways to create a type 1 UUID for the current date & time (e.g. from a shell via uuid -v1).
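
For reference, a minimal Python sketch of generating one type 1 UUID up front and reusing it for the whole import; uuid is from the standard library, and uuid_from_time comes from the DataStax cassandra-driver package:

```python
import uuid
from datetime import datetime, timezone

from cassandra.util import uuid_from_time  # DataStax cassandra-driver

# Generate one type 1 (time-based) UUID for the import and reuse it for
# every record, so all rows in the batch share the same timeuuid value.
import_id = uuid.uuid1()  # stdlib: current host time + MAC

# Or, pinned to an explicit timestamp via the driver helper:
import_id = uuid_from_time(datetime.now(timezone.utc))
```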

lbowmaker updated the task description.