Once the base pipeline (T358699) and the API and Cassandra/Druid datasource design (T358679) are ready,
we should create an Airflow DAG that formats the monthly data and loads it to Cassandra.
This work includes writing SparkSQL queries to format the data into the desired shape.
The DAG should execute those queries and load the results to Cassandra.
The DAG should also run monthly, right after the base pipeline finishes (a sensor should wait until the base datasets for the month in question are present).
This task will also require coordinating with Eric Evans to create the tables in Cassandra.
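As a rough illustration of the formatting step, a monthly SparkSQL query could be rendered from a template like the one below. The table and column names here are hypothetical placeholders, not the actual Commons Impact Metrics schema:

```python
def format_query(year: int, month: int) -> str:
    """Render the monthly SparkSQL statement that reshapes the base
    dataset into the layout expected by the Cassandra loader.
    NOTE: table and column names are illustrative placeholders."""
    return f"""
        SELECT category,
               media_file,
               SUM(view_count) AS views
        FROM base_commons_impact_metrics  -- placeholder table name
        WHERE year = {year}
          AND month = {month}
        GROUP BY category, media_file
    """
```

The DAG would render one such statement per dataset and submit it through its Spark job, writing the result to the corresponding Cassandra table.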
Tasks:
- Coordinate with Eric Evans to create the tables in Cassandra.
- Write the queries that format the base Commons Impact Metrics datasets into the expected shape.
- Write the Airflow DAG that waits for the base data to be present, executes the queries and loads the data to Cassandra.
- Test in Airflow's dev instance.
- Vet the data in Cassandra.
- Code-review and deploy.
Definition of done:
- The queries work properly and are in the corresponding repo (probably refinery?).
- The DAG is in production and running.
- The data is accessible in Cassandra as expected.