- [ ] Use Airflow, with a DAG:
  - [ ] allow for both: 1. run WDTK → create CSVs → make aggregations pre-DB → load data into DB → make aggregations post-DB (a minimal DAG sketch is below)
  - [ ] allow for parallelized transformations (replace the SQL permutations loop; see the dynamic-mapping sketch below)
  - [ ] see if templated tasks would work (see the templating sketch below)
- [x] see if the Wikimedia Cloud already has a Spark cluster
- [ ] try to execute a dummy Spark job (a minimal PySpark sketch is below)
- [ ] profile each node in the DAG, to time sections (see the timing-callback sketch below)
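A minimal sketch of the linear pipeline as an Airflow DAG, assuming Airflow 2.4+ with the TaskFlow API; the DAG name and the placeholder callables (`run_wdtk`, `create_csvs`, etc.) are hypothetical stand-ins for the real steps:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def wdtk_pipeline():
    @task
    def run_wdtk():
        ...  # hypothetical: call WDTK and fetch raw data

    @task
    def create_csvs():
        ...  # hypothetical: write the raw output to CSV files

    @task
    def aggregate_pre_db():
        ...  # hypothetical: pre-DB aggregations over the CSVs

    @task
    def load_db():
        ...  # hypothetical: bulk-load the CSVs into the database

    @task
    def aggregate_post_db():
        ...  # hypothetical: post-load aggregations in SQL

    # Linear dependency chain matching the pipeline above.
    run_wdtk() >> create_csvs() >> aggregate_pre_db() >> load_db() >> aggregate_post_db()

wdtk_pipeline()
```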
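A sketch of replacing the serial SQL-permutations loop with Airflow dynamic task mapping (available from 2.3), so each permutation runs as its own parallel task instance; the `PERMUTATIONS` list and the body of `run_transformation` are hypothetical:

```python
from datetime import datetime
from airflow.decorators import dag, task

# Hypothetical permutation names; the real list would come from the
# existing SQL loop.
PERMUTATIONS = ["by_authority", "by_year", "by_status"]

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def parallel_transformations():
    @task
    def run_transformation(permutation: str):
        ...  # hypothetical: execute the SQL for one permutation

    # Dynamic task mapping fans out one task instance per permutation,
    # letting the scheduler run them in parallel instead of in a loop.
    run_transformation.expand(permutation=PERMUTATIONS)

parallel_transformations()
```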
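Templated tasks look feasible via Airflow's built-in Jinja templating: fields like `sql` are rendered at runtime with macros such as `{{ ds }}` and per-task `params`. A sketch assuming the common-SQL provider is installed; the connection id and table names are hypothetical:

```python
from datetime import datetime
from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="templated_aggregation",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # The sql field is Jinja-templated: {{ ds }} is the run's logical
    # date; {{ params.target_table }} comes from the params dict below.
    aggregate = SQLExecuteQueryOperator(
        task_id="aggregate_requests",
        conn_id="my_db",  # hypothetical connection id
        sql="""
            INSERT INTO {{ params.target_table }}
            SELECT authority, count(*)
            FROM requests                      -- hypothetical table
            WHERE request_date <= '{{ ds }}'
            GROUP BY authority;
        """,
        params={"target_table": "agg_by_authority"},
    )
```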
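A minimal dummy PySpark job for smoke-testing whatever cluster turns out to be available: it only counts a parallelized range, but that is enough to confirm the cluster accepts and runs work. The master URL and submission mechanics depend on the actual cluster (an assumption to verify); from Airflow this could be wrapped in the Spark provider's `SparkSubmitOperator`:

```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("dummy-smoke-test").getOrCreate()
    # Trivial distributed computation that still exercises the executors.
    count = spark.sparkContext.parallelize(range(1_000_000)).count()
    print(f"counted {count} elements")
    spark.stop()
```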
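Airflow already records per-task durations in its metadata DB (visible in the UI's Gantt view), but here is a sketch of logging them explicitly via an `on_success_callback` in `default_args`, so each DAG node's runtime also lands in the task logs; the DAG and task names are hypothetical:

```python
import logging
from datetime import datetime
from airflow.decorators import dag, task

log = logging.getLogger(__name__)

def log_duration(context):
    # ti.duration is the task's wall-clock runtime in seconds, set by
    # Airflow once the task finishes (the `or 0.0` guards against None).
    ti = context["task_instance"]
    log.info("task %s took %.2fs", ti.task_id, ti.duration or 0.0)

@dag(
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"on_success_callback": log_duration},
)
def profiled_pipeline():
    @task
    def example_step():
        ...  # any pipeline node; the callback times each one

    example_step()

profiled_pipeline()
```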