Problem Statement
Currently, only analysts who are comfortable with data engineering practices have any ability to schedule jobs, and even that ability is limited. We need a system that automates job scheduling and is accessible to analysts without a data engineering background.
Spike Outcomes:
- Determine how we can schedule notebooks in Airflow:
  - Write a simple NotebookOperator
  - Build a one-off Conda env that runs Jupyter notebooks and papermill
  - Write a test DAG that runs a notebook in Airflow
  - Verify that the notebook runs successfully end to end
- Investigate the PAs' jobs/notebooks intended to be scheduled:
  - What data will the notebooks need to access?
  - Which engines (Hive, Spark, R, others?) will the notebooks need to run?
  - What types of output do the notebooks produce (Hive tables, reports, dashboards?)
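The one-off Conda environment could start from an `environment.yml` along these lines. The environment name and the Python pin are illustrative assumptions, not a tested spec; engine-specific dependencies (e.g. Spark or Hive clients) would be added once the investigation above settles which engines are needed:

```yaml
# Hypothetical environment spec for the notebook-execution env.
# Only the core pieces named in the spike: Jupyter + papermill.
name: notebook-scheduler
channels:
  - conda-forge
dependencies:
  - python=3.10
  - jupyter
  - papermill
```

Created with the standard `conda env create -f environment.yml`.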
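As a sketch of what a NotebookOperator would wrap: papermill's CLI takes an input notebook, an output path for the executed copy, and `-p key value` pairs that are injected into the notebook's parameters cell. The helper below builds that invocation (`build_papermill_command` and the notebook paths are hypothetical names for illustration, not Airflow or papermill API); inside a custom operator's `execute()`, or a plain BashOperator in the test DAG, this is the command the task would run:

```python
from typing import Dict, List


def build_papermill_command(input_nb: str, output_nb: str,
                            parameters: Dict[str, object]) -> List[str]:
    """Build the papermill CLI invocation a NotebookOperator would run.

    papermill executes input_nb, injecting each -p key/value pair into
    the notebook's parameters cell, and writes the executed copy to
    output_nb (so each scheduled run leaves an inspectable artifact).
    """
    cmd = ["papermill", input_nb, output_nb]
    for key, value in parameters.items():
        cmd += ["-p", key, str(value)]
    return cmd


# Example: the command a DAG task might run for a daily report notebook
# (paths and parameter names are made up for illustration).
cmd = build_papermill_command(
    "reports/daily_metrics.ipynb",
    "output/daily_metrics_2024-01-01.ipynb",
    {"run_date": "2024-01-01"},
)
```

Writing the executed notebook to a dated output path, rather than executing in place, is what makes papermill a good fit for scheduled runs: failures can be debugged by opening the partially executed output copy.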