==== Problem Statement
Right now, only analysts who are comfortable with data engineering practices have even a limited ability to schedule jobs. We need a system that automates job scheduling and is appropriately accessible to analysts.
==== Spike Outcomes:
[] How can we schedule notebooks in Airflow?
[x] Write a simple NotebookOperator
[x] Build a one-off Conda env that runs Jupyter notebooks and papermill
[x] Write a test DAG that runs a notebook in Airflow
[] Verify that the test DAG runs end to end
[x] Investigate Product Analytics' jobs/notebooks that are intended to be scheduled
[x] What data will the notebooks need to access?
[x] Which engines (Hive, Spark, R, others?) will the notebooks need to run?
[x] What types of outputs do the notebooks produce (Hive, reports, dashboards?)
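To make the NotebookOperator and test-DAG items above concrete, here is a minimal sketch of what they might look like, assuming papermill for notebook execution and Airflow-style operators. Class, task, and path names are illustrative, not a finished implementation; the `try/except` fallback only exists so the sketch can be read and exercised outside an Airflow install.

```python
try:
    from airflow.models import BaseOperator
except ImportError:
    # Fallback so this sketch is importable without Airflow installed.
    BaseOperator = object


class NotebookOperator(BaseOperator):
    """Execute a Jupyter notebook via papermill, saving an executed copy."""

    # Let Airflow template paths and parameters (e.g. with {{ ds }}).
    template_fields = ("input_nb", "output_nb", "parameters")

    def __init__(self, input_nb, output_nb, parameters=None, **kwargs):
        if BaseOperator is not object:
            super().__init__(**kwargs)
        self.input_nb = input_nb
        self.output_nb = output_nb
        self.parameters = parameters or {}

    def execute(self, context):
        # Imported here so the DAG file still parses when papermill lives
        # only in the notebooks Conda env.
        import papermill as pm

        pm.execute_notebook(
            self.input_nb,
            self.output_nb,
            parameters=self.parameters,
        )


# Test DAG sketch (hypothetical paths/schedule), shown as comments:
#
# with DAG("notebook_test", schedule_interval="@daily", ...) as dag:
#     run_nb = NotebookOperator(
#         task_id="run_report",
#         input_nb="/srv/notebooks/report.ipynb",
#         output_nb="/srv/notebooks/out/report-{{ ds }}.ipynb",
#         parameters={"run_date": "{{ ds }}"},
#     )
```

The executed output notebook doubles as a run log, which is one of the main reasons papermill is attractive for analyst-facing scheduling.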
==== Maybe part of this spike? Or maybe it should be a separate task.
[] Ownership Map -> What we will own vs What others will own
[] Show the idea (a NotebookOperator in an Airflow DAG, using a single Conda env automatically packaged by CI) to Product Analytics and ask whether they would like it.
[] Discuss who would take care of writing DAGs, testing DAGs, reviewing DAG code, merging, deploying Airflow, receiving alerts, troubleshooting failed DAGs, updating the notebooks conda env with new libraries, etc.
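For reference, the single Conda env mentioned above (Jupyter plus papermill) could be specified roughly as follows; the env name, channel, and version pins are illustrative placeholders, not the actual spike env:

```yaml
# environment.yml -- sketch of the notebooks Conda env (names/pins illustrative)
name: notebooks
channels:
  - conda-forge
dependencies:
  - python=3.7
  - jupyter
  - papermill
  - pip
```

Updating this file (and re-running the CI packaging step) would be the expected path for adding new libraries, which is one of the ownership questions listed above.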