
Spike: Product Analytics ETL options - Timebox 1 Sprint.
Closed, ResolvedPublic9 Estimated Story Points

Description

Problem Statement
Right now, only analysts who are comfortable with data engineering practices have a limited ability to schedule jobs. We need a system for the automated scheduling of jobs that is appropriately accessible to analysts.
Spike Outcomes:
  • How can we schedule notebooks in Airflow?
    • Write a simple NotebookOperator (see the sketch after this list)
    • Build a one-off conda env that runs Jupyter notebooks and papermill
    • Write a test DAG that runs a notebook in Airflow
    • Test that it works
  • Investigate the PA jobs/notebooks intended to be scheduled
    • What data will the notebooks need to access?
    • Which engines (Hive, Spark, R, others?) will the notebooks need to run?
    • What types of outputs will the notebooks produce (Hive tables, reports, dashboards?)
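
For reference, here is a minimal sketch of what a papermill-based NotebookOperator could look like. The class name, its parameters, and the notebook paths are illustrative assumptions for the spike, not a finished design:

# Hypothetical sketch of a simple NotebookOperator wrapping papermill.
# Assumes papermill is importable in the Airflow worker's environment.
import papermill as pm
from airflow.models.baseoperator import BaseOperator


class NotebookOperator(BaseOperator):
    """Execute a Jupyter notebook with papermill and save the executed copy."""

    def __init__(self, *, input_nb, output_nb, parameters=None, **kwargs):
        super().__init__(**kwargs)
        self.input_nb = input_nb
        self.output_nb = output_nb
        self.parameters = parameters or {}

    def execute(self, context):
        # papermill injects `parameters` into the notebook's parameters cell,
        # runs the notebook cell by cell, and writes the executed copy to output_nb.
        pm.execute_notebook(self.input_nb, self.output_nb, parameters=self.parameters)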

Event Timeline

EChetty updated the task description.
EChetty set the point value for this task to 5.
EChetty renamed this task from Spike: Notebook Schedular options. to Spike: Notebook Schedular options - Timebox 1 Sprint..Nov 8 2022, 1:17 PM
EChetty changed the point value for this task from 5 to 9.
EChetty moved this task from Ready to In Progress on the Data Pipelines (Sprint 04) board.
mforns renamed this task from Spike: Notebook Schedular options - Timebox 1 Sprint. to Spike: Product Analytics ETL options - Timebox 1 Sprint..Nov 8 2022, 4:21 PM

I believe this spike work is finished.


We have implemented a simple P.O.C. that confirms we can run Jupyter notebooks in Airflow.
Here's the GitLab merge request for the P.O.C.; we can use it as a reference if/when we implement this feature.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/196
I created a task to potentially productionize it: T325185.
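
As a rough illustration of the approach (not the contents of the merge request above), a test DAG could run a notebook through papermill inside the packaged conda environment using a plain BashOperator; the DAG id and all paths below are hypothetical:

# Hypothetical test DAG; dag_id and paths are placeholders, not the P.O.C. code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="notebook_poc_test",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Activate the unpacked conda env and execute the notebook with papermill.
    run_notebook = BashOperator(
        task_id="run_notebook",
        bash_command=(
            "source /path/to/notebook_operator_env/bin/activate && "
            "papermill /path/to/input.ipynb /path/to/output.ipynb"
        ),
    )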

For the test, we also created a packaged conda environment using conda-pack, containing the libraries needed to execute plain Jupyter notebooks with papermill.
This is the code we used on a stats machine:

# Route outbound traffic through the webproxy (needed on stats machines).
export http_proxy="http://webproxy.eqiad.wmnet:8080"
export https_proxy="http://webproxy.eqiad.wmnet:8080"
# Make the conda shipped with Airflow available in this shell.
source /usr/lib/airflow/etc/profile.d/conda.sh
# Create and activate a fresh environment for the notebook operator.
conda create --name notebook_operator_env
conda activate notebook_operator_env
# Install Python plus the notebook execution stack.
conda install python=3.10
conda install jupyter
conda install -c conda-forge papermill
conda install -c conda-forge conda-pack
# Package the active environment into a relocatable archive.
conda-pack
conda deactivate
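
By default, conda-pack writes the active environment to a notebook_operator_env.tar.gz archive in the current directory; on the target host the archive can be extracted, activated via its bin/activate script, and finalized with conda-unpack to fix the hard-coded prefixes.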

After that, I put the packed environment in HDFS under the analytics-privatedata user with the proper permissions.
For this part, I also created a new task to set up a repository for generating the conda environment.
This repository should build PA's packaged conda environments for the execution of Jupyter notebooks via CI.
T325195


We have also looked into the list of notebooks scheduled by the PA team to collect requirements for this feature.
I added a new sheet to the Airflow migration spreadsheet with that information (PA Jobs):
https://docs.google.com/spreadsheets/d/1lfK5Idteh6zPSlCWyH34FJCl_Lcm8401Wm59Jgk-7wM
The conclusions of this analysis are in https://phabricator.wikimedia.org/T322666#8460702


I split part of this task's initial objectives into a separate task, T325181, since we discussed in standup that we'd do that work over the course of the next couple of weeks.