
Spike: Product Analytics ETL options - Timebox 1 Sprint.
Closed, ResolvedPublic9 Estimated Story Points

Description

Problem Statement
Right now, only analysts who are comfortable with data engineering practices have a limited ability to schedule jobs. We need a system for the automated scheduling of jobs that is appropriately accessible to analysts.
Spike Outcomes:
  • How can we schedule notebooks in Airflow?
    • Write a simple NotebookOperator (see the sketch after this list)
    • Build a one-off conda env that runs Jupyter notebooks and papermill
    • Write a test DAG that runs a notebook in Airflow
    • Test that it works
  • Investigate the PA jobs/notebooks intended to be scheduled
    • What data will the notebooks need to access?
    • Which engines (Hive, Spark, R, others?) will the notebooks need to run?
    • What types of outputs will the notebooks produce (Hive tables, reports, dashboards?)
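
For reference, here is a minimal sketch of what a papermill-based NotebookOperator could look like. The class name, its parameters, and the notebook paths are illustrative assumptions for the spike, not a finished design:

# Hypothetical sketch of a simple NotebookOperator wrapping papermill.
# Assumes papermill is importable in the Airflow worker's environment.
import papermill as pm
from airflow.models.baseoperator import BaseOperator


class NotebookOperator(BaseOperator):
    """Execute a Jupyter notebook with papermill and save the executed copy."""

    def __init__(self, *, input_nb, output_nb, parameters=None, **kwargs):
        super().__init__(**kwargs)
        self.input_nb = input_nb
        self.output_nb = output_nb
        self.parameters = parameters or {}

    def execute(self, context):
        # papermill injects `parameters` into the notebook's parameters cell,
        # runs the notebook cell by cell, and writes the executed copy to output_nb.
        pm.execute_notebook(self.input_nb, self.output_nb, parameters=self.parameters)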

Event Timeline

EChetty updated the task description.
EChetty set the point value for this task to 5.
EChetty renamed this task from Spike: Notebook Schedular options. to Spike: Notebook Schedular options - Timebox 1 Sprint..Nov 8 2022, 1:17 PM
EChetty changed the point value for this task from 5 to 9.
EChetty moved this task from Ready to In Progress on the Data Pipelines (Sprint 04) board.
mforns renamed this task from Spike: Notebook Schedular options - Timebox 1 Sprint. to Spike: Product Analytics ETL options - Timebox 1 Sprint..Nov 8 2022, 4:21 PM

I believe this spike work is finished.


We have implemented a simple P.O.C. that confirms we can run Jupyter notebooks in Airflow.
Here's the GitLab merge request for the P.O.C.; we can use it as a reference if/when we implement this feature.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/196
I created a task to potentially productionize it: T325185.
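
As a rough illustration of the approach (not the contents of the merge request above), a test DAG could run a notebook through papermill inside the packaged conda environment using a plain BashOperator; the DAG id and all paths below are hypothetical:

# Hypothetical test DAG; dag_id and paths are placeholders, not the P.O.C. code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="notebook_poc_test",
    start_date=datetime(2022, 11, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Activate the unpacked conda env and execute the notebook with papermill.
    run_notebook = BashOperator(
        task_id="run_notebook",
        bash_command=(
            "source /path/to/notebook_operator_env/bin/activate && "
            "papermill /path/to/input.ipynb /path/to/output.ipynb"
        ),
    )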

For the test, we also created a packaged conda environment using conda-pack, containing the libraries needed to execute plain Jupyter notebooks with papermill.
This is the code we used on a stats machine:

# Route outbound traffic through the webproxy (needed on stats machines).
export http_proxy="http://webproxy.eqiad.wmnet:8080"
export https_proxy="http://webproxy.eqiad.wmnet:8080"
# Make the conda shipped with Airflow available in this shell.
source /usr/lib/airflow/etc/profile.d/conda.sh
# Create and activate a fresh environment for the notebook operator.
conda create --name notebook_operator_env
conda activate notebook_operator_env
# Install Python plus the notebook execution stack.
conda install python=3.10
conda install jupyter
conda install -c conda-forge papermill
conda install -c conda-forge conda-pack
# Package the active environment into a relocatable archive.
conda-pack
conda deactivate
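
By default, conda-pack writes the active environment to a notebook_operator_env.tar.gz archive in the current directory; on the target host the archive can be extracted, activated via its bin/activate script, and finalized with conda-unpack to fix the hard-coded prefixes.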

After that, I put the packed environment in HDFS under the analytics-privatedata user with the proper permissions.
For this part, I also created a new task to set up a repository for generating the conda environment.
This repository should build PA's packaged conda environments for the execution of Jupyter notebooks via CI.
T325195


We have also looked into the list of notebooks scheduled by the PA team to collect requirements for this feature.
I added a new sheet to the Airflow migration spreadsheet with that information (PA Jobs):
https://docs.google.com/spreadsheets/d/1lfK5Idteh6zPSlCWyH34FJCl_Lcm8401Wm59Jgk-7wM
The conclusions of this analysis are in https://phabricator.wikimedia.org/T322666#8460702


I split part of this task's initial objectives into a separate task, T325181, since we discussed in standup that we'd do that work over the course of the next couple of weeks.