Description

The monthly aggregation of Welcome Survey responses is currently done through a cron job that runs under @nettrom_WMF's personal account on stat1006. This is not an ideal setup, as it makes it difficult for other members of the Product Analytics team to help out if updates are needed or something breaks. We therefore want to centralize it and set it up in a more "production"-like data pipeline.
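To make the job concrete, the core of the monthly aggregation can be sketched as follows. This is a hypothetical illustration only: the field names (`timestamp`, `response`) and the `aggregate_by_month` helper are assumptions for this sketch, not the actual Welcome Survey schema or the code in the notebook.

```python
from collections import Counter
from datetime import datetime

def aggregate_by_month(responses):
    """Bucket raw survey rows by calendar month and count each response
    option. Returns {(year, month): Counter(option -> count)}."""
    monthly = {}
    for row in responses:
        ts = datetime.fromisoformat(row["timestamp"])
        key = (ts.year, ts.month)
        monthly.setdefault(key, Counter())[row["response"]] += 1
    return monthly

# Toy input standing in for rows pulled from the MariaDB replicas.
sample = [
    {"timestamp": "2023-01-05T12:00:00", "response": "reading"},
    {"timestamp": "2023-01-20T09:30:00", "response": "editing"},
    {"timestamp": "2023-02-01T08:15:00", "response": "reading"},
]
print(aggregate_by_month(sample))
```

The real job additionally writes the aggregates out to tables (monthly_overview and response_aggregates); the sketch only shows the grouping step.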
Status | Assigned | Task
---|---|---
Open | None | T294906 Puppet Improvements
Duplicate | jbond | T265138 Work required to prepare for puppet 7
Resolved | SLyngshede-WMF | T273673 replace all puppet crons with systemd timers
Open | mpopov | T316049 Unify all Product Analytics ETL jobs
Open | nettrom_WMF | T296661 Product Analytics ETL Migration: Welcome Survey aggregates
Open | None | T322533 MVP for Notebook Scheduler
Open | None | T340467 Enable wmfdata-py to access MariaDB replicas on the cluster
Open | None | T340469 Let user specify cnf to use when connecting to MariaDB
Open | None | T340472 Retrieve host & port info when connecting to MariaDB replicas on the cluster
Event Timeline
We shouldn't do this until we have a better ETL solution, which will be worked on with Data Engineering in Q4.
We will be migrating this job as part of T316049: Unify all Product Analytics ETL jobs. However, because this is a notebook that requires MariaDB access, it will be one of the last ones to be migrated while we sort out how notebook-based data pipelines work within the new system.
@nettrom_WMF: I just noticed that the monthly_overview and response_aggregates tables are located in the growth_welcomesurvey database. Should that data be relocated to the wmf_product database as part of the (eventual) migration?
I realize there are now three options for scheduling the data pipeline in https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb via Airflow:
- Return to @mforns's POC for running Jupyter notebooks in Airflow (T322534#8467770), incorporating the code from @xcollazo's notebook to allow wmfdata-py to query MariaDB when running on worker nodes
- Abstract the nitty-gritty details into an easy-to-use function for querying the MariaDB replicas from an Airflow DAG, and move all of the Python code out of the Welcome Survey aggregates notebook and into an Airflow DAG. (The "pure" approach.)
- Ingest/sqoop the necessary data into the data lake and forget about running notebooks & querying replicas and just make it a pure PySpark-based job.
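For option 2, the connection details that would need abstracting look roughly like the following. This is a hedged sketch, not wmfdata-py's actual API: the `resolve_replica_config` helper, the .cnf layout (modeled on a standard my.cnf `[client]` section, cf. T340469), and the example host/port values are all assumptions for illustration. Retrieving the right host and port per replica section is the problem T340472 tracks.

```python
import configparser

def resolve_replica_config(cnf_text):
    """Parse a my.cnf-style string and return the (host, port, user)
    needed to connect to a MariaDB replica."""
    parser = configparser.ConfigParser()
    parser.read_string(cnf_text)
    client = parser["client"]
    return client["host"], client.getint("port"), client["user"]

# Example .cnf contents; the host name and port are illustrative
# placeholders, not a verified production endpoint.
example_cnf = """
[client]
host = s4-analytics-replica.eqiad.wmnet
port = 3314
user = research
"""

host, port, user = resolve_replica_config(example_cnf)
print(host, port, user)
```

A real helper would layer credentials handling and per-wiki section lookup on top of this, but the point of option 2 is that DAG authors would only ever call one such function.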
I'm beginning to think #3 is the best option.
I would very much like to see option 1) finished and used by people...
I think if we could snap our fingers and have that ready, that would be the best option.
Sadly, this project has fallen through the cracks of the data teams reorg.
And I understand that you see option 3 as more realistic...
@VirginiaPoundstone: is finishing the Airflow Jupyter Operator something we could include in our Q3 planning?
This has immense value, and has been requested for a looong time...