
Product Analytics ETL Migration: Welcome Survey aggregates
Open, Medium, Public

Description

The monthly aggregation of Welcome Survey responses is currently done through a cron job that runs under @nettrom_WMF's personal account on stat1006. This is not an ideal setup, as it makes it difficult for other members of the Product Analytics team to help out if updates are needed or something breaks. Therefore, we want to centralize it and set it up in a more "production"-like data pipeline.

Event Timeline

ldelench_wmf moved this task from Triage to Upcoming Quarter on the Product-Analytics board.
mpopov subscribed.

We shouldn't do this until we have a better ETL solution, which will be worked on with Data Engineering in Q4.

We will be migrating this job as part of T316049: Unify all Product Analytics ETL jobs, but because this is a notebook that requires MariaDB access, it will be one of the last ones to be migrated while we sort out how notebook-based data pipelines work within the new system.

mpopov renamed this task from "Welcome Survey: move cron job to analytics-product system user" to "Product Analytics ETL Migration: Welcome Survey aggregates". Mar 27 2023, 6:31 PM
mpopov assigned this task to nettrom_WMF.
mpopov updated Other Assignee, removed: nettrom_WMF.
mpopov updated the task description.

@nettrom_WMF: I just noticed the tables monthly_overview & response_aggregates are located in the growth_welcomesurvey db. Should that data be relocated to the wmf_product db as part of the (eventual) migration?

I realize there are now three options for scheduling the data pipeline in https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb via Airflow:

  1. Return to @mforns's POC for running Jupyter notebooks in Airflow (T322534#8467770), incorporating the code from @xcollazo's notebook to allow wmfdata-py to query MariaDB when running on worker nodes.
  2. Abstract the nitty-gritty details into an easy-to-use function for querying the MariaDB replicas from an Airflow DAG, and pull all of the Python code out of the Welcome Survey aggregates notebook and into that DAG. (The "pure" approach; see the sketch after this list.)
  3. Ingest/sqoop the necessary data into the data lake, forget about running notebooks and querying replicas, and just make it a pure PySpark-based job.
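
To make option 2 a bit more concrete, here's a rough sketch of the helper-plus-DAG shape I have in mind. Everything in it is a made-up placeholder (the helper name, the replica host, the query, the credentials path), assuming Airflow 2.x TaskFlow and pymysql; a real version would pull connection details from the analytics client configuration rather than hard-coding them:

```python
# Illustrative sketch of option 2, not an existing API: a small helper that
# hides the MariaDB replica connection details, called from an Airflow task.
import pendulum
import pymysql
from airflow.decorators import dag, task


def run_replica_query(query, host, database, port=3306):
    """Run a read-only query against a MariaDB replica; return rows as dicts."""
    conn = pymysql.connect(
        host=host,
        port=port,
        database=database,
        read_default_file="~/.my.cnf",  # assumed credentials file
        cursorclass=pymysql.cursors.DictCursor,
    )
    try:
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchall()
    finally:
        conn.close()


@dag(
    schedule="@monthly",
    start_date=pendulum.datetime(2023, 4, 1, tz="UTC"),
    catchup=False,
)
def welcomesurvey_aggregates():
    @task
    def aggregate():
        # Placeholder query and host; the real aggregation SQL lives in the notebook.
        rows = run_replica_query(
            "SELECT DATABASE() AS db, NOW() AS ts",
            host="dbstore-replica.example",  # hypothetical replica host
            database="enwiki",
        )
        # ... transform `rows` and write them to the aggregate tables ...

    aggregate()


welcomesurvey_aggregates()
```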

I'm beginning to think #3 is the best option.
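
For comparison, option 3 would look roughly like the following, assuming the survey responses have already been sqooped into the data lake. The table and column names here are made up for illustration; the real source and target tables would need to be agreed on (see the growth_welcomesurvey vs. wmf_product question above):

```python
# Hypothetical sketch of option 3: a pure PySpark job over sqooped data.
# Table and column names are placeholders, not real data-lake tables.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("welcomesurvey-monthly-aggregates").getOrCreate()

# Assumed: raw responses were sqooped into this (made-up) Hive table.
responses = spark.read.table("wmf_raw.welcomesurvey_responses")

monthly = (
    responses
    .withColumn("month", F.date_format("response_dt", "yyyy-MM"))
    .groupBy("month", "wiki", "question", "response")
    .agg(F.count("*").alias("n_responses"))
)

# Overwrite the (made-up) target table with the fresh aggregates.
monthly.write.mode("overwrite").saveAsTable("wmf_product.welcomesurvey_monthly_overview")
```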

I would very much like to see option 1) finished and used by people...
I think if we could snap our fingers and have that ready, that would be the best option.
Sadly, this project has fallen through the cracks of the data teams reorg.
And I understand that you see option 3 as more realistic...
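
For reference, the basic shape of option 1 could be as simple as wrapping papermill in an Airflow task. This is only a sketch of the idea, not the actual POC code from T322534#8467770, and the paths and parameters are illustrative placeholders:

```python
# Sketch of option 1's shape: execute the notebook itself from an Airflow
# task via papermill. Paths and parameter names are made up for illustration.
import pendulum
import papermill as pm
from airflow.decorators import dag, task


@dag(
    schedule="@monthly",
    start_date=pendulum.datetime(2023, 4, 1, tz="UTC"),
    catchup=False,
)
def welcomesurvey_notebook():
    @task
    def run_notebook(ds=None):
        # Executes the notebook top to bottom, writing an output copy
        # so failed runs can be inspected cell by cell.
        pm.execute_notebook(
            "T275172_survey_aggregation.ipynb",             # input notebook
            f"/tmp/T275172_survey_aggregation_{ds}.ipynb",  # executed copy
            parameters={"run_date": ds},  # assumes the notebook is parameterized
        )

    run_notebook()


welcomesurvey_notebook()
```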

@VirginiaPoundstone: is finishing the Airflow Jupyter Operator something we could include in our Q3 planning?
This has immense value, and has been requested for a looong time...