The monthly aggregation of Welcome Survey responses is currently done through a cron job that runs on @nettrom_WMF's personal account on stat1006. This is not an ideal setup, as it makes it difficult for other members of the Product Analytics team to help out if updates are needed or something breaks. Therefore, we want to centralize it and set it up in a more "production"-like data pipeline.
Description
| Status | Assigned | Task |
|---|---|---|
| Open | None | T294906 Puppet Improvements |
| Duplicate | jbond | T265138 Work required to prepare for puppet 7 |
| Resolved | SLyngshede-WMF | T273673 replace all puppet crons with systemd timers |
| Open | None | T316049 Unify all Product Analytics ETL jobs |
| Open | None | T296661 Product Analytics ETL Migration: Welcome Survey aggregates |
| Declined | None | T322533 MVP for Notebook Scheduler |
| Open | None | T340467 Enable Wmfdata-Python to access MariaDB replicas from the cluster |
| Open | None | T340469 Let user specify cnf to use when connecting to MariaDB |
| Open | None | T340472 Retrieve host & port info when connecting to MariaDB replicas on the cluster |
Event Timeline
We shouldn't do this until we have a better ETL solution, which will be worked on with Data Engineering in Q4.
We will be migrating this job as part of T316049: Unify all Product Analytics ETL jobs, but because this is a notebook which requires MariaDB access it will be one of the last ones to be migrated while we sort out how notebook-based data pipelines work within the new system.
@nettrom_WMF: I just noticed the tables monthly_overview & response_aggregates are located in growth_welcomesurvey db. Should that data be relocated to wmf_product db as part of the (eventual) migration?
I realize there are now three options for scheduling the data pipeline in https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb via Airflow:
- Return to @mforns's POC for running Jupyter notebooks in Airflow (T322534#8467770), incorporating the code from @xcollazo's notebook to allow wmfdata-py to query MariaDB when running on worker nodes
- Abstract some of the nitty-gritty details into an easy-to-use function for querying MariaDB replicas from an Airflow DAG, and move all of the Python code out of the Welcome Survey aggregates notebook and into an Airflow DAG. (The "pure" approach.)
- Ingest/sqoop the necessary data into the data lake and forget about running notebooks & querying replicas and just make it a pure PySpark-based job.
I'm beginning to think #3 is the best option.
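Whichever option is picked, the aggregation step itself can be kept as a plain function that takes a month and returns counts, so the scheduler only decides when to call it and where the rows come from. A minimal sketch of that idea follows. The row layout `(wiki, response_date, question, answer)`, the function name `aggregate_monthly`, and the output keys are all hypothetical; the actual schema of the survey data and of the aggregate tables is not given in this task.

```python
from collections import Counter
from datetime import date

# Hypothetical input rows: (wiki, response_date, question, answer).
# The real survey schema is not described in this task.
def aggregate_monthly(rows, year, month):
    """Count answers per (wiki, question, answer) for one calendar month."""
    counts = Counter()
    for wiki, response_date, question, answer in rows:
        # Keep only responses that fall inside the requested month.
        if response_date.year == year and response_date.month == month:
            counts[(wiki, question, answer)] += 1
    # Sort the keys so repeated runs for the same month give identical output.
    return [
        {
            "month": f"{year:04d}-{month:02d}",
            "wiki": wiki,
            "question": question,
            "answer": answer,
            "n": n,
        }
        for (wiki, question, answer), n in sorted(counts.items())
    ]
```

Because the month is an explicit argument, a scheduled run and a backfill of a missed month would be the same call with different arguments, regardless of whether the rows were read from a replica or from the data lake.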
I would very much like to see option 1) finished and used by people...
I think if we could snap the fingers and have that ready, that would be the best option.
Sadly, this project has fallen through the cracks of the data teams' reorg.
And I understand that you see option 3 as more realistic...
@VirginiaPoundstone, is finishing the Airflow Jupyter Operator something we could include in our Q3 planning?
This has immense value, and has been requested for a looong time...
FYI, I used these data aggregates in the registration decline investigation I just finished (T378211) and I noticed that the data stops in September 2024, so it seems like something has broken with the existing script.
I know how time-consuming these Airflow migrations are, but it would be a shame to see much more of the data get lost!
@nettrom_WMF Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!