Description

The monthly aggregation of Welcome Survey responses is currently done through a cron job that runs under @nettrom_WMF's personal account on stat1006. This is not an ideal setup, as it makes it difficult for other members of the Product Analytics team to help out if updates are needed or something breaks. We therefore want to centralize it and set it up in a more "production"-like data pipeline.
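To make the job concrete, the core of the monthly aggregation can be sketched as follows. This is a hypothetical illustration only: the field names (`timestamp`, `response`) and the `aggregate_by_month` helper are assumptions for this sketch, not the actual Welcome Survey schema or the code in the notebook.

```python
from collections import Counter
from datetime import datetime

def aggregate_by_month(responses):
    """Bucket raw survey rows by calendar month and count each response
    option. Returns {(year, month): Counter(option -> count)}."""
    monthly = {}
    for row in responses:
        ts = datetime.fromisoformat(row["timestamp"])
        key = (ts.year, ts.month)
        monthly.setdefault(key, Counter())[row["response"]] += 1
    return monthly

# Toy input standing in for rows pulled from the MariaDB replicas.
sample = [
    {"timestamp": "2023-01-05T12:00:00", "response": "reading"},
    {"timestamp": "2023-01-20T09:30:00", "response": "editing"},
    {"timestamp": "2023-02-01T08:15:00", "response": "reading"},
]
print(aggregate_by_month(sample))
```

The real job additionally writes the aggregates out to tables (monthly_overview and response_aggregates); the sketch only shows the grouping step.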
Status | Assigned | Task
---|---|---
Open | None | T294906 Puppet Improvements
Duplicate | jbond | T265138 Work required to prepare for puppet 7
Resolved | SLyngshede-WMF | T273673 replace all puppet crons with systemd timers
Open | mpopov | T316049 Unify all Product Analytics ETL jobs
Open | nettrom_WMF | T296661 Product Analytics ETL Migration: Welcome Survey aggregates
Open | None | T322533 MVP for Notebook Scheduler
Open | None | T340467 Enable wmfdata-py to access MariaDB replicas on the cluster
Open | None | T340469 Let user specify cnf to use when connecting to MariaDB
Open | None | T340472 Retrieve host & port info when connecting to MariaDB replicas on the cluster
Event Timeline
We shouldn't do this until we have a better ETL solution, which will be worked on with Data Engineering in Q4.
We will be migrating this job as part of T316049: Unify all Product Analytics ETL jobs. However, because this is a notebook that requires MariaDB access, it will be one of the last ones to be migrated while we sort out how notebook-based data pipelines work within the new system.
@nettrom_WMF: I just noticed that the monthly_overview and response_aggregates tables are located in the growth_welcomesurvey database. Should that data be relocated to the wmf_product database as part of the (eventual) migration?
I realize there are now three options for scheduling the data pipeline in https://github.com/nettrom/Growth-welcomesurvey-2018/blob/master/T275172_survey_aggregation.ipynb via Airflow:
- Return to @mforns's POC for running Jupyter notebooks in Airflow (T322534#8467770), incorporating the code from @xcollazo's notebook to allow wmfdata-py to query MariaDB when running on worker nodes
- Abstract the nitty-gritty details into an easy-to-use function for querying the MariaDB replicas from an Airflow DAG, and move all of the Python code out of the Welcome Survey aggregates notebook and into an Airflow DAG. (The "pure" approach.)
- Ingest/sqoop the necessary data into the data lake and forget about running notebooks & querying replicas and just make it a pure PySpark-based job.
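For option 2, the connection details that would need abstracting look roughly like the following. This is a hedged sketch, not wmfdata-py's actual API: the `resolve_replica_config` helper, the .cnf layout (modeled on a standard my.cnf `[client]` section, cf. T340469), and the example host/port values are all assumptions for illustration. Retrieving the right host and port per replica section is the problem T340472 tracks.

```python
import configparser

def resolve_replica_config(cnf_text):
    """Parse a my.cnf-style string and return the (host, port, user)
    needed to connect to a MariaDB replica."""
    parser = configparser.ConfigParser()
    parser.read_string(cnf_text)
    client = parser["client"]
    return client["host"], client.getint("port"), client["user"]

# Example .cnf contents; the host name and port are illustrative
# placeholders, not a verified production endpoint.
example_cnf = """
[client]
host = s4-analytics-replica.eqiad.wmnet
port = 3314
user = research
"""

host, port, user = resolve_replica_config(example_cnf)
print(host, port, user)
```

A real helper would layer credentials handling and per-wiki section lookup on top of this, but the point of option 2 is that DAG authors would only ever call one such function.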
I'm beginning to think #3 is the best option.
I would very much like to see option 1) finished and used by people...
I think if we could snap our fingers and have that ready, that would be the best option.
Sadly, this project has fallen through the cracks of the data teams reorg.
And I understand that you see option 3 as more realistic...
@VirginiaPoundstone: is finishing the Airflow Jupyter Operator something we could include in our Q3 planning?
This has immense value, and has been requested for a looong time...