As part of WE5.4.3, we need to export a subset of the data that currently lives in Hive (event.development_network_probe) to the production Puppet servers.
The goal is to run a query daily, export its results, and make them available to Puppet so they can be deployed to the CDN servers.
Proposed approach (from Data Engineering feedback on Slack):
- Generate the file daily on HDFS with Airflow + Spark.
- Use an "archiver" to manage file naming and ensure consistency.
- Configure the Puppet server to fetch the file from HDFS (e.g., via hdfs_rsync) and then deploy it to the CDN hosts.
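To make the naming step concrete, here is a minimal sketch of how the daily output location could be derived from the run date. It assumes the archiver's role is to move a single Spark output file to a stable, dated name. The base directory, the file name, and the `export_paths` helper are all hypothetical placeholders, not existing paths or code.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical HDFS base directory; the real location is still to be decided.
BASE_DIR = PurePosixPath("/user/example/network_probe_export")


def export_paths(run_date, base_dir=BASE_DIR, filename="probe_export.csv"):
    """Return the paths a daily run would use for the given date.

    spark_output_dir: where the Spark job would write its raw output.
    archived_file: the stable name the archiver would move the result to,
                   and the path the Puppet server would fetch.
    """
    partition = base_dir / f"date={run_date.isoformat()}"
    return {
        "spark_output_dir": partition / "_tmp",
        "archived_file": partition / filename,
    }


if __name__ == "__main__":
    paths = export_paths(date(2024, 1, 2))
    print(paths["spark_output_dir"])
    print(paths["archived_file"])
```

Keeping the date in the directory name and the file name fixed would let the fetch step on the Puppet side compute the path for any given day without listing HDFS.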
Open questions / next steps:
- Is it acceptable to give the Puppet server HDFS access?
- Is there an existing Airflow DAG that we can use as a template? Check https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags
Acceptance criteria:
- Daily job exports query results to HDFS.
- Puppet server fetches and stores the file in a way usable by Puppet manifests.
- File is deployed to CDN servers via the normal Puppet workflows.
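One possible way to verify the second criterion on the Puppet server is a freshness check on the fetched file before manifests consume it. The sketch below is an assumption about how that could look, not an agreed design: the `is_fresh` helper and the 36-hour threshold are hypothetical.

```python
import os
import time


def is_fresh(path, max_age_hours=36.0, now=None):
    """Return True if the file exists, is non-empty, and was modified recently.

    max_age_hours is a hypothetical threshold: a daily job plus some slack.
    """
    now = time.time() if now is None else now
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    age_hours = (now - st.st_mtime) / 3600
    return st.st_size > 0 and age_hours <= max_age_hours
```

A check like this could gate the fetch step so that a missing or empty export never replaces the last good copy.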
