
WME Pageviews DAG for HDFS to S3 Transfer
Open, In Progress, High, Public

Description

Create an hourly Airflow DAG to transfer pageview partitions from WMF HDFS to WME S3.

Credentials via IAM Anywhere (no static AWS keys)
Streaming transfer via hdfs dfs -cat + boto3.upload_fileobj
S3 writes use AES256 server-side encryption
Egress via url-downloader.eqiad.wikimedia.org:8080
Deterministic S3 keys — retries safely overwrite
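
A minimal sketch of how the points above could fit together, not the actual implementation from the merge request. The key layout (year=/month=/day=/hour=), the http:// proxy scheme, and the function names s3_key, make_s3_client and stream_to_s3 are assumptions for illustration. The client is built without explicit keys, so credentials would come from the default provider chain (for example an IAM Anywhere credential helper configured outside the code).

```python
import subprocess
from datetime import datetime

# Assumption: the ticket gives only host:port; the http:// scheme is a guess.
PROXY = "http://url-downloader.eqiad.wikimedia.org:8080"


def s3_key(prefix: str, hour: datetime, filename: str) -> str:
    """Build a key from the data interval only, so a retry targets the same object.

    The partition layout here is hypothetical.
    """
    return (
        f"{prefix}/year={hour:%Y}/month={hour:%m}/day={hour:%d}"
        f"/hour={hour:%H}/{filename}"
    )


def make_s3_client():
    """Create an S3 client that egresses through the proxy.

    boto3 is imported lazily so the helpers above and below can be
    exercised without it. No access keys are passed in.
    """
    import boto3
    from botocore.config import Config

    return boto3.client(
        "s3", config=Config(proxies={"http": PROXY, "https": PROXY})
    )


def stream_to_s3(s3, hdfs_path, bucket, key, cat_cmd=("hdfs", "dfs", "-cat")):
    """Pipe `hdfs dfs -cat <path>` straight into S3 without a local temp file."""
    proc = subprocess.Popen([*cat_cmd, hdfs_path], stdout=subprocess.PIPE)
    try:
        # upload_fileobj reads the pipe in chunks; AES256 requests SSE-S3.
        s3.upload_fileobj(
            proc.stdout, bucket, key,
            ExtraArgs={"ServerSideEncryption": "AES256"},
        )
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        # A failed cat may have uploaded a truncated object; raising lets
        # the task retry and overwrite the same deterministic key.
        raise RuntimeError(f"{cat_cmd[0]} exited with {returncode} for {hdfs_path}")
```

Because the key depends only on the interval and file name, a retried task writes to the same object instead of leaving a duplicate under a new name.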

Acceptance Criteria

DAG runs successfully for a scheduled hourly interval
All 4 partitions transferred and validated (Content-Length > 0)
Retries produce correct output
No static AWS credentials used
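
The Content-Length check in the criteria above could be sketched as follows. This is an illustration under assumptions: validate_objects is a hypothetical helper, the expected keys are supplied by the caller, and the additional AES256 check is not part of the stated criteria. It relies only on the S3 head_object call, whose response includes ContentLength.

```python
def validate_objects(s3, bucket, keys, require_sse="AES256"):
    """Return {key: size} for all keys, or raise if any object fails a check.

    `s3` is any client exposing head_object(Bucket=..., Key=...), such as a
    boto3 S3 client. The SSE check is an extra, optional assumption.
    """
    sizes = {}
    problems = []
    for key in keys:
        head = s3.head_object(Bucket=bucket, Key=key)
        size = head["ContentLength"]
        if size <= 0:
            problems.append(f"{key}: empty object")
        if require_sse and head.get("ServerSideEncryption") != require_sse:
            problems.append(f"{key}: not encrypted with {require_sse}")
        sizes[key] = size
    if problems:
        raise ValueError("validation failed: " + "; ".join(problems))
    return sizes
```

A task would call this with the four keys written for the interval and fail if any of them is missing, empty, or unencrypted (a missing key makes head_object itself raise).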

Image
docker-registry.wikimedia.org/repos/wme/pageviews-hdfs-transfer:v0.1.1

Details

Related Changes in GitLab:
Title: Draft: Added pageview_wme_transfer
Reference: repos/data-engineering/airflow-dags!2203
Author: sg912
Source Branch: feature/pageview-wme-transfer
Dest Branch: main