This subtask is about finishing the Airflow integration.
Status:
- Make it possible to execute Elixir from a SimpleSkeinOperator (see the skein sketch after this list)
- Make it possible to execute Elixir from a BashOperator and BashSensor (example after this list)
- Establish network egress from operators and sensors
- Sensor can detect snapshot existence (example after this list)
- Scraper runs successfully
- Data is loaded into Hive
- Copy outputs into HDFS and load from there, following https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/c151bd31a66f320b53267d427cc588b184f5b056/main/dags/commons/commons_impact_metrics_monthly_dag.py#L65 (sketched after this list)
- Per-wiki aggregation (sketched after this list)
- Optional: Improve diagnostics from pipeline components
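
SimpleSkeinOperator presumably wraps the skein library to launch a script in a YARN container. A minimal sketch of the equivalent raw skein submission; the application name, resource sizes, and the `scraper.exs` entry point are assumptions, not what the operator actually configures:

```python
import skein

# Hypothetical YARN application spec: name, resources, and the script
# are placeholders for whatever SimpleSkeinOperator sets up.
spec = skein.ApplicationSpec(
    name="elixir-scraper",
    master=skein.Master(
        resources=skein.Resources(memory="2 GiB", vcores=1),
        script="elixir scraper.exs",
    ),
)

with skein.Client() as client:
    app_id = client.submit(spec)
    print(f"Submitted YARN application {app_id}")
```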
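For the BashOperator/BashSensor route, a minimal sketch assuming the worker has an `elixir` binary on its PATH; `scraper.exs` is a hypothetical entry point:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.bash import BashSensor

with DAG(
    dag_id="elixir_from_bash_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    # BashSensor succeeds once its command exits 0, so any Elixir
    # expression can serve as the probe condition.
    elixir_probe = BashSensor(
        task_id="elixir_probe",
        bash_command="elixir -e 'System.halt(0)'",
        poke_interval=60,
    )

    # BashOperator runs the scraper itself.
    run_scraper = BashOperator(
        task_id="run_scraper",
        bash_command="elixir scraper.exs",
    )

    elixir_probe >> run_scraper
```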
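For snapshot detection, one standard option is the Hive provider's NamedHivePartitionSensor; the table and partition spec below are placeholders:

```python
from airflow.providers.apache.hive.sensors.named_hive_partition import (
    NamedHivePartitionSensor,
)

# Waits until the given snapshot partition appears in the Hive
# metastore. Table and partition names are hypothetical.
wait_for_snapshot = NamedHivePartitionSensor(
    task_id="wait_for_snapshot",
    partition_names=["wmde.source_table/snapshot={{ ds }}"],
    poke_interval=60 * 60,  # check hourly
)
```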
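For the copy-into-HDFS-and-load step, a sketch in the general shape of the linked commons_impact_metrics DAG; all paths and table names here are assumptions, not the ones that DAG uses:

```python
from airflow.operators.bash import BashOperator
from airflow.providers.apache.hive.operators.hive import HiveOperator

# Stage the scraper's local output into HDFS. Paths are hypothetical.
copy_to_hdfs = BashOperator(
    task_id="copy_to_hdfs",
    bash_command=(
        "hdfs dfs -mkdir -p /tmp/scraper_output/{{ ds }} && "
        "hdfs dfs -put -f /srv/scraper/output/*.parquet /tmp/scraper_output/{{ ds }}/"
    ),
)

# LOAD DATA INPATH moves the staged files into the table's warehouse
# location, so the staging directory is consumed by this step.
load_into_hive = HiveOperator(
    task_id="load_into_hive",
    hql="""
        LOAD DATA INPATH '/tmp/scraper_output/{{ ds }}'
        OVERWRITE INTO TABLE wmde.scraper_output
        PARTITION (snapshot='{{ ds }}')
    """,
)

copy_to_hdfs >> load_into_hive
```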
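For per-wiki aggregation, a sketch using the Spark provider's SparkSqlOperator; the tables, columns, and metric are placeholders:

```python
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

# Aggregate the loaded rows per wiki into a downstream table.
# Table and column names are hypothetical.
aggregate_per_wiki = SparkSqlOperator(
    task_id="aggregate_per_wiki",
    sql="""
        INSERT OVERWRITE TABLE wmde.metrics_by_wiki
        PARTITION (snapshot='{{ ds }}')
        SELECT wiki, COUNT(*) AS row_count
        FROM wmde.scraper_output
        WHERE snapshot = '{{ ds }}'
        GROUP BY wiki
    """,
)
```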
SQL and Airflow job:
- https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/22
- https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1957
We can now run the full job in an Airflow devenv. Only a few small deployment steps remain to move this into production:
- Final package release of the scraper, from the main branch.
- Warm up the artifact cache with this packaged release.
- Create the tables in production.
Follow-up:
- Write up some reflections on Airflow integration on Wikimedia infrastructure. (This will become a "deep dive" talk on 5 May.)
- Document the new tables in https://datahub.wikimedia.org/, including the completeness and time frame of backfilled data.