This subtask is about finishing the Airflow integration.
Status:
- [x] Make it possible to execute Elixir from a SimpleSkeinOperator
- [x] Make it possible to execute Elixir from a BashOperator and BashSensor
- [x] Establish network egress from operators and sensors
- [x] Sensor can detect snapshot existence
- [x] Scraper runs successfully
- [x] Data is loaded into Hive
- Follow the approach in https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/c151bd31a66f320b53267d427cc588b184f5b056/main/dags/commons/commons_impact_metrics_monthly_dag.py#L65 to copy the outputs into HDFS and load them into Hive from there.
- [x] Per-wiki aggregation
- [x] Optional: Improve diagnostics from pipeline components
- [x] https://gitlab.com/wmde/technical-wishes/mediawiki_client_ex/-/merge_requests/34
- [x] https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/175
- [x] Needs test fix: https://gitlab.com/wmde/technical-wishes/mediawiki_client_ex/-/merge_requests/35
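
For the Hive loading step above, a minimal sketch of the "copy into HDFS, then load from there" pattern might look like the following. It only builds the shell commands and the HiveQL statement as strings. The paths, the table name, and the `snapshot` partition key are hypothetical placeholders, and the linked DAG may well do this differently.

```python
import shlex
from typing import List


def hdfs_copy_commands(local_path: str, hdfs_dir: str) -> List[str]:
    """Build shell commands that stage a local output file in HDFS.

    Assumption: the scraper writes its output to a local file first.
    """
    return [
        # Create the target directory if it does not exist yet.
        f"hdfs dfs -mkdir -p {shlex.quote(hdfs_dir)}",
        # Copy the file, overwriting a copy left behind by an earlier run.
        f"hdfs dfs -put -f {shlex.quote(local_path)} {shlex.quote(hdfs_dir)}",
    ]


def hive_load_statement(hdfs_dir: str, table: str, snapshot: str) -> str:
    """Build a HiveQL statement that loads the staged files into a table.

    Assumption: the table is partitioned by a column named `snapshot`.
    """
    return (
        f"LOAD DATA INPATH '{hdfs_dir}' "
        f"OVERWRITE INTO TABLE {table} "
        f"PARTITION (snapshot='{snapshot}')"
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for command in hdfs_copy_commands("/tmp/out/dewiki.ndjson", "/tmp/staging/2024-06"):
        print(command)
    print(hive_load_statement("/tmp/staging/2024-06", "example_db.example_table", "2024-06"))
```

The commands could be passed to a BashOperator and the statement to whichever Hive or Spark SQL operator the DAG already uses. Keeping them as plain strings makes the step easy to test outside Airflow.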
SQL and Airflow job:
- [x] https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/22
- [ ] To review: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1957
The full job now runs in an Airflow development environment. Only a few small deployment steps should be needed to move it into production:
- [ ] Final package release of the scraper, from the main branch.
- [ ] Warm up the artifact cache with this packaged release.
- [ ] Create the tables in production.
- [x] Write some [[ https://docs.google.com/document/d/112CVLDoCLvNEQMI436zqL1Xy-5XlMf6XL6lNvhdsSiw/edit?tab=t.0 | reflections ]] on Airflow integration on the Wikimedia infrastructure.