Write a new Airflow job that calls the Technical Wishes [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/ | scraper ]] to create per-page summaries, then aggregates them into per-wiki summaries.
Limit the initial scope to German Wikipedia; we will expand to other wikis in later work.
- [x] Draft a new Airflow job following a template.
  - WIP: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commits/wmde-page-summary
  - [ ] {T414803}
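For orientation, a minimal skeleton of the new DAG might look like the sketch below. It uses plain Airflow 2.x APIs only; the real job should follow the airflow-dags repo's own template and shared helpers, and the DAG id, schedule, owner, and task ids here are assumptions.
```python
# Minimal sketch only, assuming plain Airflow 2.x APIs; the real DAG should be
# based on the repo's template. DAG id, schedule, owner and task ids are made up.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="wmde_page_summary",               # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",                      # assumed cadence; align with HTML dump releases
    catchup=False,
    default_args={
        "owner": "analytics-wmde",            # assumed owner
        "retries": 3,
        "retry_delay": timedelta(minutes=10),
    },
    tags=["wmde", "technical-wishes"],
) as dag:
    scrape = EmptyOperator(task_id="scrape_dewiki")              # placeholder: scraper run
    aggregate = EmptyOperator(task_id="aggregate_wiki_summary")  # placeholder: Spark SQL aggregation

    scrape >> aggregate
```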
- [ ] Call the scraper from Airflow (sketch below):
  - [ ] {T414804}
  - [ ] Make sure Enterprise credentials are wired through.
  - Run `mix scrape dewiki`; a smaller wiki such as `ffwiki` can also be scraped to test the infrastructure.
  - [ ] {T414803}
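A rough sketch of the scraper task, assuming it can be run with a BashOperator and that the scraper reads its Wikimedia Enterprise credentials from environment variables. The checkout path, Airflow Variable keys, and env var names are all assumptions.
```python
# Rough sketch, assuming the scraper is run via a BashOperator and reads its
# Wikimedia Enterprise credentials from environment variables. The checkout path,
# Airflow Variable keys and env var names are hypothetical.
from airflow.operators.bash import BashOperator

scrape_dewiki = BashOperator(
    task_id="scrape_dewiki",
    bash_command="cd /srv/scrape-wiki-html-dump && mix scrape dewiki",  # assumed checkout location
    env={
        # Resolved at runtime from Airflow Variables (hypothetical keys):
        "WME_USERNAME": "{{ var.value.wikimedia_enterprise_username }}",
        "WME_PASSWORD": "{{ var.value.wikimedia_enterprise_password }}",
    },
)
```
The templated `env` keeps credentials out of the DAG file itself; they would be set once as Airflow Variables (or a Connection) on the instance.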
- [x] Recreate the "[[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md#wiki-summary | wiki summary ]]" metrics that were removed in [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/commit/8ba951e6644835d11161d4200ce06bd153e71a21 | this patch ]].
  - In review: https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/20
- [ ] Call the aggregation Spark SQL scripts from Airflow (operator sketch below).
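A hedged sketch of triggering the aggregation from Airflow, using the stock Spark provider operator; the airflow-dags repo's own Spark helpers should be preferred in the final DAG. The script location, table names, and arguments are assumptions.
```python
# Hedged sketch using the stock Spark provider; the airflow-dags repo's own Spark
# helpers should be preferred. Script path, table names and arguments are made up.
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

aggregate_wiki_summary = SparkSubmitOperator(
    task_id="aggregate_wiki_summary",
    application="hdfs:///path/to/aggregate_wiki_summary.py",  # hypothetical script location
    application_args=[
        "--source_table", "wmde.page_summary",       # hypothetical per-page summary table
        "--destination_table", "wmde.wiki_summary",  # hypothetical per-wiki summary table
        "--snapshot", "{{ ds }}",
    ],
)
```
The `{{ ds }}` macro passes the run's logical date so each run can write to its own snapshot partition.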
- [ ] Create the new aggregation tables in production Hive (DDL sketch below).
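For the table creation, a sketch of what the DDL could look like, expressed through PySpark; the database, table name, columns, partitioning, and location are all placeholders to be replaced with the real wiki-summary schema from metrics.md.
```python
# Sketch only: database, table, columns, partitioning and location are placeholders;
# the real schema should mirror the wiki-summary metrics being recreated above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS wmde.wiki_summary (
        wiki   STRING,
        metric STRING,
        value  BIGINT
    )
    PARTITIONED BY (snapshot STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///path/to/wmde/wiki_summary'
""")
```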
- [ ] Set HDFS permissions, for example using these command snippets:
```
# Run HDFS commands as analytics-wmde with a valid Kerberos ticket:
sudo -u analytics-wmde kerberos-run-command analytics-wmde <command>
# Change ownership of the new table directories (fill in the target path):
hdfs dfs -chown -R analytics-wmde:analytics-privatedata-users <path>
```
- [ ] Enable the job in production.
== Resources
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide
- https://airflow.apache.org/docs/