Write a new Airflow job that calls the Technical Wishes scraper to create per-page summaries, then aggregates them into per-wiki summaries.
Limit the initial scope to German Wikipedia; we will expand to other wikis in later work.
- Draft a new Airflow job following a template.
- T414803: Write an Airflow sensor to detect Enterprise Snapshots
- Call the scraper from Airflow:
- T414804: Package scraper binary for Airflow job
- Make sure Enterprise credentials are wired through.
  - Scrape dewiki with `mix scrape dewiki`; a smaller wiki such as ffwiki can also be scraped to test the infrastructure.
- T414809: Write an Airflow sensor for scraper page summaries being fully imported to Hive
- Recreate the "wiki summary" metrics that were removed in this patch.
- Call aggregation Spark SQL scripts from Airflow.
- Create the new aggregation tables in production Hive.
- Set HDFS permissions, for example (the target path below is a placeholder for the new table's directory):
  ```
  sudo -u analytics-wmde kerberos-run-command analytics-wmde \
    hdfs dfs -chown -R analytics-wmde:analytics-privatedata-users <table_path>
  ```
- Enable job in production.
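The snapshot sensor (T414803) can follow Airflow's usual poke pattern. A minimal sketch of the poke logic, assuming snapshots land as dated files under a base directory (the path layout and file names here are assumptions, not the real Enterprise dump layout):

```python
import os


def snapshot_available(base_dir: str, wiki: str, snapshot: str) -> bool:
    """Poke-style check: return True once the expected snapshot file exists.

    The layout base_dir/<snapshot>/<wiki>.json.tar.gz is a placeholder;
    the real Enterprise snapshot layout may differ.
    """
    path = os.path.join(base_dir, snapshot, f"{wiki}.json.tar.gz")
    return os.path.exists(path)
```

In the DAG this check would live in the `poke()` method of a `BaseSensorOperator` subclass, so Airflow handles the retry interval and timeout.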
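Calling the packaged scraper (T414804) from Airflow with credentials wired through could look like the sketch below. The binary name and environment variable names are placeholders; the packaging work will define the real interface:

```python
import os
import subprocess


def build_scrape_command(wiki: str) -> tuple:
    """Assemble the scraper invocation and its environment.

    "technical-wishes-scraper" and the ENTERPRISE_* variable names are
    hypothetical; substitute whatever the packaged binary actually expects.
    """
    cmd = ["technical-wishes-scraper", "scrape", wiki]
    env = dict(os.environ)
    # Pass Enterprise credentials through from the Airflow worker's secrets.
    env["ENTERPRISE_USER"] = os.environ.get("ENTERPRISE_USER", "")
    env["ENTERPRISE_TOKEN"] = os.environ.get("ENTERPRISE_TOKEN", "")
    return cmd, env


def run_scrape(wiki: str) -> None:
    cmd, env = build_scrape_command(wiki)
    # check=True makes the Airflow task fail loudly if the scraper fails.
    subprocess.run(cmd, env=env, check=True)
```

Running `run_scrape("ffwiki")` first would exercise the whole pipeline cheaply before scraping dewiki.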
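The Hive-import sensor (T414809) reduces to checking that every expected partition is present. A sketch, assuming the table is partitioned by wiki and snapshot (partition key names are illustrative):

```python
def summaries_fully_imported(expected_wikis, snapshot, hive_partitions):
    """Poke-style check: True once every wiki's page-summary partition
    for the given snapshot is present in Hive.

    hive_partitions is assumed to be a list of "wiki=.../snapshot=..."
    strings, as returned by SHOW PARTITIONS.
    """
    needed = {f"wiki={w}/snapshot={snapshot}" for w in expected_wikis}
    return needed <= set(hive_partitions)
```

For the initial dewiki-only scope the expected list has a single entry, but the same check extends unchanged when more wikis are added.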
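The per-wiki aggregation step is plain GROUP BY work. A sketch of the shape of the query, with sqlite3 standing in for Spark SQL; the table and column names are illustrative, not the production Hive schema:

```python
import sqlite3

# Illustrative aggregation: roll per-page summaries up to per-wiki totals.
AGGREGATION_SQL = """
    SELECT wiki,
           COUNT(*)            AS page_count,
           SUM(template_count) AS template_count
    FROM page_summary
    GROUP BY wiki
"""


def aggregate(rows):
    """Aggregate per-page rows (wiki, page_id, template_count) per wiki."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_summary (wiki TEXT, page_id INT, template_count INT)"
    )
    conn.executemany("INSERT INTO page_summary VALUES (?, ?, ?)", rows)
    return conn.execute(AGGREGATION_SQL).fetchall()
```

Modulo dialect details, the same statement could run as one of the Spark SQL scripts the Airflow job invokes, writing into the new aggregation tables.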