These should read from the EventGate stream `event.mediawiki_wmde_page_summary` and output to a newly-created analytics table.
Limit scope to scraping German Wikipedia; we will expand to other wikis in later work.
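As a rough sketch of the hourly ingestion query: only the source table `event.mediawiki_wmde_page_summary` is given by this task; the target table name, the `database` filter column, and the `year`/`month`/`day`/`hour` partition columns are assumptions to verify against the actual event schema.

```python
# Hypothetical sketch of the SQL one hourly ingestion run might execute.
INGEST_SQL = """
INSERT INTO {target_table}
SELECT *
FROM event.mediawiki_wmde_page_summary
WHERE database = 'dewiki'  -- German Wikipedia only, for now
  AND year = {year} AND month = {month} AND day = {day} AND hour = {hour}
"""

def render_ingest_sql(target_table, year, month, day, hour):
    """Fill the template for one hourly partition."""
    return INGEST_SQL.format(target_table=target_table,
                             year=year, month=month, day=day, hour=hour)
```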
- [x] Draft a new Airflow job following a template.
- WIP: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commits/wmde-page-summary
- [ ] Write an Airflow sensor to detect that the Enterprise Snapshots are available.
- Will have to query the Enterprise API.
- [ ] Write an Airflow sensor to detect that the page summaries have been completely imported into Hive.
- Wait for the hourly partition *after* the scraper completes to appear in Hive. This matters because there is a roughly 3-hour lag between a page summary event being emitted and it appearing in Hive.
- [x] Recreate the "[[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md#wiki-summary | wiki summary ]]" metrics that were removed in [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/commit/8ba951e6644835d11161d4200ce06bd153e71a21 | this patch ]].
- In review: https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/20
- [ ] Call the aggregation Spark SQL scripts from Airflow.
- [ ] Create the new aggregation tables in production.
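A possible shape for the Enterprise snapshot sensor's poke logic. The response format assumed below (a list of snapshot objects carrying an `identifier` and an ISO-8601 `date_modified`) is an assumption about the Enterprise API, not confirmed here; in the DAG this function would sit inside a `PythonSensor` that fetches and parses the API response.

```python
from datetime import datetime

def snapshot_is_fresh(api_response, wiki, since):
    """Poke check: return True once the Enterprise API reports a snapshot
    for `wiki` (e.g. 'dewiki') modified at or after `since`.
    The response shape assumed here is a list of dicts with an
    'identifier' and an ISO-8601 'date_modified' -- verify against the
    real Enterprise API before relying on this."""
    for snapshot in api_response:
        if snapshot.get("identifier") == wiki:
            modified = datetime.fromisoformat(snapshot["date_modified"])
            return modified >= since
    return False
```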
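For the Hive import sensor, the partition to wait on can be derived from the scraper's completion time. A sketch of that arithmetic follows; the `year=/month=/day=/hour=` spec format is what a `HivePartitionSensor`-style check would consume, though the exact operator is an implementation choice.

```python
from datetime import datetime, timedelta

def partition_after(completed_at):
    """Return the spec of the first hourly Hive partition strictly after
    the scraper finished. Per the plan above, the sensor waits for *this*
    partition rather than the one containing `completed_at`: once it has
    appeared (allowing for the ~3-hour event-to-Hive lag), the hour
    holding the scraper's final events should be fully loaded."""
    next_hour = completed_at.replace(minute=0, second=0, microsecond=0) \
        + timedelta(hours=1)
    return (f"year={next_hour.year}/month={next_hour.month}"
            f"/day={next_hour.day}/hour={next_hour.hour}")
```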
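For calling the aggregation scripts, one minimal option (short of whatever Spark helpers the airflow-dags repo provides) is a `BashOperator` running the Spark SQL CLI. A sketch of the command construction: `-f` and `-d key=value` are standard spark-sql flags for running a file with `${key}` variable substitution, while the `spark3-sql` binary name, the script path, and the parameter names are assumptions about the target environment.

```python
def spark_sql_command(sql_file, params):
    """Build the CLI invocation for one aggregation script.
    `-f` runs a SQL file; each `-d key=value` supplies a ${key}
    substitution, as in the stock spark-sql CLI. The `spark3-sql`
    binary name and parameter names here are placeholders."""
    command = ["spark3-sql", "-f", sql_file]
    for key, value in sorted(params.items()):
        command += ["-d", f"{key}={value}"]
    return command
```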
== Resources
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide
- https://airflow.apache.org/docs/