These should read from the EventGate stream `event.mediawiki_wmde_page_summary` and output to a newly created analytics table.
Limit scope to scraping German Wikipedia; we will expand to other wikis in later work.
- [x] Draft a new Airflow job following a template (a rough skeleton is sketched below).
- WIP: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commits/wmde-page-summary
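For orientation, a minimal sketch of what the DAG skeleton might look like; the `dag_id`, schedule, and task names are placeholder assumptions, not the values in the WIP branch:

```python
# Minimal DAG skeleton sketch; dag_id, start_date, schedule, and task
# names are placeholders, not what the WIP branch actually uses.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="wmde_page_summary",
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",  # cadence TBD; tied to Enterprise snapshot releases
    catchup=False,
    tags=["wmde"],
) as dag:
    # Task chain mirroring the checklist below:
    # wait_for_snapshot >> scrape >> wait_for_import >> aggregate
    ...
```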
- [ ] Write an Airflow sensor to detect that the Enterprise Snapshots are available.
- Will have to query the Enterprise API; a hedged sensor sketch follows below.
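A rough sketch of such a sensor, assuming a `requests`-based poke; the endpoint URL and response handling are assumptions about the Enterprise API, and authentication is omitted:

```python
# Hypothetical snapshot sensor: the endpoint and response shape are
# assumptions, not the real Enterprise API contract; auth is omitted.
import requests
from airflow.sensors.base import BaseSensorOperator


class EnterpriseSnapshotSensor(BaseSensorOperator):
    """Pokes the Enterprise API until a snapshot for `wiki` is available."""

    def __init__(self, wiki: str, **kwargs):
        super().__init__(**kwargs)
        self.wiki = wiki

    def poke(self, context) -> bool:
        # Placeholder URL; substitute the real snapshots endpoint.
        resp = requests.get(
            f"https://api.enterprise.wikimedia.com/v2/snapshots/{self.wiki}_namespace_0",
            timeout=30,
        )
        return resp.ok
```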
- [ ] Call the scraper from Airflow (see the task sketch after this item):
- Add a conda-dist build step (see workflow-utils) to the scraper CI to produce a .tgz which packages Elixir, Erlang, and the compiled scraper objects.
- Add this artifact to the Airflow "scrape" task.
- `mix scrape dewiki`
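One hedged way to wire up the "scrape" task; the artifact path and the layout of the unpacked .tgz are assumptions:

```python
# Hypothetical scrape task; the artifact location and the bin/ layout of
# the unpacked conda-dist package are assumptions.
from airflow.operators.bash import BashOperator

scrape = BashOperator(
    task_id="scrape",
    bash_command=(
        "mkdir -p scrape_env && "
        "tar -xzf {{ params.artifact }} -C scrape_env && "
        "cd scrape_env && bin/mix scrape {{ params.wiki }}"
    ),
    params={
        "artifact": "/srv/airflow/artifacts/scraper.tgz",  # placeholder path
        "wiki": "dewiki",
    },
)
```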
- [ ] Write an Airflow sensor which can detect that the page summaries are completely imported into Hive (see the sketch after this item).
- Wait for the hourly partition on the `event.mediawiki_wmde_page_summary` table for the hour *after* the scraper has completed (or is there a failsafe way to trigger the sensor immediately after the scraper completes?).
- At this point the page summary data is guaranteed to be in Hive.
- Expect a roughly 3-hour lag between emitting the page summary events to EventGate (through Kafka and Gobblin) and their persistence in Hive.
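A sketch using the stock Hive provider sensor; WMF's airflow-dags repo has its own partition-sensing helpers that are likely preferable, and the partition filter assumes the usual year/month/day/hour layout of event tables:

```python
# Sketch only: the partition keys and templated values are assumptions
# about the event table layout; poke_interval reflects the ~3h lag.
from airflow.providers.apache.hive.sensors.hive_partition import (
    HivePartitionSensor,
)

wait_for_import = HivePartitionSensor(
    task_id="wait_for_page_summary_import",
    schema="event",
    table="mediawiki_wmde_page_summary",
    # Target the hour *after* the scrape finishes.
    partition=(
        "year={{ data_interval_end.year }} AND "
        "month={{ data_interval_end.month }} AND "
        "day={{ data_interval_end.day }} AND "
        "hour={{ data_interval_end.hour }}"
    ),
    poke_interval=15 * 60,
)
```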
- [x] Recreate the "[[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md#wiki-summary | wiki summary ]]" metrics that were removed in [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/commit/8ba951e6644835d11161d4200ce06bd153e71a21 | this patch ]].
- In review: https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/20
- [ ] Call aggregation Spark SQL scripts from Airflow (a sketch follows this item).
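This could be done with the upstream Spark provider as below, though the Spark helpers already in airflow-dags may be the more idiomatic choice; the script path is a placeholder:

```python
# Hedged sketch; the SQL file path is a placeholder. A path ending in
# .sql is submitted via `spark-sql -f`.
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator

aggregate = SparkSqlOperator(
    task_id="aggregate_page_summaries",
    sql="/srv/wmde/analytics/aggregate_page_summaries.sql",
    master="yarn",
)
```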
- [ ] Create the new aggregation tables in production.
- [ ] Enable the job in production.
== Resources
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide
- https://airflow.apache.org/docs/