Write a new Airflow job which calls the Technical Wishes [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/ | scraper ]] to create per-page summaries, then aggregates them into per-wiki summaries. The job should read from the EventGate stream `event.mediawiki_wmde_page_summary` and output to a newly-created analytics table.
Limit the initial scope to German Wikipedia; we will expand to other wikis in later work.
- [x] Draft a new Airflow job following a template.
- WIP: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commits/wmde-page-summary
- [ ] Write an Airflow sensor to detect that the Enterprise Snapshots are available.
  - Will have to query the Enterprise API. {T414803}
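The sensor's poke logic could be a plain function polling the Enterprise API, wrapped in an Airflow `PythonSensor`. This is only a sketch: the response shape and field names (`snapshots`, `identifier`, `date_modified`) are assumptions, not the actual Enterprise API contract.

```python
from datetime import datetime


def snapshot_is_available(api_response: dict, wiki: str, as_of: datetime) -> bool:
    """Return True if the Enterprise API reports a snapshot for `wiki`
    dated on or after `as_of`.

    The response shape used here is a guess; check the real Enterprise
    API docs before wiring this into a PythonSensor's poke callable.
    """
    for snapshot in api_response.get("snapshots", []):
        if snapshot.get("identifier") != wiki:
            continue
        modified = datetime.fromisoformat(snapshot["date_modified"])
        if modified >= as_of:
            return True
    return False
```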
- [ ] Call the scraper from Airflow:
  - [ ] Add a conda-dist build step (see workflow-utils) to the scraper CI to produce a .tgz which packages Elixir, Erlang, and the compiled scraper objects. {T414804}
  - [ ] Add this artifact to the Airflow "scrape" task. Make sure Enterprise credentials are wired through.
- `mix scrape dewiki`
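The scrape task could shell out to the unpacked artifact with something like a `BashOperator`. A minimal sketch of assembling that invocation: the `mix scrape <wiki>` command comes from this task, but the environment-variable names for the Enterprise credentials are hypothetical.

```python
import os


def build_scrape_command(wiki: str) -> list[str]:
    """Argv for one scrape run, e.g. `mix scrape dewiki`."""
    return ["mix", "scrape", wiki]


def scrape_env(username: str, password: str) -> dict[str, str]:
    """Environment for the scrape task, with Enterprise credentials
    injected. ENTERPRISE_USERNAME / ENTERPRISE_PASSWORD are assumed
    names; use whatever the scraper actually reads.
    """
    env = dict(os.environ)
    env["ENTERPRISE_USERNAME"] = username
    env["ENTERPRISE_PASSWORD"] = password
    return env
```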
- [ ] Write an Airflow sensor which detects that the page summaries are completely imported into Hive.
  - Wait for the hourly partition on the `event.mediawiki_wmde_page_summary` table for the hour *after* the scraper has completed successfully. (Or is there a more failsafe way to trigger the sensor immediately after the scraper completes?)
- At this point the page summary data is guaranteed to be complete in Hive.
  - Expect roughly a 3-hour lag from emitting the page summary events to EventGate until they are finally persisted in Hive, via Kafka and Gobblin.
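The partition to wait for can be computed from the scraper's completion time. A sketch, assuming the usual `year/month/day/hour` partition layout of event tables in Hive:

```python
from datetime import datetime, timedelta


def partition_spec_after(completed_at: datetime) -> str:
    """Hive partition spec for the hour *after* the scraper finished.

    Once this partition lands, every event emitted during the scrape is
    guaranteed to be in Hive, so a partition sensor can wait on it.
    """
    t = completed_at.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return f"year={t.year}/month={t.month}/day={t.day}/hour={t.hour}"
```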
- [x] Recreate the "[[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md#wiki-summary | wiki summary ]]" metrics that were removed in [[ https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/commit/8ba951e6644835d11161d4200ce06bd153e71a21 | this patch ]].
- In review: https://gitlab.wikimedia.org/repos/wmde/analytics/-/merge_requests/20
- [ ] Call the aggregation Spark SQL scripts from Airflow.
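The aggregation is essentially a `GROUP BY wiki` over the per-page summaries. A plain-Python sketch of that shape; the field names here are hypothetical, and the real metric definitions live in the scraper's metrics.md:

```python
from collections import defaultdict


def aggregate_per_wiki(page_summaries: list[dict]) -> dict[str, dict]:
    """Roll per-page summary rows up into one row per wiki, the same
    shape a GROUP BY wiki in Spark SQL would produce. `ref_count` is an
    illustrative field, not necessarily a real column.
    """
    per_wiki: dict[str, dict] = defaultdict(lambda: {"pages": 0, "total_refs": 0})
    for row in page_summaries:
        agg = per_wiki[row["wiki"]]
        agg["pages"] += 1
        agg["total_refs"] += row.get("ref_count", 0)
    return dict(per_wiki)
```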
- [ ] Create the new aggregation tables in production Hive.
  - [ ] Set HDFS permissions:
```
sudo -u analytics-wmde kerberos-run-command analytics-wmde \
  hdfs dfs -chown -R analytics-wmde:analytics-privatedata-users <table path>
```
- [ ] Enable the job in production.
== Resources
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow
- https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide
- https://airflow.apache.org/docs/