Making the scraper results queryable from SQL is a huge win for analytics; for example, the outputs can be plugged into a Superset dashboard.
Currently, the outputs are only available as .ndjson and .csv files under https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs/, but these can only be analyzed after importing them elsewhere or with tools such as jq.
- Plan the Hive schema and map data types.
- The page summary schema should match the internal analysis: discrete fields and native data types, not a JSON blob.
- Maybe truncate long lists, e.g. of templates used on the page? (product question, not blocking)
- Note: map<string, int> is troublesome because an empty map literal always defaults to map<string, string>.
- Upstream protocols must preserve this structure.
- Write event schema: https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary/-/merge_requests/100
- Decide on database and namespace.
- Proposal: wmde.page_summary, wmde.wiki_summary
- EventGate: wmde_page_analysis
- The scraper persists its outputs in Hive.
- Refactor pipeline.ex so that page summarization is a stream from an input (Enterprise dump or local file) to an output (Hive exporter, local file, ...).
- Add an output module that exports to EventGate. Are batch inserts possible?
- Provide example HiveQL queries.
- Write Airflow or dbt jobs. See https://gitlab.wikimedia.org/repos/data-engineering/dbt-jobs
- Include sensors and alerts, e.g. "wiki snapshot is completely analyzed", "monthly snapshot never arrived".
- Refine EventGate data into independent page_summary and wiki_summary tables.
- Keep 13 months of snapshots so we can analyze month-on-month.
- Take over per-wiki aggregation (then remove the aggregation code from the scraper).
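The example HiveQL queries mentioned above could include a month-on-month comparison, which the 13-month snapshot retention is meant to enable. A sketch, assuming illustrative table, column, and snapshot names that are not yet final:

```sql
-- Month-on-month change in total reference counts per wiki
-- (table name, columns, and snapshot values are illustrative assumptions).
SELECT cur.wiki,
       SUM(cur.ref_count) - SUM(prev.ref_count) AS ref_count_delta
FROM wmde.page_summary cur
JOIN wmde.page_summary prev
  ON cur.wiki = prev.wiki
 AND cur.page_id = prev.page_id
WHERE cur.snapshot = '2024-05'
  AND prev.snapshot = '2024-04'
GROUP BY cur.wiki;
```

Queries like this only join pages present in both snapshots; a LEFT JOIN variant would also surface newly created pages.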
Schema
The per-page table will hold all analysis outputs, mapped to Hive-native data types. The resulting analytics table will be partitioned by snapshot date and by wiki.
The event schema is here: https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary/-/merge_requests/100/diffs#9bbb8dc587021608f5b2a2a221211594a6a57591 .
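As a rough sketch of the per-page table described above (column names are illustrative assumptions, not the final schema), the DDL could look like:

```sql
-- Illustrative sketch only: column names and types are assumptions.
CREATE TABLE wmde.page_summary (
    page_id     BIGINT,
    page_title  STRING,
    ref_count   INT,
    -- Caution (see the note above): in Hive, an empty map() literal
    -- defaults to map<string,string>, so writers must take care when
    -- producing empty values for map<string,int> columns.
    templates   MAP<STRING, INT>
)
PARTITIONED BY (snapshot STRING, wiki STRING)
STORED AS PARQUET;
```

Partitioning by snapshot and wiki keeps monthly loads append-only and lets retention be enforced by dropping old snapshot partitions.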
Out of scope
There are no known external consumers of the published analytics files, so there is no obligation to keep producing the old file format; the published files should be deleted.