
Persist scraper outputs to Hive
Open, Needs Triage, Public

Description

Making the scraper results queryable from SQL would be a huge win for analytics; for example, the outputs could be plugged into a Superset dashboard.

Currently, the outputs exist only as .ndjson and .csv files under https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs/, which can only be analyzed after importing them elsewhere or with command-line tools such as jq.
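To illustrate the current workflow, here is a minimal sketch of analyzing .ndjson output by hand. The field names (`wiki`, `page_id`, `ref_count`) are assumptions for illustration, not the actual scraper schema:

```python
import json
from io import StringIO

# Hypothetical sample lines standing in for a downloaded .ndjson file;
# the real files live under the published datasets URL above.
sample = StringIO(
    '{"wiki": "dewiki", "page_id": 1, "ref_count": 3}\n'
    '{"wiki": "dewiki", "page_id": 2, "ref_count": 0}\n'
    '{"wiki": "enwiki", "page_id": 9, "ref_count": 5}\n'
)

# Aggregate reference counts per wiki, one JSON object per line.
totals = {}
for line in sample:
    row = json.loads(line)
    totals[row["wiki"]] = totals.get(row["wiki"], 0) + row["ref_count"]

print(totals)  # {'dewiki': 3, 'enwiki': 5}
```

Every such ad-hoc script is exactly the kind of one-off work a Hive table would replace with a single SQL query.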

  • Plan Hive schema, map data types.
  • Decide on database and namespace.
    • Proposal: wmde.page_summary, wmde.wiki_summary
    • EventGate: wmde_page_analysis
  • Scraper persists outputs in Hive.
  • Provide example HiveQL queries.
  • Write Airflow or dbt jobs. See https://gitlab.wikimedia.org/repos/data-engineering/dbt-jobs
    • Include sensors and alerts, e.g. "wiki snapshot is completely analyzed", "monthly snapshot never arrived".
    • Refine EventGate data into independent page_summary and wiki_summary tables.
    • Keep 13 months of snapshots so we can analyze month-on-month.
    • Take over per-wiki aggregation (then remove aggregation code from scraper).
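As a starting point for the "example HiveQL queries" item, a month-on-month comparison might look like the sketch below. The table name follows the proposal above; the `ref_count` column and snapshot format are assumptions:

```sql
-- Hypothetical month-on-month reference totals per wiki.
-- Column names are illustrative, not the final schema.
SELECT snapshot,
       wiki,
       SUM(ref_count) AS total_refs
FROM wmde.page_summary
WHERE snapshot >= '2025-01'
GROUP BY snapshot, wiki
ORDER BY wiki, snapshot;
```

With 13 months of snapshots retained, the same query covers a full year-over-year window.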

Schema

The per-page table will hold all analysis outputs, mapped into native Hive data types. The analytics table will be partitioned by snapshot date and by wiki.

Event schema is here: https://gitlab.wikimedia.org/repos/data-engineering/schemas-event-secondary/-/merge_requests/100/diffs#9bbb8dc587021608f5b2a2a221211594a6a57591 .

Out of scope

There are no known external consumers of the published analytics files, so there is no obligation to keep producing the old file format; the published files should be deleted.

Event Timeline

A new direction: we'll try sending the rows to EventGate and use an Airflow or dbt job to transform the events into a smaller table for dashboards.
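The refinement step from events into the dashboard table could be a single INSERT OVERWRITE per partition, roughly as below. The `wmde_page_analysis` stream name comes from the proposal above; the `event` database, the `${...}` job parameters, and the column list are assumptions about how the Airflow or dbt job would be wired up:

```sql
-- Sketch of the per-partition refinement step (names are assumptions).
INSERT OVERWRITE TABLE wmde.page_summary
PARTITION (snapshot = '${snapshot}', wiki = '${wiki}')
SELECT page_id,
       page_title,
       ref_count
FROM event.wmde_page_analysis
WHERE snapshot = '${snapshot}'
  AND wiki = '${wiki}';
```

Running this per (snapshot, wiki) pair keeps the job idempotent: re-running a failed partition simply overwrites it.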

awight removed awight as the assignee of this task.Dec 3 2025, 5:06 PM