Basic aggregation from intermediate format
Closed, Resolved (Public)

Description

Write a simple aggregator that takes the summarized output from parse_wiki.exs and produces some simple statistics. We will refine these later; the point here is just to set up an initial framework that we can build on.

Suggested aggregations:

  • Average ref_count per page.
  • Average transclusion_count per page.
  • Average ref_by_transclusion per page.
  • Union of all unique potential_ref_transclusions.

Output format could be a new CSV file or a formatted report. Decision: write one JSON line for each wiki so we can combine them later.
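
A minimal sketch of the suggested aggregations and the one-JSON-line-per-wiki output, written in Python purely for illustration (the actual project is Elixir, and this is not the merged implementation). It assumes each page arrives as a record carrying the four fields named above, with potential_ref_transclusions already parsed into a list of template names. The function names, the output key names, the extra page_count field, and the sample data are all hypothetical.

```python
import json


def aggregate(rows, wiki):
    """Summarise per-page records into one dict for a single wiki.

    Assumed (hypothetical) record shape: a dict with numeric ref_count,
    transclusion_count and ref_by_transclusion, plus a list of template
    names under potential_ref_transclusions.
    """
    rows = list(rows)
    page_count = len(rows)

    def average(field):
        # An empty input has no meaningful average, so report None
        # instead of dividing by zero.
        if page_count == 0:
            return None
        return sum(row[field] for row in rows) / page_count

    # Union of all unique template names seen on any page.
    templates = set()
    for row in rows:
        templates.update(row["potential_ref_transclusions"])

    return {
        "wiki": wiki,
        "page_count": page_count,
        "avg_ref_count": average("ref_count"),
        "avg_transclusion_count": average("transclusion_count"),
        "avg_ref_by_transclusion": average("ref_by_transclusion"),
        # Sorted so the output is stable between runs.
        "potential_ref_transclusions": sorted(templates),
    }


def to_json_line(summary):
    """Serialise one wiki's summary as a single line of JSON."""
    return json.dumps(summary, ensure_ascii=False)


if __name__ == "__main__":
    # Made-up sample pages, for illustration only.
    sample = [
        {"ref_count": 2, "transclusion_count": 10, "ref_by_transclusion": 1,
         "potential_ref_transclusions": ["Cite web", "Reflist"]},
        {"ref_count": 4, "transclusion_count": 20, "ref_by_transclusion": 0,
         "potential_ref_transclusions": ["Reflist"]},
    ]
    print(to_json_line(aggregate(sample, "examplewiki")))
```

Because each wiki produces exactly one line, combining results later is a matter of concatenating the per-wiki files and reading them back line by line.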

Review: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/7