
Aggregate all per-page fields in some way
Closed, Resolved, Public

Description

In T332053 we aggregated most of the basic fields, but a few remain. This task is finished when the fields listed below have an aggregation.

Not all statistics are easily summarized. For example, "unique name" can be aggregated as "average number of unique names on each page" or as "average of unique name count as a proportion of refs_with_names", among many possibilities. For this task, pick one or two simple aggregations for each field and make notes about future aggregations to investigate.
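To illustrate how the two readings of "unique name" differ, here is a minimal Python sketch. It is not taken from the code under review: the record layout, the sample values, and both function names are made up for illustration, and it assumes the "refs_with_names" denominator corresponds to the ref_with_name_count field.

```python
from statistics import mean

# Hypothetical per-page records; field names follow this task, values are invented.
pages = [
    {"ref_with_name_count": 4, "unique_name_count": 2},
    {"ref_with_name_count": 10, "unique_name_count": 1},
    {"ref_with_name_count": 0, "unique_name_count": 0},
]


def average_unique_names(pages):
    """Average number of unique names per page, counting every page."""
    return mean(p["unique_name_count"] for p in pages)


def average_unique_name_proportion(pages):
    """Average of unique_name_count / ref_with_name_count per page.

    Pages without named refs are skipped (an assumption), because the
    ratio is undefined for them. Returns None if no page qualifies.
    """
    ratios = [
        p["unique_name_count"] / p["ref_with_name_count"]
        for p in pages
        if p["ref_with_name_count"] > 0
    ]
    return mean(ratios) if ratios else None
```

The two functions answer different questions: the first is dominated by pages with many names, while the second weights every page with named refs equally. Whether pages without named refs should be skipped or counted as zero is a choice worth noting next to each aggregation.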

Fields to aggregate:

  • reflist_count
    • Proportion of pages with exactly one reflist
    • Proportion of pages with more than one reflist?
    • Average number of reflists, when non-zero.
  • ref_with_name_count
    • Average proportion of refs with name.
    • Proportion of pages with named refs.
  • ref_reuse_counts
    • Average maximum reuse count.
    • Proportion of pages with reference reuse.
    • Future: Gini inequality coefficient?
  • unique_name_count
    • Average proportion of names that are unique (note that 50% is the maximum)
  • refs_with_transclusions_count
    • Proportion of refs produced within a transclusion.
  • transclusions_inside_refs
    • Proportion of refs containing a transclusion.
  • ref_count
    • Proportion of pages with no refs
  • automatic_ref_name_count
    • Proportion of pages where automatically-named refs appear.
  • TBD: additional ref statistics should be implemented along with their basic aggregation.
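Several of the aggregations listed above reduce to "proportion of pages matching a condition" or "average over a subset of pages". The following Python sketch shows one possible shape for these, plus a Gini coefficient for the future ref_reuse_counts idea. It is an illustration under assumptions, not the implementation in the merge request: the helper names are hypothetical, the sample values are invented, and ref_reuse_counts is assumed to be a list of per-reference reuse counts on each page.

```python
from statistics import mean

# Hypothetical per-page records; values are invented for illustration.
pages = [
    {"reflist_count": 1, "ref_count": 5, "ref_reuse_counts": [3, 2]},
    {"reflist_count": 2, "ref_count": 8, "ref_reuse_counts": []},
    {"reflist_count": 0, "ref_count": 0, "ref_reuse_counts": []},
    {"reflist_count": 1, "ref_count": 2, "ref_reuse_counts": [2]},
]


def proportion(pages, predicate):
    """Share of pages for which predicate(page) is true; None if there are no pages."""
    if not pages:
        return None
    return sum(1 for p in pages if predicate(p)) / len(pages)


def mean_nonzero(pages, field):
    """Average of a numeric field over pages where it is non-zero; None if no such page."""
    values = [p[field] for p in pages if p[field] != 0]
    return mean(values) if values else None


def average_max_reuse(pages):
    """Average of the largest reuse count per page, treating pages without reuse as 0 (an assumption)."""
    return mean(max(p["ref_reuse_counts"], default=0) for p in pages)


def gini(values):
    """Gini coefficient of non-negative values: 0.0 is perfectly even, (n-1)/n is maximally uneven."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula over the sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


exactly_one_reflist = proportion(pages, lambda p: p["reflist_count"] == 1)
multiple_reflists = proportion(pages, lambda p: p["reflist_count"] > 1)
no_refs = proportion(pages, lambda p: p["ref_count"] == 0)
with_reuse = proportion(pages, lambda p: len(p["ref_reuse_counts"]) > 0)
```

Keeping the predicate separate from the counting makes it cheap to add another "proportion of pages where ..." aggregation later. The choice of denominator (all pages, or only pages with refs) changes the result and should be stated for each aggregation.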

Out of scope:

  • mapdata

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/36