Aggregate some numbers from scraper results
Open, Needs Triage (Public)

Description

Goal

  • Surface metric numbers from scraper data
  • Think about how we support self-serve for accessing scraper data

Steps

Metrics:
Should be retrievable from current scraper results

  • # of duplicate (identical) refs in a given wiki
    • identical_refs_count in column E gives the absolute number of identical refs.
  • # of articles with at least one identical ref
    • pages_with_identical_refs_count in column J
    • proportion_of_pages_with_identical_refs in column AF for this number as a proportion of total pages.
  • # of articles with more than 25 refs and at least one identical ref
  • proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles with <25 refs, split by wiki.
    • Assumption: longer reference lists have more duplicates because refs in long lists are hard to find and manage
  • # of articles without references
    • pages_with_refs_count in column O for the number of pages with at least one ref.
    • proportion_of_pages_with_refs in column AI for this number as a proportion of total pages.
    • The requested metric can be computed as page_count - pages_with_refs_count
  • ratio of references to paragraphs per wiki (TBD: Can we even do that without a code change to the scraper and a re-run?)
    • wikitext_length_average in column C is a good proxy for paragraph count.
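
For the "articles without references" metric, the subtraction described above can be sketched as follows. This is only a minimal sketch: it assumes one aggregate row per wiki is available as a dict keyed by column name (for example from a CSV export). `page_count` and `pages_with_refs_count` are the columns named above; the `wiki` key, the helper name, and the sample numbers are made up for illustration.

```python
def pages_without_refs(row):
    """Return (count, proportion) of pages with no refs for one wiki row."""
    page_count = int(row["page_count"])
    with_refs = int(row["pages_with_refs_count"])
    without = page_count - with_refs
    # Guard against an empty wiki to avoid dividing by zero.
    proportion = without / page_count if page_count else 0.0
    return without, proportion


# Made-up sample row; real values would come from the scraper results.
sample_row = {
    "wiki": "examplewiki",
    "page_count": "1000",
    "pages_with_refs_count": "640",
}

count, share = pages_without_refs(sample_row)
print(sample_row["wiki"], count, share)
```

The proportion is equivalent to 1 - proportion_of_pages_with_refs, so it could also be read off column AI directly.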

Code to review

Event Timeline

WMDE-Fisch updated the task description.
WMDE-Fisch subscribed.
WMDE-Fisch renamed this task from Scraper metrics to Aggregate some numbers from Scraper results. Wed, Apr 24, 10:33 AM
awight renamed this task from Aggregate some numbers from Scraper results to Aggregate some numbers from scraper results. Wed, Apr 24, 11:43 AM
awight updated the task description.

TODO: a bit of coding to reprocess existing page summaries to produce the "25 refs or more" statistic.
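
A rough sketch of what that reprocessing could look like, not the actual scraper code. It assumes each page summary can be loaded as a dict with hypothetical fields `wiki`, `ref_count` and `identical_ref_count`; the real summary format may differ. The task mentions both ">25 refs" and "25 refs or more", so the cut-off is a parameter, and treating exactly 25 as "long" is an assumption here.

```python
from collections import defaultdict


def empty_bucket():
    return {"pages": 0, "pages_with_dupes": 0, "refs": 0, "dupes": 0}


def bucket_stats(pages, threshold=25):
    """Aggregate duplicate-ref stats per wiki, split at `threshold` refs."""
    totals = defaultdict(lambda: {"long": empty_bucket(), "short": empty_bucket()})
    for page in pages:
        # Assumption: pages with exactly `threshold` refs count as "long".
        bucket = "long" if page["ref_count"] >= threshold else "short"
        t = totals[page["wiki"]][bucket]
        t["pages"] += 1
        t["refs"] += page["ref_count"]
        t["dupes"] += page["identical_ref_count"]
        if page["identical_ref_count"] > 0:
            t["pages_with_dupes"] += 1
    # Proportion of duplicate refs among all refs in each bucket.
    for buckets in totals.values():
        for t in buckets.values():
            t["dupe_proportion"] = t["dupes"] / t["refs"] if t["refs"] else 0.0
    return dict(totals)


# Made-up page summaries for illustration only.
sample_pages = [
    {"wiki": "examplewiki", "ref_count": 40, "identical_ref_count": 4},
    {"wiki": "examplewiki", "ref_count": 30, "identical_ref_count": 0},
    {"wiki": "examplewiki", "ref_count": 10, "identical_ref_count": 1},
    {"wiki": "examplewiki", "ref_count": 0, "identical_ref_count": 0},
]

stats = bucket_stats(sample_pages)
for wiki, buckets in stats.items():
    for name, t in buckets.items():
        print(wiki, name, t)
```

`pages_with_dupes` in the "long" bucket corresponds to the "# of articles with more than 25 refs and at least one identical ref" metric, and the two `dupe_proportion` values give the long vs. short comparison per wiki.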

proportion of duplicate refs in articles with >25 refs vs. proportion of duplicates in articles <25 refs, split by wiki.

FWIW, the naive calculation for this will be confounded by the increased chance of having an accidental duplicate as there are more refs.
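
To make the confound concrete, here is a toy model, not something taken from the scraper. It assumes every pair of refs on a page is accidentally identical with the same small probability `p_pair`, independently of the other pairs. Both the independence assumption and the value of `p_pair` are made up for illustration.

```python
def p_any_duplicate(n_refs, p_pair):
    """Chance of at least one accidental duplicate among n_refs refs."""
    pairs = n_refs * (n_refs - 1) // 2
    return 1 - (1 - p_pair) ** pairs


def expected_dupe_pairs_per_ref(n_refs, p_pair):
    """Expected number of accidental duplicate pairs, divided by ref count."""
    if n_refs == 0:
        return 0.0
    pairs = n_refs * (n_refs - 1) // 2
    return p_pair * pairs / n_refs


# Hypothetical per-pair probability, chosen only to show the trend.
P_PAIR = 0.001

for n in (5, 25, 100):
    print(n, p_any_duplicate(n, P_PAIR), expected_dupe_pairs_per_ref(n, P_PAIR))
```

Under these assumptions both numbers grow with the number of refs even though the per-pair duplication behaviour is held constant, so a higher duplicate rate in long articles is not by itself evidence that long reference lists are harder to manage.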