Scraper: Add new metrics for sub-ref data
Closed, Resolved, Public

Description

To better understand how sub-refs are used on the wikis we want to add new aggregations to the scraper:

Per-page metrics

Some of these are intermediate values which were not originally included in the task.

  • main_ref_count - count of main refs (excluding subrefs)
  • reflist_item_count - total number of references in the references list ("ref_count - ref_reuse_count", or the count of ref_bodies)
  • ratio_subrefs_to_main_refs - total number of sub-refs divided by the number of main refs
  • ref_reuse_count - Simple count of reference reuses.
  • reflist_subref_item_count - total number of subreferences in the references list (excludes reuses)
  • potential_subref_transclusions - list of templates producing sub-refs
  • subref_error_counts_by_type - Map from each Cite error type found in subref tags to the number of times it occurs on the page.
  • subref_reuse_count - Number of duplicate subreferences, as detected by taking the difference between footnote marker count and reflist item count.
  • subrefs_with_errors_count - Number of subref tags which are marked with an error.
  • transclusions_inside_subrefs - list of templates used in sub-refs
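
The arithmetic behind the derived counts above can be sketched as follows. This is only an illustration, not the scraper's implementation: the function name and the subref_marker_count input are hypothetical, and it assumes that "total number of sub-refs" means the number of sub-ref footnote markers.

```python
from typing import Optional


def derive_subref_metrics(
    ref_count: int,
    ref_reuse_count: int,
    main_ref_count: int,
    subref_marker_count: int,        # hypothetical input: sub-ref footnote markers
    reflist_subref_item_count: int,
) -> dict:
    """Derive the per-page values that are defined in terms of other counts."""
    # reflist_item_count is described as "ref_count - ref_reuse_count".
    reflist_item_count = ref_count - ref_reuse_count

    # subref_reuse_count is the difference between footnote marker count
    # and reflist item count, restricted to sub-refs.
    subref_reuse_count = subref_marker_count - reflist_subref_item_count

    # Assumption: the ratio uses the marker count as "total sub-refs".
    # Pages without main refs get None instead of a division error.
    ratio: Optional[float] = (
        subref_marker_count / main_ref_count if main_ref_count else None
    )

    return {
        "reflist_item_count": reflist_item_count,
        "subref_reuse_count": subref_reuse_count,
        "ratio_subrefs_to_main_refs": ratio,
    }
```

Returning None for pages without main refs is a design choice of this sketch; the task does not say how that case should be reported.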

Hive

  • Add new fields to EventGate export and to draft event schema.

Per-wiki aggregations

NOTE: We will export per-page analysis into Hive and will make the aggregations there. See T410719.
  • total number of articles with sub-references
  • total number of duplicate sub-references (per main ref) per article, per wiki
  • list of articles where sub-refs are produced by templates
  • list of templates used in sub-refs
  • ratio of sub-refs per main ref, for articles having sub-refs
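
As a rough sketch of how these per-wiki values could be derived from per-page records, assuming each record is a dict carrying the per-page metric names listed above plus a hypothetical "title" key. Two further assumptions: total sub-refs per page is taken as list items plus reuses, and the ratio is pooled over all articles that have sub-refs rather than averaged per article.

```python
def aggregate_wiki(pages: list) -> dict:
    """Aggregate per-page sub-ref metrics into per-wiki values."""
    # Only pages with at least one sub-ref list item count as "having sub-refs".
    with_subrefs = [p for p in pages if p["reflist_subref_item_count"] > 0]

    # Assumption: total sub-refs on a page = list items + reuses.
    total_subrefs = sum(
        p["reflist_subref_item_count"] + p["subref_reuse_count"]
        for p in with_subrefs
    )
    total_main_refs = sum(p["main_ref_count"] for p in with_subrefs)

    # Union of all templates seen inside sub-refs across the wiki.
    templates = set()
    for p in pages:
        templates.update(p["transclusions_inside_subrefs"])

    return {
        "articles_with_subrefs": len(with_subrefs),
        "duplicate_subrefs": sum(p["subref_reuse_count"] for p in pages),
        "articles_with_template_subrefs": [
            p["title"] for p in pages if p["potential_subref_transclusions"]
        ],
        "templates_in_subrefs": sorted(templates),
        # Pooled ratio; None when no article has sub-refs or main refs.
        "subrefs_per_main_ref": (
            total_subrefs / total_main_refs if total_main_refs else None
        ),
    }
```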

Out of scope

Here we would need to decide on how we want to define these groups

  • list of sub-ref contents to allow grouping by type (e.g. page number, video/podcast timestamp, etc.)

Implementation notes

  • Create a new test fixture with Parsoid HTML from a subref-containing page, e.g. https://test.wikipedia.org/api/rest_v1/page/html/CiteDetailsTests and write a basic test analyzing it, which we can change as new metrics are introduced.
  • We can identify subrefs by taking the analyzed ref tags and checking for the details attribute in data-mw.
  • Reused subrefs will be compared by text, but will be identifiable by comparing mainRefId once Parsoid supports subref deduplication.
    • Once subref reuse is supported, we'll probably want to count subref list items vs. reuses/total subrefs.
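
The detection and text-based reuse comparison described in these notes might look roughly like the sketch below. It is not the scraper's code. It assumes that each ref's data-mw JSON has "name": "ref" and carries the ref tag's attributes, including details, under an "attrs" key; the class and function names are hypothetical, and keying reuse on the pair (ref name, details text) is also an assumption.

```python
import json
from collections import Counter
from html.parser import HTMLParser


class SubrefCollector(HTMLParser):
    """Collect (ref name, details text) for every ref carrying details."""

    def __init__(self):
        super().__init__()
        self.subrefs = []

    def handle_starttag(self, tag, attrs):
        raw = dict(attrs).get("data-mw")
        if not raw:
            return
        try:
            data_mw = json.loads(raw)
        except json.JSONDecodeError:
            return  # skip elements whose data-mw is not valid JSON
        if not isinstance(data_mw, dict) or data_mw.get("name") != "ref":
            return
        ref_attrs = data_mw.get("attrs")
        # Assumption: the details attribute lives under data-mw "attrs".
        if isinstance(ref_attrs, dict) and "details" in ref_attrs:
            self.subrefs.append((ref_attrs.get("name"), ref_attrs["details"]))


def count_subref_reuses(html: str) -> tuple:
    """Return (total sub-refs, reuses), comparing sub-refs by text."""
    collector = SubrefCollector()
    collector.feed(html)
    collector.close()
    counts = Counter(collector.subrefs)
    # Every occurrence after the first of an identical pair is a reuse.
    reuses = sum(n - 1 for n in counts.values())
    return len(collector.subrefs), reuses
```

Once Parsoid supports subref deduplication, the text comparison here would presumably be replaced by a comparison on mainRefId, as noted above.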

Open questions

  • What is "ratio of sub-refs" measuring, and what underlying question should it answer?
  • Should we measure the number of subrefs per main ref?
  • Should we measure the string length of subrefs?
  • What format to use for the "list of articles"? What is the analytics use case?

Implementation

Code for review:

Event Timeline

awight updated the task description.

Note: This is currently parked because we want to do the aggregations in Superset. T410719: Persist scraper outputs to Hive is the requirement we're currently working on; it will move the scraper's raw data to Hive, where we can then use it in Superset.

WMDE-Fisch added a subscriber: awight.

After discussing this with @awight again, we agree that it makes more sense to do these aggregations in the scraper for now. Hive is more of a mid-term goal.

awight renamed this task from Scraper: Add new aggregations for sub-ref data to Scraper: Add new metrics for sub-ref data. Dec 8 2025, 9:02 AM
awight updated the task description.

Splitting per-page from per-wiki metrics so we can implement the base analysis.

awight updated the task description.
awight updated the task description.