Create baseline statistics for reference usage (2023)
Closed, Resolved, Public

Description

We want a set of metrics and statistics on reference usage so that we can check the impact of the changes we are going to make. See also spreadsheet.

Research questions

  • How impactful might work on citations be? (How important are references to Wikipedia? How prevalent are they?)
    • How many references (ref tags) exist per page on average, per wiki, etc.?
    • How many footnote markers appear in each rendered page?
    • How many footnote bodies render to a Cite error?
    • How many of the ref footnote tags come from a list-defined reference?
    • How many ref footnotes are produced under a template transclusion?
    • How many <references /> sections are rendered on the page?
  • How often are references reused using currently available methods?
    • How many ref tags are a pure reuse, e.g. just a "name" and no body?
    • How many template transclusions were made on the page?
    • How often are the different templates for reusing references invoked? E.g. x% of pages on dewiki use the sfn template; the same question applies to the equivalent citation templates on other wikis.
    • What templates produce refs for each wiki?
    • How often are named references used per wiki on average? E.g. x% of pages on dewiki use named references.
    • How many different ref “name”s are defined?
  • How often might references benefit from being reusable in a more flexible or easier way?
    • How often are references reused using existing methods that could be replaced with a simpler approach? E.g. x% of references on a page on dewiki are a reuse of a reference that already exists on that page.

Implementation: Scrape rendered HTML dumps

We've discovered that the Wikimedia Enterprise HTML dumps give us a reliable way to analyze ref tags in their final rendering. This is challenging with any other data source, since ref tags are often produced by templates and therefore don't appear in the wikitext.

Our proof-of-concept scraper can be found here: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/
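
To illustrate why rendered HTML makes this tractable, here is a minimal Python sketch of counting footnote markers and reference lists in one page of a dump. It is not the linked scraper. It assumes that the rendered HTML marks footnote markers with `typeof="mw:Extension/ref"` and reference lists with `typeof="mw:Extension/references"`, and that each dump line is a JSON object carrying the page title in `name` and the HTML in `article_body.html`. Those markers and field names, and the helper names `count_refs` and `count_refs_in_dump_line`, are assumptions to verify against the actual dump.

```python
import json
from collections import Counter
from html.parser import HTMLParser

# Assumed RDFa type tokens in the rendered HTML; verify against the dump.
REF_MARKER = "mw:Extension/ref"
REF_LIST = "mw:Extension/references"


class RefCounter(HTMLParser):
    """Counts elements whose typeof attribute carries a Cite-related token."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # typeof may hold several space-separated tokens, so match whole tokens
        # rather than substrings ("...ref" is a prefix of "...references").
        tokens = (dict(attrs).get("typeof") or "").split()
        if REF_MARKER in tokens:
            self.counts["ref_markers"] += 1
        if REF_LIST in tokens:
            self.counts["references_lists"] += 1


def count_refs(html):
    """Return footnote-marker and reference-list counts for one HTML page."""
    parser = RefCounter()
    parser.feed(html)
    parser.close()
    return {
        "ref_markers": parser.counts["ref_markers"],
        "references_lists": parser.counts["references_lists"],
    }


def count_refs_in_dump_line(line):
    """Parse one JSON line of a dump (assumed layout) and count its refs."""
    page = json.loads(line)
    html = page.get("article_body", {}).get("html", "")
    return page.get("name"), count_refs(html)
```

Because the counts come from the rendered output, refs produced by templates are included even though they never appear as ref tags in the page's wikitext.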

Outcome

https://docs.google.com/spreadsheets/d/1q71Swzxpf2U4shhSJl8fry-CHg1RauJXBlXN_PHSlk4/edit#gid=0

Event Timeline

lilients_WMDE renamed this task from Scrape rendered HTML dumps to get baseline statistics for reference usage to Create baseline statistics for reference usage. Mar 17 2023, 3:56 PM
lilients_WMDE updated the task description.

We expect to run the job next week, once the upstream blocking task is resolved.

Moving epic into current board to reflect that we're finalizing the first report run.

awight moved this task from Watching / Epic to Review on the WMDE-TechWish-Maintenance-2023 board.
awight added a subscriber: Lena_WMDE.

@Lena_WMDE The output is finally ready for review!

Spreadsheet:

Documentation for columns: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md

A good next step would be to put this CSV into a Google spreadsheet, clean up the column order, and then review how we might tweak the aggregation. Re-running the aggregation is cheap and fun.

I created a spreadsheet: https://docs.google.com/spreadsheets/d/1q71Swzxpf2U4shhSJl8fry-CHg1RauJXBlXN_PHSlk4/edit#gid=0
I will clean it up a bit, so @Lena_WMDE can work with it. :)

awight claimed this task.
awight moved this task from Watching / Epic to Done on the WMDE-TechWish-Maintenance-2023 board.
awight renamed this task from Create baseline statistics for reference usage to Create baseline statistics for reference usage (2023). Feb 9 2024, 1:41 PM