We want to collect a number of metrics and statistics on how references are used, so that we can assess the impact of the changes we are going to make. See also spreadsheet.
- How impactful might work on citations be? (How important are references to Wikipedia? How prevalent are they?)
- How many references (ref tags) exist per page, on average, per wiki?
- How many footnote markers appear in each rendered page?
- How many footnote bodies render to a Cite error?
- How many of the ref footnote tags come from a list-defined reference?
- How many ref footnotes are produced under a template transclusion?
- How many <references /> sections are rendered on the page?
- How often are references reused using current available methods?
- How many ref tags are a pure reuse, e.g. just a "name" and no body?
- How many template transclusions were made on the page?
- How often are the different templates for reusing references invoked? E.g. x% of pages on dewiki use the sfn template (replicating citation templates across different wikis).
- What templates produce refs for each wiki?
- How often are named references used per wiki, on average? E.g. x% of pages on dewiki use named references.
- How many different ref “name”s are defined?
- How often might references benefit from being reusable in a more flexible or easier way?
- How often are references reused using existing methods that could be replaced with a simpler approach? E.g. x% of references on a page on dewiki reuse a reference that already exists on that page.
Implementation: Scrape rendered HTML dumps
We've discovered that the Wikimedia Enterprise HTML dumps give us a reliable way to analyze ref tags in their final rendering. This is challenging with any other data source, since ref tags are often produced by templates and therefore don't appear in the wikitext.
Our proof-of-concept scraper can be found here: https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/
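To illustrate how a few of the metrics above could be read off rendered HTML, here is a minimal Python sketch. It is not taken from the scraper linked above. It assumes Parsoid-style markup, in which footnote markers carry `typeof="mw:Extension/ref"`, reference lists carry `typeof="mw:Extension/references"`, errors add `mw:Error` to `typeof`, and the `data-mw` attribute holds JSON with the ref "name" under `attrs` and an optional `body`. These attribute names are assumptions and should be checked against real dump output; `RefCounter` and `count_refs` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class RefCounter(HTMLParser):
    """Counts ref-related elements in one rendered page (assumed Parsoid markup)."""

    def __init__(self):
        super().__init__()
        self.markers = 0          # footnote markers seen
        self.pure_reuses = 0      # markers with a name and no body
        self.errors = 0           # markers also flagged with mw:Error
        self.reference_lists = 0  # rendered <references /> sections
        self.names = set()        # distinct ref names

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        types = (attr.get("typeof") or "").split()
        if "mw:Extension/references" in types:
            self.reference_lists += 1
        if "mw:Extension/ref" not in types:
            return
        self.markers += 1
        if "mw:Error" in types:
            self.errors += 1
        try:
            # Assumption: data-mw is a JSON object describing the ref.
            data = json.loads(attr.get("data-mw") or "{}")
        except ValueError:
            return
        if not isinstance(data, dict):
            return
        name = (data.get("attrs") or {}).get("name")
        if name:
            self.names.add(name)
            if "body" not in data:
                self.pure_reuses += 1


def count_refs(html):
    """Return a dict of per-page reference counts for one HTML string."""
    counter = RefCounter()
    counter.feed(html)
    counter.close()
    return {
        "markers": counter.markers,
        "pure_reuses": counter.pure_reuses,
        "errors": counter.errors,
        "reference_lists": counter.reference_lists,
        "distinct_names": len(counter.names),
    }
```

Calling `count_refs` on the rendered HTML of each page in a dump and summing the results per wiki would give rough answers to questions such as the average number of footnote markers per page, or the share of ref tags that are a pure reuse. The error count here is only a proxy for footnote bodies that render to a Cite error.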