To better understand how sub-refs are used on the wikis, we want to add new aggregations to the scraper:
Per-page metrics
Some of these are intermediate values which were not originally included in the task.
- main_ref_count - count of main refs (excluding subrefs)
- reflist_item_count - total number of references in the references list ("ref_count - ref_reuse_count", or count of ref_bodies)
- ratio_subrefs_to_main_refs - total number of sub-refs divided by number of main refs
- ref_reuse_count - Simple count of reference reuses.
- reflist_subref_item_count - total number of subreferences in the references list (excludes reuses)
- potential_subref_transclusions - list of templates producing sub-refs
- subref_error_counts_by_type - Map from each Cite error type found in subref tags to the number of times it occurs on the page.
- subref_reuse_count - Number of duplicate subreferences, as detected by taking the difference between footnote marker count and reflist item count.
- subrefs_with_errors_count - Number of subref tags which are marked with an error.
- transclusions_inside_subrefs - list of templates used in sub-refs
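As a rough illustration of how the count-based metrics above relate to each other, here is a minimal Python sketch. It is not the scraper's code: the `Ref` input shape, the `page_metrics` helper and the error names `error_a`/`error_b` are hypothetical, and it assumes that main_ref_count counts non-subref list items only (reuses excluded), which the task does not spell out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """One footnote marker on a page (hypothetical input shape)."""
    is_subref: bool = False
    is_reuse: bool = False
    errors: tuple = ()


def page_metrics(refs):
    """Derive the per-page counts from a list of footnote markers."""
    subrefs = [r for r in refs if r.is_subref]
    ref_reuse_count = sum(1 for r in refs if r.is_reuse)
    # Assumption: a main ref is a non-subref list item, so reuses are excluded.
    main_ref_count = sum(1 for r in refs if not r.is_subref and not r.is_reuse)
    reflist_subref_item_count = sum(1 for r in subrefs if not r.is_reuse)
    error_counts = Counter(e for r in subrefs for e in r.errors)
    return {
        "main_ref_count": main_ref_count,
        # ref_count - ref_reuse_count
        "reflist_item_count": len(refs) - ref_reuse_count,
        "ref_reuse_count": ref_reuse_count,
        "reflist_subref_item_count": reflist_subref_item_count,
        # subref footnote markers minus subref list items
        "subref_reuse_count": len(subrefs) - reflist_subref_item_count,
        "subrefs_with_errors_count": sum(1 for r in subrefs if r.errors),
        "subref_error_counts_by_type": dict(error_counts),
        # None when the page has no main refs, to avoid dividing by zero
        "ratio_subrefs_to_main_refs": (
            len(subrefs) / main_ref_count if main_ref_count else None
        ),
    }


EXAMPLE_REFS = [
    Ref(),                                                  # main ref A
    Ref(is_subref=True),                                    # subref of A
    Ref(is_subref=True, is_reuse=True, errors=("error_a",)),  # reused subref
    Ref(),                                                  # main ref B
    Ref(is_reuse=True),                                     # reuse of B
    Ref(is_subref=True, errors=("error_a", "error_b")),     # subref of B
]
metrics = page_metrics(EXAMPLE_REFS)
```

The two template lists (potential_subref_transclusions and transclusions_inside_subrefs) are left out because they depend on how transclusions are marked up in the HTML.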
Hive
- Add new fields to EventGate export and to draft event schema.
Per-wiki aggregations
- total number of articles with sub-references
- total number of duplicate sub-references (per main ref) per article, per wiki
- list of articles where sub-refs are produced by templates
- list of templates used in sub-refs
- ratio of sub-refs per main ref, for articles having sub-refs
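One possible way to roll the per-page metrics up into these per-wiki aggregations is sketched below in Python. The `wiki` and `title` keys, the `aggregate_per_wiki` helper, the output key names and the template names are all hypothetical. It also assumes the per-wiki ratio is the mean of the per-page ratios over articles having sub-refs; summing sub-refs and main refs across the wiki before dividing would be an equally plausible reading.

```python
from collections import defaultdict
from statistics import mean


def aggregate_per_wiki(pages):
    """Group per-page metric rows by wiki and summarise each group."""
    by_wiki = defaultdict(list)
    for page in pages:
        by_wiki[page["wiki"]].append(page)

    result = {}
    for wiki, rows in by_wiki.items():
        with_subrefs = [r for r in rows if r["reflist_subref_item_count"] > 0]
        ratios = [
            r["ratio_subrefs_to_main_refs"]
            for r in with_subrefs
            if r["ratio_subrefs_to_main_refs"] is not None
        ]
        result[wiki] = {
            "articles_with_subrefs": len(with_subrefs),
            "subref_reuse_total": sum(r["subref_reuse_count"] for r in rows),
            "articles_with_template_subrefs": sorted(
                r["title"] for r in rows if r["potential_subref_transclusions"]
            ),
            "templates_in_subrefs": sorted(
                {t for r in rows for t in r["transclusions_inside_subrefs"]}
            ),
            # Assumption: mean of per-page ratios, only for pages with sub-refs.
            "mean_ratio_subrefs_to_main_refs": mean(ratios) if ratios else None,
        }
    return result


EXAMPLE_PAGES = [
    {"wiki": "testwiki", "title": "A", "reflist_subref_item_count": 2,
     "subref_reuse_count": 1, "ratio_subrefs_to_main_refs": 1.5,
     "potential_subref_transclusions": ["Template:ExampleCite"],
     "transclusions_inside_subrefs": ["Template:ExamplePage"]},
    {"wiki": "testwiki", "title": "B", "reflist_subref_item_count": 0,
     "subref_reuse_count": 0, "ratio_subrefs_to_main_refs": 0.0,
     "potential_subref_transclusions": [],
     "transclusions_inside_subrefs": []},
    {"wiki": "testwiki", "title": "C", "reflist_subref_item_count": 1,
     "subref_reuse_count": 0, "ratio_subrefs_to_main_refs": 0.5,
     "potential_subref_transclusions": [],
     "transclusions_inside_subrefs": ["Template:ExamplePage",
                                      "Template:ExampleTime"]},
]
per_wiki = aggregate_per_wiki(EXAMPLE_PAGES)
```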
Out of scope
- list of sub-ref contents to allow grouping by type (e.g. page number, video/podcast timestamp). We would first need to decide how to define these groups.
Implementation notes
- Create a new test fixture with Parsoid HTML from a subref-containing page, e.g. https://test.wikipedia.org/api/rest_v1/page/html/CiteDetailsTests and write a basic test analyzing it, which we can change as new metrics are introduced.
- We can identify subrefs by taking the analyzed ref tags and checking for the details attribute in data-mw.
- For now, reused subrefs will be detected by comparing their text; once Parsoid supports subref deduplication, they will be identifiable by comparing mainRefId.
- Once subref reuse is supported, we'll probably want to count subref list items vs. reuses/total subrefs.
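The check for the details attribute could look roughly like the Python sketch below, which uses only the standard library. It rests on two assumptions that should be verified against the fixture: that ref markers carry `typeof="mw:Extension/ref"`, and that the tag's attributes, including details, sit under an `attrs` key in the data-mw JSON. The `SubrefFinder` class and the sample HTML are made up for illustration.

```python
import json
from html.parser import HTMLParser


class SubrefFinder(HTMLParser):
    """Collect the details text of every ref tag that has one."""

    def __init__(self):
        super().__init__()
        self.ref_count = 0
        self.subref_details = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        typeof = attr_map.get("typeof") or ""
        # Assumption: Parsoid marks ref markers with this typeof value.
        if "mw:Extension/ref" not in typeof.split():
            return
        self.ref_count += 1
        try:
            data_mw = json.loads(attr_map.get("data-mw") or "{}")
        except json.JSONDecodeError:
            return
        if not isinstance(data_mw, dict):
            return
        # Assumption: the ref tag's own attributes live under "attrs".
        details = (data_mw.get("attrs") or {}).get("details")
        if details is not None:
            self.subref_details.append(details)


SAMPLE_HTML = (
    "<p>Text"
    '<sup typeof="mw:Extension/ref" '
    'data-mw=\'{"name":"ref","attrs":{"name":"book"}}\'>[1]</sup>'
    '<sup typeof="mw:Extension/ref" '
    'data-mw=\'{"name":"ref","attrs":{"name":"book","details":"p. 5"}}\'>'
    "[1.1]</sup>"
    "</p>"
)
finder = SubrefFinder()
finder.feed(SAMPLE_HTML)
```

After feeding the sample, `finder.ref_count` holds the number of ref markers seen and `finder.subref_details` the details text of each subref, which is also what a text-based reuse comparison would work from.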
Open questions
- What is "ratio of sub-refs" measuring? What is the underlying question?
- Should we measure the number of subrefs per main ref?
- Should we measure the string length of subrefs?
- What format to use for the "list of articles"? What is the analytics use case?
Implementation
Code for review: