In T241832, I found that calculating even basic metrics for Cite reference usage is prohibitively expensive. We cannot rely on wikitext dumps, because <ref> tags are often produced by templates, which we cannot expand without a full parse. Further downstream, we could scrape the metrics from rendered HTML, but that is expensive to retrieve.
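To make the wikitext-dump problem concrete, here is a minimal sketch of the naive approach: counting literal <ref> tags with a regular expression. The sample wikitext and the function name are made up for illustration. The sample assumes a template that emits a footnote without a literal <ref> tag, as English Wikipedia's {{sfn}} does.

```python
import re

# Matches an opening or self-closing <ref ...> tag, but not </ref> or <references />.
REF_OPEN = re.compile(r"<ref[\s>/]", re.IGNORECASE)


def count_literal_ref_tags(wikitext: str) -> int:
    """Count <ref> tags that appear literally in the wikitext (hypothetical helper)."""
    return len(REF_OPEN.findall(wikitext))


# Illustrative sample: one literal <ref>, plus one footnote that only exists
# after template expansion (assumed here to come from {{sfn}}).
sample = "Claim one.<ref>Source A</ref> Claim two.{{sfn|Smith|2020}}"
literal_count = count_literal_ref_tags(sample)
```

The regex sees only the one literal tag, while a full parse of this sample would render two footnotes. That gap is the reason a dump-based count is unreliable.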
This task proposes that we gather the metrics directly from the Cite extension during parse. We can expose the data in ParserOutput and send it to EventGate from a hook. Proposed metrics:
- Number of footnote marks rendered in an article. Reporting a count of zero needs special-case code, because the Cite extension would otherwise not hook in on pages without references.
- Number of references rendered in references lists.
- Number of reference groups used in an article.
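As a rough illustration of the three metrics above, the sketch below models the counters and the shape of an event that could be sent per parse. It is Python pseudocode for the data model only, not the Cite extension's implementation. The class, method, and field names are hypothetical, and counting the default (unnamed) group as a group is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class CiteMetrics:
    """Hypothetical per-parse counters; all names here are illustrative."""

    footnote_marks: int = 0
    listed_references: int = 0
    groups: set = field(default_factory=set)

    def on_footnote_mark(self, group: str = "") -> None:
        # Called once per rendered footnote mark, e.g. "[1]" in the article body.
        self.footnote_marks += 1
        self.groups.add(group)

    def on_reference_listed(self, group: str = "") -> None:
        # Called once per entry rendered in a references list.
        self.listed_references += 1
        self.groups.add(group)

    def to_event(self, page_id: int, rev_id: int) -> dict:
        # Assumed payload shape; a real event would follow a registered schema.
        return {
            "page_id": page_id,
            "rev_id": rev_id,
            "footnote_marks": self.footnote_marks,
            "listed_references": self.listed_references,
            "reference_groups": len(self.groups),
        }


# A page with no references still produces an event, with all counts at zero.
empty_event = CiteMetrics().to_event(page_id=1, rev_id=10)

# A reused reference yields two marks but one list entry; "note" is a second group.
metrics = CiteMetrics()
metrics.on_footnote_mark()
metrics.on_footnote_mark()
metrics.on_footnote_mark("note")
metrics.on_reference_listed()
metrics.on_reference_listed("note")
event = metrics.to_event(page_id=2, rev_id=20)
```

Keeping footnote marks and listed references as separate counters lets the data distinguish reused references from unique ones. The all-zero event corresponds to the "zero" references case mentioned in the list.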
Maybe there's an existing data pipeline that we can integrate with?
It's not clear how to handle backfilling, since not all articles will be retrieved or parsed. For now, we can track which articles are missing data and backfill them later.
The resulting data set is probably useful to researchers, and we should find a way to integrate the stats so they're easy to query. Would this belong in a shared feature store?