Instrument Cite to record the number of footnote marks and references list entries rendered in each article
Closed, Declined (Public)

Description

In T241832, I discovered that it's prohibitively expensive to calculate some basic metrics for Cite reference usage. The basic problem is that we cannot rely on wikitext dumps, because <ref> tags are often produced using templates which we cannot expand without a full parse. Further downstream, it would be possible to scrape the metrics from rendered HTML, but that's expensive to retrieve.

This task proposes that we gather the metrics directly from the Cite extension during parse. We can expose the data in ParserOutput, and send it to EventGate from a hook.

Desired metrics:

  • Number of footnote marks rendered in an article. This includes special code for "zero" references, needed because the Cite extension would otherwise not hook in.
  • Number of references rendered in references lists.
  • Number of reference groups used in an article.
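
For comparison, here is a minimal sketch of how the same three numbers could be scraped from rendered HTML, the downstream route mentioned above. It is not the Cite extension's code. It assumes footnote marks render as `<sup class="reference">`, list entries as `<li id="cite_note-...">` inside `<ol class="references">`, and that a non-default group shows up as a `data-mw-group` attribute on the list. The markup details, the class name `CiteCounter`, and the function name `count_cite_usage` are all assumptions made for illustration and should be checked against real output.

```python
from html.parser import HTMLParser


class CiteCounter(HTMLParser):
    """Counts Cite-related elements in rendered article HTML (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.footnote_marks = 0
        self.references = 0
        self.groups = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "sup" and "reference" in classes:
            # Assumption: one <sup class="reference"> per footnote mark.
            self.footnote_marks += 1
        elif tag == "li" and (attrs.get("id") or "").startswith("cite_note-"):
            # Assumption: one <li id="cite_note-..."> per references list entry.
            self.references += 1
        elif tag == "ol" and "references" in classes:
            # Assumption: the group name is exposed as data-mw-group;
            # the default group is recorded as the empty string.
            self.groups.add(attrs.get("data-mw-group") or "")


def count_cite_usage(html):
    """Return the three desired metrics for one rendered article."""
    counter = CiteCounter()
    counter.feed(html)
    counter.close()
    return {
        "footnote_marks": counter.footnote_marks,
        "references": counter.references,
        "reference_groups": len(counter.groups),
    }
```

An article with no footnotes simply yields zeros here, whereas the in-parser approach needs the special "zero" case described in the list above.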

Maybe there's an existing data pipeline that we can integrate with?

Don't know what to do about the backfilling problem, since not all articles will be retrieved or parsed. We can track this absence of data for now, and backfill later.

The resulting data set is probably useful to researchers, and we should find a way to integrate the stats so they're easy to query. Would this belong in a shared feature store?

Event Timeline

Change 562318 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/Cite@master] [WIP] Record Cite usage during parsing.

https://gerrit.wikimedia.org/r/562318

@Miriam This might relate to your citation usage research. Feedback welcomed!

Would this belong in a shared feature store?

This seems very useful. It should include page_id and revision_id as well, so that it can be queried by either.
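
To make the suggestion concrete, here is a rough sketch of what one record might look like and how it could be looked up by either identifier. Apart from page_id and revision_id, every field name is hypothetical, as are the `CiteUsageRecord` and `index_records` names. This is not an existing schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CiteUsageRecord:
    # page_id and revision_id come from the comment above;
    # the remaining field names are hypothetical.
    page_id: int
    revision_id: int
    footnote_marks: int
    references: int
    reference_groups: int


def index_records(records):
    """Build lookups by page (many revisions) and by revision (unique)."""
    by_page = {}
    by_revision = {}
    for record in records:
        by_page.setdefault(record.page_id, []).append(record)
        by_revision[record.revision_id] = record
    return by_page, by_revision
```

Keying by revision_id gives the counts for one exact parse, while keying by page_id gives the history of counts for an article.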

Hi @awight! I believe @tizianopiccardi has worked on something similar for our citation usage project, using the HTML of the articles. Maybe part of the code can be reused, or part of the data used as backfill. Maybe he can help :)

Interesting! I like this idea. This could either be added as new field(s) in the revision-create event, or we could create a new event similar to the page-links-change event.

Hi @awight! I believe @tizianopiccardi has worked on something similar for our citation usage project, using the HTML of the articles. Maybe part of the code can be reused, or part of the data used as backfill. Maybe he can help :)

Wonderful, I thought there was something like that but have not managed to find it yet.

This could either be added as new field(s) in the revision-create event

That sounds ideal, but I don't see any way to hook into ext-EventBus or add custom attributes to the event. I can set extension data on ParserOutput, but it's not clear how to wire it up beyond that.

Hm, yeah, to add this to revision-create you'd have to somehow add this info to the PageContentSaveComplete hook params that the revision-create event is fired by.

Probably a new stream is easier (and maybe preferable).

I'm removing the Research tag as that's the one we use to track our team's tasks in Phabricator. Feel free to ping us with specific questions. :)

As the author, I think this task is overengineering. These statistics can be scraped just as well from HTML, and they belong in a feature store. I don't see any need for real-time streaming of summary numbers; potential cite-bots, for example, would be more interested in specific diffs in <ref> and <references>.