Attribution API MVP: Provide initial-pass base reference count
Closed, Resolved | Public | 3 Estimated Story Points

Description

We need to find a good way to properly count and display reference counts per article; that discussion is ongoing in the parent task, and should continue.

For the moment, we can implement basic functionality with relatively high accuracy (and some outliers) as a temporary measure that is still performant. This should only be done for articles, not image pages.

  • Fetch existing Parser HTML for the given article
  • Use a regular expression to count the number of occurrences of the substring mw:Extension/ref

As pointed out by Subbu, this will give us correct information about 99% of the time; the remaining 1% of false counts would come from tutorial pages that contain that string directly in their content. Those pages are very few and are less likely to be used with the Attribution API, which should be acceptable for a first pass.
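The two steps above can be sketched in Python to show both the count and the false positive Subbu describes. This is a minimal sketch, not the merged PHP implementation: the HTML fragments and the function name `count_ref_markers` are hypothetical. It also assumes the references list carries a longer token such as mw:Extension/references, which a plain substring match would count as well, so the pattern ends with a word boundary.

```python
import re

# Marker named in the task description. The trailing \b is an assumption of this
# sketch: it stops the pattern from also matching a longer token such as
# "mw:Extension/references".
REF_MARKER = re.compile(r"mw:Extension/ref\b")


def count_ref_markers(html: str) -> int:
    """Count marker occurrences, treating the HTML as a plain string."""
    return len(REF_MARKER.findall(html))


# Hypothetical article fragment with two in-line citations.
article_html = (
    '<p>Fact one.<sup typeof="mw:Extension/ref">[1]</sup> '
    'Fact two.<sup typeof="mw:Extension/ref">[2]</sup></p>'
)

# Hypothetical tutorial page that mentions the marker in its visible text.
tutorial_html = article_html + "<p>Citations are tagged mw:Extension/ref.</p>"

# Hypothetical references list wrapper using the longer token.
with_list_html = article_html + '<ol typeof="mw:Extension/references"></ol>'

print(count_ref_markers(article_html))    # 2
print(count_ref_markers(tutorial_html))   # 3 (one false positive from the prose)
print(count_ref_markers(with_list_html))  # 2 (longer token is not counted)
```

Because the string is never parsed, the function cannot tell markup from visible text, which is exactly where the tutorial-page outliers come from.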

Implementation details

  • This substring is available only with the new Parsoid.
  • Count should be returned only for Articles, not for Files
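The article-only rule could be sketched as a simple namespace gate. This assumes "Articles" means the main namespace (ID 0) and "Files" the File namespace (ID 6), which are the standard MediaWiki IDs; the function name and the choice to return `None` for non-articles are hypothetical and may differ from the actual handler.

```python
import re
from typing import Optional

NS_MAIN = 0  # MediaWiki main (article) namespace
NS_FILE = 6  # MediaWiki File namespace

_REF_MARKER = re.compile(r"mw:Extension/ref\b")


def reference_count_or_none(namespace_id: int, parser_html: str) -> Optional[int]:
    """Return a count for articles only; None means the field is omitted."""
    if namespace_id != NS_MAIN:
        # Assumption: anything outside the main namespace gets no count at all.
        return None
    return len(_REF_MARKER.findall(parser_html))


sample_html = '<sup typeof="mw:Extension/ref">[1]</sup>'

print(reference_count_or_none(NS_MAIN, sample_html))  # 1
print(reference_count_or_none(NS_FILE, sample_html))  # None
```

Returning `None` rather than `0` keeps "no references" distinguishable from "not applicable to this page type".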

Event Timeline

We talked about this a bit with @AGhirelli-WMF, and our first idea was to put the ref count directly in the CommonsMetadata repository so everyone could start using it. On reflection, though, that would only enable this value for Files stored on Commons; it wouldn't be provided when querying articles.

Therefore we decided that the best place to implement this feature would be the WikimediaCustomizations repo.

pmiazga updated the task description.

We do not want reference count to be included on media files at all, as a reminder. It's the last bullet on the ticket, and is documented in the expected fields here: https://docs.google.com/spreadsheets/d/1-Ww-gfmS-HXd6ozux6Qv_ioOBlLICj8Oa4TWDlkyxHQ/edit?gid=508518853#gid=508518853

Change #1248618 had a related patch set uploaded (by Aghirelli; author: Aghirelli):

[mediawiki/extensions/WikimediaCustomizations@master] AttributionRestHandler: add reference_count to trust_and_relevance

https://gerrit.wikimedia.org/r/1248618

FYI looks like you all are counting the number of in-line citations on the page and not the number of unique sources as indicated in the description for T417669.

Thanks for flagging, @Isaac

Is it easy for us to dedupe unique sources once we have the list?

This actually uncovered some inconsistency/potential confusion -- @Sarai-WMF , would you mind confirming your expectation here? The framework itself uses both "references" and "sources", which would be counted differently. I assumed the "number of sources" as being the unique count, vs multiple references which may come from the same source.

Specifically from the framework:

References are one of the pillars of Wikipedia’s reliability, helping ensure that facts and data can be independently verified by readers.

Reference count is a Wikipedia-specific signal. It indicates the number of sources supporting a Wikipedia article (e.g., “23 sources”) at the time of citation or reuse.

EDIT: One other option for MVP here: We return both; we could start with reference count, then add source count later as an additional object property (instead of just a single int).
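One way to read the "return both" option is sketched below. The thread only confirms that `reference_count` is added to `trust_and_relevance`; the `source_count` key, the class name, and the numbers are hypothetical and purely illustrative.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ReferenceSignals:
    reference_count: int
    source_count: Optional[int] = None  # hypothetical later addition


def to_trust_and_relevance(signals: ReferenceSignals) -> Dict[str, Any]:
    """Build the payload, omitting source_count until it is available."""
    payload: Dict[str, Any] = {"reference_count": signals.reference_count}
    if signals.source_count is not None:
        payload["source_count"] = signals.source_count
    return {"trust_and_relevance": payload}


# MVP shape: a single count.
print(to_trust_and_relevance(ReferenceSignals(reference_count=23)))
# Possible later shape: both counts side by side (illustrative numbers).
print(to_trust_and_relevance(ReferenceSignals(reference_count=23, source_count=17)))
```

Adding a sibling key keeps existing consumers of `reference_count` working, whereas turning the integer into an object would change its type.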

Yes, but not in the context of a regex, I think. For my Python library, we count <li> elements whose ID starts with cite_note- (code), but that again is a feature that depends on parsing the HTML (and not just treating it like a string of characters). Maybe a simpler version is to just regex on id="cite_note- but you'd have to do some testing on that I think to make sure it's accurate.

I know this is an MVP so I'll say this once and then drop it, but ideally you all aren't computing this yourself and are instead pushing for something like what @Ottomata described in T417669#11633128: a standardized place where this data is computed and stored for quick lookups, as there are a number of other use cases that could benefit from things like reference counts. If you haven't, I'd also talk with the Connection team who have the exact same challenge of getting reference counts for articles via PHP code: T407026.
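The parsing approach described here (count <li> elements whose ID starts with cite_note-) can be sketched with Python's standard-library `html.parser`. This is not the code of the library mentioned above; the class name, function name, and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class CiteNoteCounter(HTMLParser):
    """Counts <li> start tags whose id attribute starts with 'cite_note-'."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "li":
            return
        element_id = dict(attrs).get("id") or ""
        if element_id.startswith("cite_note-"):
            self.count += 1


def count_references(html: str) -> int:
    parser = CiteNoteCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical fragment: two references, one decoy element, one decoy in text.
sample_html = (
    '<ol><li id="cite_note-smith-1">Smith 2020.</li>'
    '<li id="cite_note-2">Jones 2021.</li></ol>'
    '<span id="cite_note-decoy">not a list item</span>'
    '<p>The text id=&quot;cite_note-x&quot; is prose, not markup.</p>'
)

print(count_references(sample_html))  # 2
```

Unlike a string match, the parser only looks at real attributes on real <li> tags, at the cost of walking the whole document.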

Yeah, it won't be possible to get unique references from the regular expression search; we'll need to do something a lot heavier and parse the HTML (which is not performant on every request, so we'll need to find a more performant way to do that, potentially in the save operation if needed).
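The "compute in the save operation" idea could look roughly like the sketch below: do the expensive work once per revision and serve requests from a lookup. Everything here is hypothetical (the class, the `on_save` hook name, the in-memory dict), and the regex merely stands in for whatever heavier parsing would actually be used.

```python
import re
from typing import Dict, Optional

# Stand-in for a heavier HTML-parsing step.
_CITE_NOTE = re.compile(r'\sid="cite_note-')


class ReferenceCountStore:
    """Computes the count once per revision and serves later reads from memory."""

    def __init__(self) -> None:
        self._counts: Dict[int, int] = {}
        self.computations = 0  # how many times the expensive step ran

    def on_save(self, revision_id: int, html: str) -> None:
        # Hypothetical save hook: the only place the HTML is inspected.
        self._counts[revision_id] = len(_CITE_NOTE.findall(html))
        self.computations += 1

    def lookup(self, revision_id: int) -> Optional[int]:
        # Request path: a dictionary read, no parsing.
        return self._counts.get(revision_id)


store = ReferenceCountStore()
store.on_save(1001, '<li id="cite_note-1">A</li><li id="cite_note-2">B</li>')

print(store.lookup(1001), store.lookup(1001), store.computations)  # 2 2 1
print(store.lookup(9999))  # None (revision never saved through the hook)
```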

Maybe a simpler version is to just regex on id="cite_note- but you'd have to do some testing on that I think to make sure it's accurate.

That's actually a good idea for at least a "first pass approximation for the MVP", it won't give us unique *sources* but it *should* cover cases where the reference appears more than once in a page (which happens fairly often). @AGhirelli-WMF -- you should test the result of *that* regular expression, it might be better.

I'd also talk with the Connection team who have the exact same challenge of getting reference counts for articles via PHP code: T407026.

We should definitely chat about this and see if we can have something for the future for both use cases here. If we need more accurate reference count for both cases, it sounds like we should align and see if we can plan/find a solution that is unified.

Just to clarify, though -- that is outside the scope of this *immediate* and specific ticket. As in, @AGhirelli-WMF don't worry about this for the moment :)

Confirming that we're already talking to the DPE team about a longer term solution, too. I can reach out to Connection to include them on the talks and see if it's reusable.

For what it's worth, I think there are actually three things you could be referring to, but there is no consistent vocabulary, so you're not the first to get tripped up:

  • in-line citations -- e.g., [1]. What's tagged with mw:Extension/ref.
  • reference -- i.e. the unique source behind [1]. There may be multiple in-line citations per reference. This is what I'm referring to above.
  • general sources (?) -- i.e. you can have a reference in an article with zero in-line citations. It won't be caught by the logic I mentioned above of cite_note-.

And to make it more confusing, some references will have a <cite> tag when they're generated with a proper citation template but not all do. en:Anarchy is a favorite example of mine for these things. Currently at 82 references and 91 in-line citations and a bunch more general references in the Bibliography and Further Reading sections.
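The difference between the first two categories, and why the third is missed, can be seen on a small made-up page. The HTML below is hypothetical and much simpler than real parser output; the two patterns are the ones discussed in this thread.

```python
import re

INLINE_CITATION = re.compile(r"mw:Extension/ref\b")  # what [1], [2], ... are tagged with
REFERENCE = re.compile(r'id="cite_note-')            # entries in the references list

# Hypothetical page: reference 1 is cited twice, reference 2 once, and one
# general source appears under "Further reading" with no in-line citation.
page_html = (
    '<p>A.<sup typeof="mw:Extension/ref" id="cite_ref-smith_1-0">[1]</sup> '
    'B.<sup typeof="mw:Extension/ref" id="cite_ref-2">[2]</sup> '
    'C.<sup typeof="mw:Extension/ref" id="cite_ref-smith_1-1">[1]</sup></p>'
    '<ol><li id="cite_note-smith-1">Smith 2020.</li>'
    '<li id="cite_note-2">Jones 2021.</li></ol>'
    "<h2>Further reading</h2><ul><li>Doe 2019.</li></ul>"
)

inline_count = len(INLINE_CITATION.findall(page_html))
reference_count = len(REFERENCE.findall(page_html))

print(inline_count)     # 3 in-line citations
print(reference_count)  # 2 references; "Doe 2019" is counted by neither pattern
```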

That's actually a good idea for at least a "first pass approximation for the MVP", it won't give us unique *sources* but it *should* cover cases where the reference appears more than once in a page (which happens fairly often).

I think in theory the id="cite_note- should only appear in references (bullet point 2) and work across any language but I'd double-check e.g., what happens if the citation name has an apostrophe in it or something like that that could trip up the regex.
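A quick way to probe the apostrophe concern is to run the pattern against a few hand-made variants. The samples below are hypothetical and have not been checked against real parser output; the pattern also tolerates single-quoted attributes, which is an extra assumption of this sketch.

```python
import re

# Requires whitespace before "id" so attributes like data-id are not matched,
# and accepts either quote character around the value.
CITE_NOTE_ID = re.compile(r"""\sid=["']cite_note-""")

samples = """
<li id="cite_note-O'Brien-1">raw apostrophe in the name</li>
<li id="cite_note-O&#39;Brien-2">entity-encoded apostrophe</li>
<li id='cite_note-plain-3'>single-quoted attribute</li>
<li data-id="cite_note-decoy">different attribute, not counted</li>
"""

match_count = len(CITE_NOTE_ID.findall(samples))
print(match_count)  # 3
```

Since the pattern only anchors on the prefix of the ID, whatever characters follow cite_note- in the name do not affect the match.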

If we need more accurate reference count for both cases, it sounds like we should align and see if we can plan/find a solution that is unified.
Confirming that we're already talking to the DPE team about a longer term solution, too.

Thanks!

Sorry, I've only just seen the thread.
I've just implemented the solution you are proposing, @Isaac, and the results are better than the original one. I'm going to push the new change to the patch right now.

Thank you all for all the thoughts and your support!

This actually uncovered some inconsistency/potential confusion -- @Sarai-WMF , would you mind confirming your expectation here? The framework itself uses both "references" and "sources", which would be counted differently. I assumed the "number of sources" as being the unique count, vs multiple references which may come from the same source.

Specifically from the framework:

References are one of the pillars of Wikipedia’s reliability, helping ensure that facts and data can be independently verified by readers.
Reference count is a Wikipedia-specific signal. It indicates the number of sources supporting a Wikipedia article (e.g., “23 sources”) at the time of citation or reuse.

EDIT: One other option for MVP here: We return both; we could start with reference count, then add source count later as an additional object property (instead of just a single int).

Hey @HCoplin-WMF! I'm really sorry I missed your ping. In the context of the Attribution framework, we are indeed using the terms "sources" and "references" interchangeably. Our intention was to simplify comprehension for external audiences, but I acknowledge the terms aren't really identical in our context, and thus this can create confusion. My apologies for that. Ideally, we'd like to provide a count of unique sources behind citations, but that data would not be easily verifiable by users. So, to base the signal on the information explicitly provided on-wiki and keep data easy to map, we'd like the count to match the number of references, as expressed in the References section of an article. Probably unnecessary image for reference:

Screenshot 2026-03-12 at 12.44.11.png (1×2 px, 831 KB)

I see how the quoted definition (and that signal's documentation in the attribution framework) needs to be updated to improve specificity. We'll update the content asap.

HCoplin-WMF set the point value for this task to 3.

Change #1251280 had a related patch set uploaded (by Aghirelli; author: Aghirelli):

[mediawiki/extensions/WikimediaCustomizations@master] Attribution: update copy for author and license i18n properties

https://gerrit.wikimedia.org/r/1251280

Change #1251280 merged by jenkins-bot:

[mediawiki/extensions/WikimediaCustomizations@master] Attribution: update copy for author and license i18n properties

https://gerrit.wikimedia.org/r/1251280

Change #1248618 merged by jenkins-bot:

[mediawiki/extensions/WikimediaCustomizations@master] AttributionRestHandler: add reference_count to trust_and_relevance

https://gerrit.wikimedia.org/r/1248618

Closing out MW-Interfaces-Team (MWI-Sprint-29 (2026-03-10 to 2026-03-24)); marking everything from that sprint as resolved.

https://phabricator.wikimedia.org/project/board/8573/