
[5.3.3 Epic] Provide reference count information within Attribution endpoint.
Open, Needs Triage, Public

Description

A well-cited page that draws on a diverse set of sources unsurprisingly raises trust with readers. The purpose of this signal is to give reusers the opportunity to surface the number of citations to build confidence in the content. The listening tour conducted by the design team earlier in 5.3 confirmed this as one of the most persuasive signals for readers to build trust in Wikimedia content.

Conditions of acceptance
  • Create a technical design document that proposes at least one approach for tallying references within an article.
  • Verify the design approach with MWI, MWP, and Amir S.
  • Implement the recommended method for counting the number of references per page.
    • The reference count should reflect the number of unique sources on the page; if the same source is used for multiple references, count it only once.
    • Reference count is not returned on media files; it is only used for articles.
  • Add a new field to the "trust_and_relevance" object returned by the signals endpoint to return this value: "reference_count": "string"
  • [Stretch] Audit existing English Wikipedia pages for the distribution of references. Knowing the average, max, and distribution of the number of references will help design determine appropriate thresholds for returning opaque strings instead of explicit values.
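As a starting point for the design document, the "count each source only once" condition above can be sketched as follows. This is a minimal sketch, not the agreed approach: it assumes the input is raw wikitext using Cite `<ref>` tags, treats two references as the same source when their normalized text matches or when one reuses the other's `name`, and ignores references generated by templates. The function name `count_unique_references` is hypothetical.

```python
import re

# Matches <ref ...>body</ref> and self-closing <ref ... /> (but not <references />).
REF_PATTERN = re.compile(
    r"<ref\b(?P<attrs>[^>]*?)(?:/>|>(?P<body>.*?)</ref\s*>)",
    re.IGNORECASE | re.DOTALL,
)
# Extracts the value of the name attribute (double-quoted, single-quoted, or bare).
NAME_PATTERN = re.compile(
    r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""",
    re.IGNORECASE,
)


def _ref_name(attrs: str) -> str:
    """Return the ref's name attribute, or an empty string if it has none."""
    match = NAME_PATTERN.search(attrs or "")
    if not match:
        return ""
    return next(g for g in match.groups() if g is not None).strip()


def count_unique_references(wikitext: str) -> int:
    """Count distinct sources in wikitext (assumption: same text == same source)."""
    bodies = set()              # normalized reference texts seen so far
    named_with_body = set()     # names that are defined somewhere with content
    named_without_body = set()  # names only seen as reuse, e.g. <ref name="x" />

    for match in REF_PATTERN.finditer(wikitext):
        name = _ref_name(match.group("attrs"))
        body = match.group("body") or ""
        normalized = " ".join(body.split()).lower()
        if normalized:
            bodies.add(normalized)
            if name:
                named_with_body.add(name)
        elif name:
            named_without_body.add(name)

    # A reused name whose definition was never found still counts as one source.
    dangling = named_without_body - named_with_body
    return len(bodies) + len(dangling)
```

What counts as a "unique source" (same URL, same citation text, same work at different pages) is still a design decision; the text-equality rule here is only the simplest option to reason about.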
Implementation details

For now, return the actual reference count as a string. We use a string because the logic may later return alternative messages at certain count thresholds (for example, returning "5+" or "5 or more" when there are 5 or more but fewer than 10 references).
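The string-valued field could be produced along these lines. This is a sketch under assumptions: the helper name `format_reference_count` and the example thresholds `(5, 10)` are hypothetical, since design has not confirmed any thresholds. With no thresholds configured it returns the exact count as a string, which matches the "for now" behaviour described above.

```python
from typing import Sequence


def format_reference_count(count: int, thresholds: Sequence[int] = ()) -> str:
    """Return the count as a string, or "<threshold>+" once a threshold is reached.

    thresholds is a hypothetical configuration value; an empty sequence
    means the exact count is always returned.
    """
    if count < 0:
        raise ValueError("reference count cannot be negative")
    reached = [t for t in thresholds if t <= count]
    if not reached:
        return str(count)
    # Use the highest threshold the count has reached, e.g. 7 -> "5+".
    return f"{max(reached)}+"


# Shape of the new field inside the signals response (value is always a string).
trust_and_relevance = {"reference_count": format_reference_count(134)}
```

Keeping the field a string from the start means clients do not need to change their parsing if opaque values such as "5+" are introduced later.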

Notes
  • We are waiting for design to confirm this approach and the specific thresholds they would like to see.
  • DPE would like to be involved in the design of this work. Because reference count may be expensive to calculate on the fly, it might make more sense to set up a derived data pipeline in partnership with DPE.
Open questions:

If reference count is hard to get, do we need to keep it as a trust_and_relevance property? Maybe it could be another signal.

Event Timeline

Just linking to T407026: Investigation: Add more data points to Contributions tab (editing basics), which includes a discussion with the Connection Team about how to track newly-added references to their pages of interest. Slightly different goal but ultimately same question of how to efficiently and accurately count references.

HCoplin-WMF renamed this task from "Provide reference count information within Atribution endpoint." to "[5.3.3 Epic] Provide reference count information within Attribution endpoint." Feb 19 2026, 2:13 AM
HCoplin-WMF updated the task description.
curl -s https://api.wikimedia.org/service/lw/inference/v1/models/articlequality:predict -X POST -d '{"rev_id": 1338010645, "lang": "en", "extended_output": "true"}' -H "Content-type: application/json" | jq .
{
  "label": "B",
  "features": {
    "raw": {
      "characters": 33550,
      "refs": 134,
      "wikilinks": 219,
      "categories": 8,
      "media": 6,
      "headings": 29,
      "sources": 120,
      "infobox": false,
      "messagebox": false
    },
    "normalized": {
      "characters": 1,
      "refs": 1,
      "wikilinks": 1,
      "categories": 0.5333333333333333,
      "media": 1,
      "headings": 1,
      "sources": 1,
      "infobox": false,
      "messagebox": false
    }
  },
  "score": 0.8565797301464592,
  "model_name": "articlequality",
  "model_version": "1",
  "wiki_db": "enwiki",
  "revision_id": 1338010645
}
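For anyone wanting to reuse this output, the two reference-related features can be pulled out of the response as sketched below. The helper name `raw_reference_features` is hypothetical, and the sample is a trimmed copy of the response above. Whether "sources" corresponds to the unique-source count this task asks for is an assumption that would need checking against the model's feature definitions.

```python
# Trimmed copy of the articlequality response shown above.
SAMPLE_RESPONSE = {
    "label": "B",
    "features": {
        "raw": {"characters": 33550, "refs": 134, "sources": 120},
    },
    "wiki_db": "enwiki",
    "revision_id": 1338010645,
}


def raw_reference_features(response: dict) -> dict:
    """Return the raw "refs" and "sources" features, skipping any that are absent."""
    raw = response.get("features", {}).get("raw", {})
    return {key: raw[key] for key in ("refs", "sources") if key in raw}


print(raw_reference_features(SAMPLE_RESPONSE))
```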
  • More generally, many data points like this (ML world calls these 'features' if they are inputs to models) are needed by many different things, both offline analysis (data lake) and online products (APIs / user features). Wouldn't it be nice if these were computed once per revision and reusable for all those use cases? ;)
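The "computed once per revision" idea can be illustrated with a small in-memory cache keyed by wiki and revision ID. This is only a sketch of the concept, not a proposal for the actual feature store: the class name `RevisionFeatureCache` and the `compute` callback are hypothetical, and it assumes the features depend only on the revision's own content, so a stored value never needs to be invalidated.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, int]


class RevisionFeatureCache:
    """Compute features at most once per (wiki_db, revision_id) and reuse them."""

    def __init__(self, compute: Callable[[str, int], Features]) -> None:
        self._compute = compute  # hypothetical expensive feature extractor
        self._store: Dict[Tuple[str, int], Features] = {}

    def get(self, wiki_db: str, revision_id: int) -> Features:
        key = (wiki_db, revision_id)
        if key not in self._store:
            # First request for this revision: compute and remember the result.
            self._store[key] = self._compute(wiki_db, revision_id)
        return self._store[key]
```

Both an API handler and an offline job could read from the same store, so the reference count would be derived in one place instead of once per consumer.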

Also overlapping with https://enterprise.wikimedia.com/api/structured-contents/ and my own team's HTML scraper for detailed stats about Cite <ref> usage.

Sign me up for this proposed unified feature store, I'd love to maintain fewer snowflakes!