
[SPIKE?] Attribution API MVP: Determine how to add 'top' trending signal to the attribution trust_and_relevance object
Closed, ResolvedPublic5 Estimated Story Points

Description

The concept of whether or not content is 'trending' is increasingly common in Search and AI reuse scenarios. Increases in activity relative to other pages often correlate with current events. For that reason, this signal is useful for communicating the timely relevance of articles and for highlighting Wikipedia as a source of both popular and up-to-date information.

Within the concept of trending, there are two flavors: the notion of the top articles within Wikimedia as a whole, and highlighting when pages are trending relative to their own baseline performance. For the scope of this ticket, we will focus on the first, which provides context about whether or not the given article is in the 'top' category for editing, reading, or both.

Conditions of acceptance

  • Determine a path for implementing the 'top' signal, within a trust_and_relevance.trending object (see recommended object structure below)
  • Pages should be tagged with:
    • top_read: true --> Returned if the requested page is in the top read category for the project
    • top_edited: true --> Returned if the requested page is in the top edited category for the project
    • top_read_and_edited: true --> Returned if the requested page is in both the top read and top edited categories for the project
  • Return false if the page is not in any of the given categories
  • Determine how to best cache/store this information so that it is not pulled on every request
    • AQS updates data every 24 hours (confirm this with DPE), meaning the 'top' indicators will not change more frequently than that
    • Calculate top for a per-day window (eg: Top read yesterday)
  • Document recommended path to review with team

Implementation details

The notion of "Top" currently exists within the AQS Analytics API.

Potential tech approach:

  • Implement as a maintenance script on daily cron job & store within Attribution stash
  • Iterate through wiki projects
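As a reference for the maintenance-script direction above, the daily top-read list is available from the public AQS pageviews API. A minimal sketch of the fetch-and-parse step (Python for illustration only; helper names are hypothetical, and the real job would be a PHP maintenance script inside MediaWiki):

```python
from datetime import date

# Sketch only: hypothetical helper names. This shows the shape of the
# public AQS 'top' pageviews endpoint URL and its response payload.
AQS_BASE = "https://wikimedia.org/api/rest_v1"

def top_read_url(project: str, day: date) -> str:
    """Build the public AQS URL for a given day's top-read list."""
    return (f"{AQS_BASE}/metrics/pageviews/top/"
            f"{project}/all-access/{day:%Y/%m/%d}")

def extract_titles(payload: dict) -> list[str]:
    """Pull the ranked article titles out of an AQS 'top' response."""
    items = payload.get("items", [])
    if not items:
        return []
    return [entry["article"] for entry in items[0].get("articles", [])]

print(top_read_url("en.wikipedia", date(2026, 2, 21)))
# Trimmed example of the documented response shape:
sample = {"items": [{"articles": [
    {"article": "Main_Page", "views": 470381, "rank": 1},
    {"article": "Special:Search", "views": 120000, "rank": 2},
]}]}
print(extract_titles(sample))
```

The cron job would loop this over every project, then write the title lists into the cache discussed below.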

Suggested response payload (but open to whatever other structure/recommendations):

"trust_and_relevance": {
  "last_modified": "2026-02-22T15:39:51Z",
  "page_views": 470381,
  "contributor_counts": 0,
  "trending": {
    "top": {
        "read": boolean,
        "edited": boolean,
        "read_and_edited": boolean
    },
    "relative": {
        "read": boolean,
        "edited": boolean,
        "read_and_edited": boolean
     }
  }
},
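For concreteness, a hypothetical filled-in instance of the payload above (values illustrative only), for a page that is in the top-read list but not the top-edited list:

```json
"trust_and_relevance": {
  "last_modified": "2026-02-22T15:39:51Z",
  "page_views": 470381,
  "contributor_counts": 0,
  "trending": {
    "top": {
        "read": true,
        "edited": false,
        "read_and_edited": false
    },
    "relative": {
        "read": false,
        "edited": false,
        "read_and_edited": false
    }
  }
},
```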

Event Timeline

HCoplin-WMF renamed this task from [SPIKE?] Determine how to add 'top' trending signal to the attribution trust_and_relevance object to [SPIKE?] Attribution API MVP: Determine how to add 'top' trending signal to the attribution trust_and_relevance object.Feb 26 2026, 4:38 PM
HCoplin-WMF updated the task description.
HCoplin-WMF set the point value for this task to 5.
HCoplin-WMF raised the priority of this task from Medium to High.
HCoplin-WMF moved this task from Incoming (Needs Triage) to To Refine on the MW-Interfaces-Team board.

Minor tweak here: Changing from a 7-day look-back window to "for the last day", essentially. This should simplify the approach slightly, and simplifies how we talk about it.

Bringing back to sprint as this was an unintentional change, didn't know that adding MW-Interfaces-Team tag would remove the sprint tag.

Note from attribution sync:

  • AQS is a Go service; there is no PHP interface we can clearly hook into. We need to figure out how we can make a connection to pull the data.
  • page-views implementation might do something similar: https://www.mediawiki.org/wiki/Extension:PageViewInfo.
    • We might be able to pull the data from here, as well, maybe? One implementation approach might be to add here if we can't get it another way.
  • We cannot make a round trip request to the HTTP API for this; it is too expensive and slow.
  • Andrew Otto (@Ottomata) can help answer questions about AQS :)

TY! I really look forward to having time to focus on this with you all ASAP! Cc also @GGoncalves-WMF :)

Hey @Ottomata!
I have a couple of questions where your expertise on AQS would be really helpful:

  • AQS update frequency — can you confirm that the top-read and top-edited endpoints update every 24 hours? I want to make sure we size our cache TTL and cron job frequency correctly.
  • Connecting to AQS internally — I'm wondering about building a dedicated service to call AQS, separate from PageViewInfo, so we'd be calling AQS via the service mesh directly rather than going through rest-gateway. I noticed the page-analytics listener has already been added to the service mesh (T411771). Is that the right path for calls to the metrics/pageviews/top endpoint, and is there anything we should be aware of before going down that route?

Thanks so much in advance! 🙏

cc. @GGoncalves-WMF

Hello! I will start by asking you a question because now I am curious!

Attribution stash

What is the attribution stash? I'm currently exploring storage options for serving in MW in Attribution API derived data design work.

can you confirm that the top-read and top-edited endpoints update every 24 hours?

Hi Andrew! I've just read your response and the references.

  • On the Attribution stash — great question! I think it was used loosely in the ticket to refer to a cache store. After reading through your storage design doc, I think our use case for the trending signal is actually much simpler than the contributor counts problem you're exploring there. We don't need per-page storage at scale — we just need to store two small lists of page titles per project (top-read and top-edited), pre-computed nightly by a cron job. BagOStuff/Redis feels like the right fit for this: write once a day, fast key-value lookup at runtime.
  • On top-read — great, confirmed that daily works perfectly for our use case.
  • On top-edited — this is really useful to know, thank you. The monthly granularity changes my assumptions considerably. Based on this, I'm weighing whether to omit the top_edited field or return it as null for now, until we have a daily data source consistent with top-read — otherwise we'd end up with an asymmetry in the signal that could be confusing for API consumers. I'll align with the team on the final call.

That said, one other small question to help build a clearer picture: looking at the AQS edit analytics docs, I noticed that the top-by-edits, top-by-net-bytes-difference and top-by-absolute-bytes-difference endpoints all accept a specific {year}/{month}/{day} parameter. Does that mean they support daily granularity at the API level, even if the underlying dataset updates monthly? And if daily data is accessible through one of these endpoints, which would you recommend for a top edited signal?

Thank you so much for your support!

a specific {year}/{month}/{day} parameter. Does that mean they support daily granularity at the API level, even if the underlying dataset updates monthly?

Reading docs and remembering...

Yes, it is updated monthly but available at monthly and daily granularities.

You can request top edits in a specific month (by setting `day` to `all-days`), or top edits in a specific day. The reason the endpoint params are different than others is that 'top' is not an additive metric, so it can't be summed or presented as a timeseries in the same way as other metrics, i.e. SUM(7 days of top edits) != top edits for a week. Each ranking window is distinct.
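The non-additivity can be seen with a toy example (invented numbers): a page can lead the combined window without ever topping a single day, so per-day top lists cannot be merged into a weekly top list.

```python
from collections import Counter

# Toy per-day edit counts for three pages over two days (invented numbers).
day1 = Counter({"PageA": 10, "PageB": 8, "PageC": 1})
day2 = Counter({"PageA": 1, "PageB": 8, "PageC": 10})

def top1(counts: Counter) -> str:
    """Title with the most edits in the given window."""
    return counts.most_common(1)[0][0]

# 'Top' per day: PageA on day 1, PageC on day 2 -- PageB never appears...
daily_tops = {top1(day1), top1(day2)}

# ...but over the combined window PageB has the most edits (16 vs 11 each).
window_top = top1(day1 + day2)

print(daily_tops)   # {'PageA', 'PageC'}
print(window_top)   # PageB
```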

And if daily data is accessible through one of these endpoints, which would you recommend for a top edited signal?

You want to rank by number of edits, yes? If so, then top-by-edits sounds like what you want. If you want to rank by byte-count changes, then you could use one of the other two.
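For intuition on how the three rankings differ (a sketch with invented numbers): a page whose edits add and remove similar amounts of text ranks low by net byte difference but high by absolute byte difference, while edit count ignores sizes entirely.

```python
# Per-revision byte deltas for one page on one day (invented numbers).
deltas = [+500, -480, +300, -310]

net_bytes = sum(deltas)                   # basis of top-by-net-bytes-difference
abs_bytes = sum(abs(d) for d in deltas)   # basis of top-by-absolute-bytes-difference
edit_count = len(deltas)                  # basis of top-by-edits

print(net_bytes, abs_bytes, edit_count)   # 10 1590 4
```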

until we have a daily data source consistent with top-read

Please submit this as a need / request to DPE, perhaps even by adding another request in the existing Derived Data Prep Pantry intake request. We have a good plan (T406069#11247231) on how to update this data daily, we just need an explicit justification to do it.

we just need to store two small lists of page titles per project (top-read and top-edited), pre-computed nightly by a cron job. BagOStuff/Redis feels like the right fit for this: write once a day, fast key-value lookup at runtime.

Makes sense! BTW, the reason I ask is: I'd really like to solve the common problem of how to (easily) get data from outside of MW into MW so it can serve things. In this case, AQS is a public API you can use, so you have a pretty easy and supported way to get it. In many other cases, things are more complicated.

write once a day, fast key-value lookup at runtime.

Naive pondering: is this needed given that AQS responses themselves are cached by the CDN? I guess you'd like to avoid any request-time misses?

Anyway, I realized I didn't answer your other question!

Connecting to AQS internally — I'm wondering about building a dedicated service to call AQS, separate from PageViewInfo

I would love to think about this more holistically as part of OW6.2 in the next FY. I have an uninformed opinion that DPE should own some kind of 'Data Product / Derived Data / External Data Access' extension that could provide PHP APIs to help us with these common problems. We need to do more platform product research to be sure though. See also T414109: Technical assessment of AQS framework, specifically T414109#11523817.

I'm not that familiar with PageViewInfo, but given that it has a WikimediaPageViewService, adding a MW service for the edit-analytics API makes sense to me. It doesn't jibe with the extension name, but it seems like this extension should be renamed to something more general (AQS client, WikimediaAnalyticsAPIClient, whatever :) ). Maybe just add the Service and put a BUNCH of comments and a nice README saying "ah we should rename this and also we need to think bigger soon"?

:)

Naive pondering: is this needed given that AQS responses themselves are cached by the CDN? I guess you'd like to avoid any request-time misses?

Exactly — the goal is to avoid any HTTP hop at request time entirely, even a CDN-cached one. A guaranteed local lookup felt like the safest approach for the MVP.

Yes, it is updated monthly but available at monthly and daily granularities.

Understood! For the MVP we can likely proceed with the monthly granularity for now and flag the asymmetry with top-read clearly in the API. Not ideal, but workable as a first iteration. I'll talk to the team and start a discussion :)

Please submit this as a need / request to DPE, perhaps even by adding another request in the existent Derived Data Prep Pantry intake request. We have a good plan (T406069#11247231) on how to update this data daily, we just need an explicit justification to do it.

Really good to know - I'm thinking that we can definitely submit a formal request once the MVP is in place. That feels like the natural next step to extend and consolidate things properly.

Maybe just add the Service and put a BUNCH of comments and a nice README saying "ah we should rename this and also we need to think bigger soon"? :)

That's honestly very much in the spirit of what we're trying to do here, I totally agree :) — land on the simplest viable thing for the MVP, and leave clear breadcrumbs for the bigger holistic solution you're envisioning for OW6.2. Really looking forward to seeing where it goes in future iterations! :)

Spike Analysis — Proposed path for implementing trending.top


The core problem

Trending data comes from the AQS Analytics API, which exposes two periodically updated lists:

  • Top 1000 most-read pages per project — updated every 24 hours
  • Top 100 most-edited pages per project — updated monthly (see top-edited section)

Fetching these lists in real time on every Attribution API request is not feasible: it would mean downloading large lists on every single request, with an unacceptable impact on latency and load.


Proposed solution

Split the problem into two distinct moments:

1. Data collection (asynchronous, nightly)
A scheduled process — running as a nightly cron job — queries AQS once for all Wikimedia projects, builds the sets of trending page titles, and stores them in a distributed cache (Redis/BagOStuff), with a 48-hour TTL to ensure continuity in case the nightly job fails.

2. Runtime lookup (synchronous, very fast)
When the Attribution API receives a request, it simply reads from the pre-populated cache to determine whether the requested title appears in the top_read and/or top_edited sets for the relevant project. No external calls, no significant latency added.


Proposed architecture

A dedicated service and a nightly maintenance script will be created inside the WikimediaCustomizations extension to handle both the nightly AQS calls and the cache lookups at runtime. Redis/BagOStuff has been confirmed as the appropriate storage approach: two lists of titles per project, written once a day by a cron job and read quickly at runtime.

For the connection to AQS, the service mesh will be used — the same path already in use by PageViewInfo. The new service will be integrated into the Attribution API following the same pattern: registered as an optional service in the MediaWiki service container, and injected into the data builder where it constructs the trending sub-object to be added to trust_and_relevance.

The suggestion of extending PageViewInfo itself with an edit-analytics service is noted as a possible future consolidation path, but is out of scope for the MVP.


Signal structure

The new trending.top object inside trust_and_relevance will contain:

  • top_read → true if the article is among the top-read pages
  • top_edited → true if the article is among the top-edited pages
  • top_read_and_edited → true if it appears in both lists

All values return false if the article is not in the given category. The entire trending object is null if data is not yet available in cache (e.g. first run, empty cache).


Project scope

The MVP prioritizes Wikipedia, Commons and Wikidata.

However, the implementation should be designed to cover all projects supported by AQS from the start, since the additional cost would be minimal if the design is generic enough — meaning the cron job iterates over all available projects without hardcoding a specific list.


The top-edited data — current limitation and path forward

It has been confirmed with DPE that top-edited data updates monthly, not daily. The AQS endpoint accepts daily parameters but the underlying dataset remains monthly.

This creates an asymmetry with top_read, which is daily.

MVP proposal:
Expose the top_edited field using the monthly data already available from AQS. This gives us a usable signal from day one, with the understanding that the underlying data updates monthly rather than daily. The asymmetry with top_read will be clearly documented so API consumers are aware of it.

Post-MVP:
Submit a formal request to DPE to justify prioritizing daily loading (T406069). A solid technical plan already exists — it just needs an explicit justification from our side to be unblocked. Once daily data is available, the signal can be updated and consolidated without any changes to the API contract.

Closing out MW-Interfaces-Team (MWI-Sprint-29 (2026-03-10 to 2026-03-24)); marking everything from that sprint as resolved.

https://phabricator.wikimedia.org/project/board/8573/