Identify LLM-mediated requests in our pageview data products
Open, Needs Triage, Public

Description

P&T senior leadership wants to track LLM-mediated traffic (retrieval-augmented generation, RAG for short) in relation to other pageview dimensions, both to decide whether product interventions should cater to this access path specifically and to inform (potential) donor conversations based on the demand for Wikimedia content. WME has also expressed a similar interest in the past, for market intelligence and business development outreach.

A few of the major players self-identify their retrieval traffic via User-Agent and IP addresses (here's a survey).

This is different from, but related to, our recent refresh of AI chatbot referrers (T406531): that ticket allows us to track a click-through from an LLM that turns into a pageview, whereas this ticket is about an AI agent that has looked up a Wikipedia page to formulate a response to a user. We can probably approach it similarly though: a collaboration between Movement Insights and Data Engineering, with the former driving finer requirements and alignment to business needs, and the latter reflecting those in the actual data.

Acceptance criteria:

  • Analysts can identify and query RAG requests for a set of priority AI agents in pageview_hourly and pageview_daily.

Next steps:

  1. Determine the list of AI agents to prioritize (MI)
  2. Determine how to model this kind of traffic (Is RAG a subtype of spider? How do we represent it?) (DE + MI)
  3. Identify the priority AI agents by User-Agent and represent them in the pageview data (DE)
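
As a starting point for step 3, User-Agent matching could look roughly like the sketch below. This is only an illustration under assumptions: the function name `classify_rag_agent` and the `RAG_AGENT_PATTERNS` mapping are hypothetical, and the three tokens are examples of user-triggered fetchers that some vendors publish. The actual priority list is for MI to determine in step 1, and the tokens should be checked against each vendor's current documentation.

```python
import re
from typing import Optional

# Hypothetical, illustrative token list. The real set of priority agents
# comes out of step 1 and must be verified against vendor documentation.
RAG_AGENT_PATTERNS = {
    "ChatGPT-User": re.compile(r"ChatGPT-User", re.IGNORECASE),
    "Perplexity-User": re.compile(r"Perplexity-User", re.IGNORECASE),
    "Claude-User": re.compile(r"Claude-User", re.IGNORECASE),
}


def classify_rag_agent(user_agent: Optional[str]) -> Optional[str]:
    """Return the label of the first pattern found in the User-Agent, else None."""
    if not user_agent:
        return None
    for label, pattern in RAG_AGENT_PATTERNS.items():
        if pattern.search(user_agent):
            return label
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; ChatGPT-User/1.0; +https://openai.com/bot)"
    print(classify_rag_agent(sample))
    print(classify_rag_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```

How the resulting label is stored (a new agent type, a subtype of spider, or a separate field) is exactly the open modelling question in step 2, so the sketch deliberately returns only a label and leaves the representation undecided.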

Out of scope:

  • Using IP lists for classification. Those are partially collected at the edge and used in the X-Provenance header, but not currently stored in the data platform.
  • Using client-side signals for further accuracy and for detecting agents that don't self-identify; for the above use cases, this is currently considered to bring only diminishing gains.