P&T senior leadership wants to track LLM-mediated traffic (retrieval-augmented generation, or RAG) alongside other pageview dimensions, to decide whether product interventions should cater to this access path specifically, and to inform potential donor conversations based on the demand for Wikimedia content. WME has also expressed a similar desire in the past, to support market intelligence and business-development outreach.
A few of the major players self-identify their retrieval traffic via User-Agent and IP addresses (here's a survey).
This is different from, but related to, our recent refresh of AI chatbot referrers (T406531): that ticket allows us to track a click through an LLM that turns into a pageview, whereas this ticket is about an AI agent that has looked up a Wikipedia page to formulate a response to a user. We can probably approach it similarly though: a collaboration between Movement Insights and Data Engineering, with the former driving finer requirements and alignment with business needs, and the latter reflecting those in the actual data.
Acceptance criteria:
- Analysts can identify and query RAG requests for a set of priority AI agents in pageview_hourly and pageview_daily.
Next steps:
- Determine the list of AI agents to prioritize (MI)
- Determine how to model this kind of traffic (Is RAG a subtype of spider? How do we represent it?) (DE + MI)
- Identify the priority AI agents by User-Agent and represent them in the pageview data (DE)
Out of scope:
- Using IP lists for classification. Those are partially collected at the edge and used in the X-Provenance header, but not currently stored in the data platform.
- Using client-side signals for further accuracy and for detecting agents that don't self-identify; for the above use cases, this is currently considered to offer only diminishing gains.