
[Idea] Collect pageview data using client-side instrumentation
Open, Needs Triage · Public

Description

NOTE: This task documents a frequently-discussed idea. As of June 2025, there is no plan to implement it.

Currently, pageviews and related datasets are derived from the server-side webrequest logs.

As a complement to or a replacement for this logged pageview data, we could develop a system of client-side instrumentation that allows products to produce an event when they consider a page to be viewed.

Benefits

  • Would reduce the amount of bot traffic included in page views and give more tools for detecting the bots that remain
  • Would allow additions to the page view definition that could only be detected on the client side, like the page being viewed for a certain length of time.
  • Would allow a decentralized pageview definition controlled and evolved by each product, rather than one that is centralized in the cache layer and pattern-based.
  • A smaller stream than webrequest would allow easy stream processing, which could unlock use cases like rapid detection of trending pages
  • (If instrumented page views replace logged ones) Would save the significant resources involved in filtering webrequest

Drawbacks

  • Implementation would be a major investment, one that could simply be skipped if logged page views are, or can be made, good enough
  • Replacing logged page views would require a major investment in data validation and stakeholder education; complementing them would mean juggling and reconciling two different sets of pageview definitions, pipelines, and metrics
  • Would require maintaining separate implementations of the pageview definition in each pageview-producing client, compared to the single implementation in the log-based setup
  • Would not include users running some content blockers, users not running JavaScript, or requests that are aborted before the JavaScript runs

Implementation

It would make sense to instrument page views using Experimentation Lab, in order to take advantage of its API, consistent base schema, and easy configuration.

However, we must not use the Experimentation Lab's capability of logging an instrument-specific hashed version of the Edge Uniques cookie, as we have committed to using that capability only for limited experiments, not for permanent instrumentation.

In a very basic sense, implementation would be straightforward: nothing more than a couple of lines of code using the Experimentation Lab API (see here for an example) in each of the three pageview-producing clients (MediaWiki, the Android app, and the iOS app), plus a similar amount of code to configure the event stream and destination table.
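
To illustrate, here is a minimal sketch of what the MediaWiki side might look like. The stream name, schema URI, and exact submit call are assumptions modeled on the Metrics Platform client; the actual Experimentation Lab API may differ.

```
// Minimal sketch: fire a "page viewed" event from MediaWiki.
// Stream name, schema URI, and the submit call are illustrative.
mw.loader.using( 'ext.eventLogging' ).then( function () {
	mw.eventLog.submitInteraction(
		'product.pageview',                           // hypothetical stream
		'/analytics/product_metrics/web/base/1.0.0',  // hypothetical schema URI
		'pageview',                                   // action name
		{}                                            // instrument-specific data
	);
} );
```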

However, to do it properly, there are a number of additional things that would have to happen:

  • Legal and privacy review
  • Assessment of how the CDN, EventGate, and Kafka would handle the extra load
    • EventGate and Kafka could likely handle 100% instrumentation of page views
    • It's possible that CDN could not during major traffic spikes
    • Sampling could be used to address load constraints, although this would add a certain amount of noise and make the data somewhat more complicated to use (see the sketch after this list)
  • Development of additional capabilities for the Experimentation Lab
    • Support for "baseline metrics"/always-on data collection/all-wiki coverage (included in GrowthBook)
    • Bot detection (this is already on the roadmap, as bot traffic can contaminate experiment data)
  • Adaptation of the page view definition to the client-side context
  • Implementation of the full page view definition in each event-producing client
  • Testing and analysis comparing the data to logged page views
  • Implementing data pipelines for aggregation, refinement, and serving
    • Could potentially share much of the code already used for log-based pageviews
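
To illustrate the sampling bullet above, here is a minimal sketch assuming a per-pageview random decision made in the client; the rate and submitPageviewEvent() are hypothetical, and Experimentation Lab stream configuration may well provide sampling natively.

```
// Illustrative client-side sampling: submit an event for ~1% of pageviews.
// SAMPLE_RATE and submitPageviewEvent() are hypothetical.
var SAMPLE_RATE = 0.01;

if ( Math.random() < SAMPLE_RATE ) {
	submitPageviewEvent();
}

// Downstream, totals would be estimated as observedEvents / SAMPLE_RATE,
// which is where the extra noise comes from: the estimate carries binomial
// sampling error that grows (relatively) as the true count shrinks.
```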

Adoption

There would also be substantial challenges in adopting the new type of page view data.

If the intent were for it to replace the current log-based data, there would have to be an even more exhaustive analysis of the differences and a major project to educate the large population of data users throughout the movement about the discontinuity caused by the switch. In addition, the current implementation would likely need to be maintained in parallel for at least a year in order to clearly distinguish real changes in this vital metric from changes caused by the switch.

If the intent were for it simply to complement the current log-based data, this would decrease the risk substantially, but it would also add new challenges. How would we reconcile two distinct metrics? A composite metric would be difficult to develop and confusing to end users, while using both side by side would increase the burden on analysts in making sense of the trends and communicating them clearly. Similarly, other data users like the Pageviews Tool would have to decide whether to choose one definition or support the complexity of two definitions (pushing some of the burden to its own users).

Relevant teams

The following Wikimedia Foundation teams are most relevant:

  • SRE Traffic
    • Manages the CDN infrastructure which would need to handle the additional load
  • Data Engineering
    • Owns the current log-based page view implementation and would likely own the instrumented page view implementation
    • Would need to manage availability of instrumented page view data in Data Lake tables, dumps, and APIs
  • Experiment Platform
    • Provides the platform capabilities which would likely be used to implement instrumented page views
    • Considers owning/driving the implementation of instrumented page views out of its scope
  • Movement Insights
    • Major user of page view data, highly involved in the page view definition and page view data quality

See also

Please edit this list to keep track of use cases and relevant tasks as they arise.

Event Timeline

Ottomata renamed this task from [EPIC] Instrument pageviews using events, instead of webrequests to [Epic] Instrument pageviews using events, instead of webrequests. · Jul 29 2024, 8:17 PM
Ottomata updated the task description.

Some recent thoughts on this from our Slack thread:

  • Excited. The biggest reason is a smaller stream of data than webrequest, implying easier stream processing.
  • We initially thought of the JS pageview instrument as an answer to Google Knowledge Graph gobbling up our pageviews. The idea was to make it very easy for content reusers to send us "Content Views" and for us to evolve our pageview definition to align with "Knowledge as a Service". So we were going to expand the schema, call it ContentViews, and make a public endpoint that anyone could post to. This, by the way, was one of the initial inspirations for EventPlatform and EventGate. It turns out that Google, Apple, etc. are not super interested in sending us this information. But it might be useful to build it that way, so good actors could send reuse info if they want to. I think our schema would be more complete if it allowed for it (see the sketch after this list).
  • Beyond losing non-JS user agents, which we might be able to instrument differently, we might be losing visits that end before the JS instrument loads. We'd have ample opportunity to compare data we get to webrequest and find this and other anomalies, so we never worried about this too much. If the loss turns out to make the data unusable we can just turn it off and no big deal.
  • I prefer looking at this as a baseline instrument that Data Engineering sets up via xLab. We can do whatever we have to do to make that work. The keys would be:
    • this should not be tied to an experiment
    • no subject ids should be logged anywhere, and the system should prevent these types of baseline instruments from even being able to capture subject ids
    • To me, a pageview instrument is either useful as a core baseline instrument or it should just be a well-understood guard rail that folks can run with their experiments.
  • I've been thinking about bot detection using client-side user activity and models like reCAPTCHA. We can't scale that specifically, but maybe as we evolve we can make our own model. But this determination can be made and sent via x-analytics with the current webrequest log. So I don't think we get any additional bot-detection capabilities with a JS instrument.
  • As for what we capture now that we couldn't capture like this, I started to annotate the schema of pageview_actor with that in mind
  • On pollution of data. There'd be nothing stopping someone from sending a million events from a browser console or something. But there's nothing stopping anyone from loading our pages a million times. The nice thing about pollution in the JS instrument is we may be able to add some user interaction data to the event that allows us to filter out noise.
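
To make the ContentViews idea above concrete, here is a hedged sketch of what a reuser-facing submission could look like. The endpoint URL, stream name, and schema below are illustrative assumptions, not an existing public API; the general shape (a JSON array of events, each carrying a $schema URI and a meta.stream field) follows the EventGate intake convention.

```
// Illustrative only: endpoint, stream, and schema are assumptions.
fetch( 'https://intake-analytics.wikimedia.org/v1/events', {
	method: 'POST',
	headers: { 'Content-Type': 'application/json' },
	body: JSON.stringify( [ {
		$schema: '/analytics/content_view/1.0.0',  // hypothetical schema
		meta: { stream: 'content.views' },         // hypothetical stream
		page_title: 'Earth',
		reuser: 'example-knowledge-panel'          // who is reporting the view
	} ] )
} );
```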

To unpack my first bullet more. We've thought about stream-processing webrequest for a variety of reasons. It would mainly be cool. It's possible but kind of a big job and the payoff wasn't really there. If we had a purely pageview stream, we'd revisit that and maybe get some cool trending data out of it.

Also documenting my reply from the same thread:

Regarding pageview instrumentation being on the Test Kitchen roadmap:
It is not. At most, the thing that would be on our roadmap is infrastructure and tooling that enables another team to instrument and experiment on pageviews. The fact that we have temporarily instrumented simple pageviews for the synthetic A/A tests does not mean we would take on the general work of instrumenting pageviews. That said, we have a lot of expertise on the team about page visibility, code execution, timing of DOM events, etc., and would be a valuable resource to consult.

Regarding sampling rates:
Sampling rates are going to have to be figured out because we've DDoSed ourselves before. If we were to send a pageview event with every possible pageview right now, there's a chance we bring the whole thing down. Could EventGate and Kafka handle the volume of events in terms of processing? (Pageview events that would have to be processed in addition to all the non-pageview events produced by all the other instrumentation.) We have reason to believe that EventGate and Kafka could, even without throwing more machines at the problem. Could Varnish, etc. handle the volume of requests sent to the analytics intake endpoint (again, in addition to all the non-pageview events being sent)? On most days: probably. On days when we get a new pope: probably not.

Regarding automated/bot traffic:
One really nice benefit that client-side instrumentation gives us, which sifting through webrequests doesn't, is access to browser APIs for implementing conditions around how long the page was visible (if at all), among other checks.
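
For example, here is a minimal sketch that counts a view only after the page has been visible for a few seconds, using the standard Page Visibility API; the threshold and submitPageviewEvent() are illustrative.

```
// Count the pageview only after 5 seconds of visibility (threshold is
// illustrative). Uses the standard Page Visibility API.
var DWELL_MS = 5000;
var timer = null;
var submitted = false;

function armTimer() {
	if ( !submitted ) {
		timer = setTimeout( function () {
			submitted = true;
			submitPageviewEvent(); // hypothetical submit function
		}, DWELL_MS );
	}
}

document.addEventListener( 'visibilitychange', function () {
	if ( document.visibilityState === 'visible' ) {
		armTimer();
	} else {
		clearTimeout( timer ); // hidden before the threshold: don't count
	}
} );

if ( document.visibilityState === 'visible' ) {
	armTimer();
}
```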

Regarding conducting experiments to increase pageviews:
In FY25/26 WE 3.3, one of the metrics of interest is internal referrals. To conduct experiments to increase that metric, the team will need to instrument internal referrals, either by instrumenting all links shown to the user on every pageview or by implementing a mechanism that notifies an instrument running on a page that the user navigated there from another page rather than directly (a sketch of the second approach follows). Anyone interested in this work would do well to connect with efforts in that KR.
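
A minimal sketch of that second mechanism, assuming a flag handed off through sessionStorage (which survives same-tab, same-origin navigation); the key name is illustrative.

```
// Source page: mark clicks on internal links before navigation.
document.addEventListener( 'click', function ( e ) {
	var link = e.target.closest && e.target.closest( 'a' );
	if ( link && link.host === location.host ) {
		sessionStorage.setItem( 'internalReferral', location.pathname ); // hypothetical key
	}
} );

// Destination page: the instrument reads (and clears) the flag.
var referredFrom = sessionStorage.getItem( 'internalReferral' );
sessionStorage.removeItem( 'internalReferral' );
if ( referredFrom !== null ) {
	// ...record this pageview as internally referred, from referredFrom...
}
```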

Regarding Experimentation Lab and the Varnish VMOD Edge Unique stuff not being set up to accommodate an "all wikis" setup for experiments when it involves the edge unique cookie:
There are optimizations to the code that could help us increase the number of target wikis, although I'm not sure that conducting an experiment on 100 or 200 or 300 or 600 wikis would give you better insights or improve decision-making more than conducting it on just 50, especially if that 50 is a diverse and representative set.

Regarding Experimentation Platform capabilities having a role in an instrumented page view system:
Similarly to Dan's

I prefer looking at this as a baseline instrument that Data Engineering sets up via xLab.

with the added point that in the GrowthBook model – which we will begin to thoroughly investigate when we begin our GrowthBook integration system design sprint in September 2025 – there is, from what I've been able to understand from their docs (https://docs.growthbook.io/app/metrics; see also Fact Tables), a notion of always-on data collection regardless of experiments. When a client is enrolled in an experiment, all the data collected from that client while they were in the experiment is used to calculate the chosen metrics for the experiment. Part of figuring out the future system architecture is going to involve drafting a cohesive, unified framework for metrics, instrumentation, and experimentation at WMF.

nshahquinn-wmf renamed this task from [Epic] Instrument pageviews using events, instead of webrequests to [Idea] Collect pageview data using client-side instrumentation. · Jun 30 2025, 7:41 PM
nshahquinn-wmf updated the task description.

Thank you, @dr0ptp4kt, @mpopov, and @Milimetric, for your super-thoughtful responses! I've tried to summarize things in the task description; feel free to add new info or correct any mistakes.

A few questions:

In FY25/26 WE 3.3, one of the metrics of interest is internal referrals. To conduct experiments to increase that metric, the team will need to instrument internal referrals, either by instrumenting all links shown to the user on every pageview or by implementing a mechanism that notifies an instrument running on a page that the user navigated there from another page rather than directly.

Interesting! I wonder why this is necessary, given that (based on my limited understanding) document.referrer is available in JavaScript? This is quite relevant here, as referrer is one of the key dimensions we'd need in general-purpose pageview data.
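
For instance, a classification along these lines seems possible, with the caveat that what document.referrer contains depends on the Referrer-Policy in effect on the referring page, so this is a sketch rather than a guarantee of path-level referrer data.

```
// Classify the current pageview using document.referrer. The granularity
// available (full URL vs. origin only) depends on the referring page's
// Referrer-Policy.
var ref = document.referrer;
var referralClass;

if ( ref === '' ) {
	referralClass = 'direct-or-unknown';
} else if ( new URL( ref ).host === location.host ) {
	referralClass = 'internal'; // same-host navigation
} else {
	referralClass = 'external';
}
```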

I've been thinking about bot detection using client-side user activity and models like reCAPTCHA. We can't scale that specifically, but maybe as we evolve we can make our own model. But this determination can be made and sent via x-analytics with the current webrequest log. So I don't think we get any additional bot-detection capabilities with a JS instrument.

Hmm, but that would require having the JavaScript run before the web request is sent, no? So at most it could work for internally-referred page views.

Also, a couple of notes that don't belong in the description:

  • @Milimetric and @Ottomata both said they're excited (rather than neutral or skeptical) about this idea. @dr0ptp4kt did not say explicitly but sounded moderately positive. I personally am neutral.
  • @dr0ptp4kt said that, if we do this, it should be as a complement to log-based pageviews rather than a replacement. I very much agree; although having two side-by-side versions of pageviews would add significant complexity to analysis and communications, it's an extremely important metric, so it would make sense to keep log-based pageviews around for additional insight rather than shutting them down.

We can keep both, but we will only generate pageview_hourly and all the downstream datasets from one source. So at first, while we figure out the difference between the two, the JS instrument would just be generating something like pageview_hourly_next. If we love the new data, we can replace pageview_hourly, but it'll take some communicating.