Currently, pageviews and related datasets are derived from the server-side webrequest logs.
As a complement to or replacement for this logged pageview data, we could develop a system of client-side instrumentation that allows products to produce an event when they consider a page to be viewed.
Benefits
- Would reduce the amount of bot traffic included in page views and give more tools for detecting the bots that remain
- Would allow additions to the page view definition that could only be detected on the client side, like the page being viewed for a certain length of time.
- Would allow a decentralized pageview definition controlled and evolved by each product, rather than one that is centralized in the cache layer and pattern-based.
- A smaller stream than webrequest would allow easy stream processing, which could unlock use cases like rapid detection of trending pages
- (If instrumented page views replace logged ones) Would save the significant resources involved in filtering webrequest
Drawbacks
- Implementation would be a major investment, which could be skipped entirely if logged page views are, or can be made, good enough
- Replacing logged page views would require a major investment in data validation and stakeholder education; complementing logged page views would require juggling and reconciling two different sets of page view definitions, pipelines, and metrics
- Would require maintaining separate implementations of the pageview definition in each pageview-producing client, compared to the single implementation in the log-based setup
- Would not include users running some content blockers, users not running JavaScript, or requests that are aborted before JavaScript runs
Implementation
It would make sense to instrument page views using Experimentation Lab, in order to take advantage of its API, consistent base schema, and easy configuration.
However, we must not use the Experimentation Lab's capability of logging an instrument-specific hashed version of the Edge Uniques cookie for permanent instrumentation, as we have committed to use that capability only for limited experiments.
In a very basic sense, implementation would be straightforward: nothing more than a couple of lines of code using the Experimentation Lab API (see here for example) in each of the three pageview-producing clients (MediaWiki, the Android app, and the iOS app) and a similar amount of code to configure the event stream and destination table.
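As a rough illustration of how small the client-side logic could be, the sketch below emits a page view event once a page has been visible for a minimum total time. It is not the actual Experimentation Lab API: the `Instrument` interface, the `submitInteraction` method, the `page_view` action name, and the event fields are all hypothetical stand-ins chosen for the example, and the visibility threshold is just one possible product-defined rule.

```typescript
// Hypothetical shape of an instrument; NOT the real Experimentation Lab API.
interface Instrument {
  submitInteraction(action: string, data?: Record<string, unknown>): void;
}

// Hypothetical product-level rule: count a page view once the page has been
// visible for at least `minVisibleMs` milliseconds in total.
class PageViewTracker {
  private readonly instrument: Instrument;
  private readonly pageTitle: string;
  private readonly minVisibleMs: number;
  private visibleSince: number | null = null;
  private accumulatedMs = 0;
  private sent = false;

  constructor(instrument: Instrument, pageTitle: string, minVisibleMs: number) {
    this.instrument = instrument;
    this.pageTitle = pageTitle;
    this.minVisibleMs = minVisibleMs;
  }

  // Call when the page becomes visible (for example, on a visibility change).
  onVisible(nowMs: number): void {
    if (this.visibleSince === null) {
      this.visibleSince = nowMs;
    }
  }

  // Call when the page is hidden; visible time up to now is still counted.
  onHidden(nowMs: number): void {
    this.flush(nowMs);
    this.visibleSince = null;
  }

  // Call periodically (for example, from a timer) while the page is open.
  tick(nowMs: number): void {
    this.flush(nowMs);
  }

  // Adds elapsed visible time and sends the event at most once.
  private flush(nowMs: number): void {
    if (this.visibleSince !== null) {
      this.accumulatedMs += nowMs - this.visibleSince;
      this.visibleSince = nowMs;
    }
    if (!this.sent && this.accumulatedMs >= this.minVisibleMs) {
      this.sent = true;
      this.instrument.submitInteraction("page_view", {
        page_title: this.pageTitle,
        visible_ms: this.accumulatedMs,
      });
    }
  }
}
```

Taking timestamps as arguments keeps the rule independent of any particular browser or app API, which matters because the same definition would need to be reimplemented in each pageview-producing client.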
However, to do it properly, there are a number of additional things that would have to happen:
- Legal and privacy review
- Assessment of how the CDN, EventGate, and Kafka would handle the extra load
  - EventGate and Kafka could likely handle 100% instrumentation of page views
  - It is possible that the CDN could not do so during major traffic spikes
  - Sampling could be used to address load constraints, although this would add a certain amount of noise and make it somewhat more complicated to use the data
- Development of additional capabilities for the Experimentation Lab
  - Support for "baseline metrics"/always-on data collection/all-wiki coverage (included in GrowthBook)
  - Bot detection (this is already on the roadmap, as bot traffic can contaminate experiment data)
- Adaptation of the page view definition to the client-side context
- Implementation of the full page view definition in each event-producing client
- Testing and analysis comparing the data to logged page views
- Implementing data pipelines for aggregation, refinement, and serving
  - Could potentially share much of the code already used for log-based pageviews
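To make the sampling trade-off mentioned above more concrete, here is a minimal sketch of how sampled counts could be scaled back up and how much noise that introduces. It assumes each page view is sampled independently with the same probability (Bernoulli sampling), which may not match how sampling would actually be configured; the function names are hypothetical.

```typescript
interface ViewEstimate {
  estimate: number;      // estimated total page views
  standardError: number; // approximate standard error of that estimate
}

// Hypothetical client-side decision: report this view only if a uniform
// random draw in [0, 1) falls below the configured sample rate.
function isSampled(randomDraw: number, sampleRate: number): boolean {
  return randomDraw < sampleRate;
}

// Scales a sampled event count up to an estimated total.
// Under independent Bernoulli sampling with rate p, k observed events give
// an estimate of k / p with standard error of about sqrt(k * (1 - p)) / p.
function estimateTotalViews(sampledCount: number, sampleRate: number): ViewEstimate {
  if (!(sampleRate > 0 && sampleRate <= 1)) {
    throw new RangeError("sampleRate must be in (0, 1]");
  }
  if (!Number.isInteger(sampledCount) || sampledCount < 0) {
    throw new RangeError("sampledCount must be a non-negative integer");
  }
  return {
    estimate: sampledCount / sampleRate,
    standardError: Math.sqrt(sampledCount * (1 - sampleRate)) / sampleRate,
  };
}
```

Under these assumptions, 200 sampled events at a rate of 0.25 correspond to an estimate of 800 views with a standard error of about 49, or roughly 6% noise. The relative noise shrinks for high-traffic pages and grows for low-traffic ones, which is part of what makes sampled data more complicated to use.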
Adoption
There would also be substantial challenges in adopting the new type of page view data.
If the intent were for it to replace the current log-based data, there would have to be an even more exhaustive analysis of the differences and a major project to educate the large population of data users throughout the movement about the discontinuity caused by the switch. In addition, the current implementation would likely need to be maintained in parallel for at least a year in order to clearly distinguish real changes in this vital metric from changes caused by the switch.
If the intent were for it simply to complement the current log-based data, this would decrease the risk substantially, but it would also add new challenges. How would we reconcile two distinct metrics? A composite metric would be difficult to develop and confusing to end users. On the other hand, using both side-by-side would increase the burden on analysts in making sense of the trends and in communicating them clearly. Similarly, other data users like the Pageviews Tool would have to decide whether to choose one definition or support the complexity of two definitions (pushing some of the burden to its own users).
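Either adoption path depends on understanding how the two series differ while both are being produced. As one possible starting point, the sketch below computes the daily ratio of instrumented to logged page views and how far that ratio drifts from its average. The data shape and function name are hypothetical, and a real analysis would likely also break the comparison down by project, platform, and traffic type.

```typescript
interface DailyCounts {
  date: string;         // e.g. "2024-01-01"
  logged: number;       // page views from the log-based pipeline
  instrumented: number; // page views from client-side instrumentation
}

interface SeriesComparison {
  dailyRatios: Array<{ date: string; ratio: number }>;
  meanRatio: number;    // NaN if no day has logged views
  maxDeviation: number; // largest absolute distance of a daily ratio from the mean
}

// Compares the two series day by day; days with no logged views are skipped
// because the ratio is undefined for them.
function comparePageViewSeries(days: DailyCounts[]): SeriesComparison {
  const dailyRatios = days
    .filter((day) => day.logged > 0)
    .map((day) => ({ date: day.date, ratio: day.instrumented / day.logged }));

  if (dailyRatios.length === 0) {
    return { dailyRatios, meanRatio: Number.NaN, maxDeviation: Number.NaN };
  }

  const meanRatio =
    dailyRatios.reduce((sum, day) => sum + day.ratio, 0) / dailyRatios.length;
  const maxDeviation = Math.max(
    ...dailyRatios.map((day) => Math.abs(day.ratio - meanRatio)),
  );
  return { dailyRatios, meanRatio, maxDeviation };
}
```

A stable ratio over a long parallel run would make it easier to tell a real change in readership apart from a change caused by the measurement method; a ratio that drifts would itself need to be explained before either series could be trusted on its own.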
Relevant teams
The following Wikimedia Foundation teams are most relevant:
- SRE Traffic
  - Manages the CDN infrastructure which would need to handle the additional load
- Data Engineering
  - Owns the current log-based page view implementation and would likely own the instrumented page view implementation
  - Would need to manage availability of instrumented page view data in Data Lake tables, dumps, and APIs
- Experiment Platform
  - Provides the platform capabilities which would likely be used to implement instrumented page views
  - Considers owning or driving the implementation of instrumented page views to be out of its scope
- Movement Insights
  - Major user of page view data, highly involved in the page view definition and page view data quality
See also
Please edit this list to keep track of use cases and relevant tasks as they arise.
- T368303: REQUEST: Add Special:AllEvents to allowlist for campaigns-product pageview tracking
- T240676: Develop a consistent rule for which special pages count as pageviews
- T304362: Pageview definition relies on X-Analytics to determine special pages
- T113817: Add request_id to webrequest logs as well as other event records ingested into Hadoop
- T310732: "Source of truth" dataset for pageviews
- T336361: [Analytics] [MOB EDIT M1] Identify access from mobile vs. desktop devices
- T346463: Identify and label prefetch proxy data in our traffic
- T366004: Add page-title to the x_analytics header
- T325544: Update refinery-source PageviewDefinition to better handle `Special:` pages
- T184793: [EPIC] Instrument page interactions
- T186728: Record and aggregate page previews
- T329471: Measuring reader visits to core project pages (on Vector 2022)