Maniphest T190174

descriptive metric: traffic distribution
Open, Low, Public

Assigned To

None

Authored By

	• JKatzWMF
	Mar 20 2018, 3:50 PM

Description

[Placeholder]
We often conflate # of pages with size of impact, and this is a really unfortunate proxy because the value of pages is so unevenly distributed. One example might be species or genes, for which there are a great many pages, but each page gets minimal traffic.

When we stop to consider it, we intuitively assume that traffic on Wikipedia follows a power-law distribution among articles, with a small % of articles accounting for the vast majority of views. However, the precise breakdown is unknown. Another unfortunate proxy is to look at the top 100 pages (which are mostly celebrities, media and recent events) and deduce that the majority of pageviews belong to these topic areas. Depending on the shape of the distribution, this might be far from true.

Simple approach: Plot distribution

For each filter available to pageviews (platform, referrer, country, etc.), we should establish what the distribution of pageviews is. Do the top 100 pages represent 5% of traffic or 30%? Beyond answering that question, however, this approach does not help us make many product-specific decisions.
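A minimal sketch of the "top N share" calculation, assuming per-page view counts are already available as a simple mapping. The function name `top_n_share` and the sample numbers are hypothetical, made up for illustration, and are not real pageview data.

```python
def top_n_share(pageviews, n):
    """Return the fraction of all views captured by the n most-viewed pages.

    `pageviews` maps a page title to its view count (hypothetical input shape).
    """
    counts = sorted(pageviews.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:n]) / total


# Made-up sample: two popular pages and three long-tail pages.
sample = {"A": 500, "B": 300, "C": 100, "D": 50, "E": 50}
share_top2 = top_n_share(sample, 2)  # (500 + 300) / 1000
```

Running the same function once per filter value (for example, once per platform or country) would give the per-filter breakdown described above.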

Robust approach: Per-dimension %

A more useful approach might be to provide a table in which one could choose a category or other content qualifier (article class, article length, # of media files, # of inbound/outbound links, network centrality, etc.) and then find out what % of pages and what % of pageviews it represents (again, filterable by the dimensions available for pageviews: platform, referrer, country, etc.).
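As a rough sketch of the per-dimension table, assuming each page is a record carrying a content qualifier and a view count: the function `share_by_qualifier`, the field names `category` and `views`, and the sample records are all hypothetical and only illustrate the shape of the output.

```python
from collections import defaultdict


def share_by_qualifier(pages, key):
    """For each value of `key`, return its % of pages and % of pageviews."""
    page_counts = defaultdict(int)
    view_counts = defaultdict(int)
    for page in pages:
        group = page[key]
        page_counts[group] += 1
        view_counts[group] += page["views"]

    total_pages = len(pages)
    total_views = sum(view_counts.values())
    return {
        group: {
            "pct_pages": 100 * page_counts[group] / total_pages,
            "pct_views": (100 * view_counts[group] / total_views
                          if total_views else 0.0),
        }
        for group in page_counts
    }


# Made-up sample: many low-traffic species pages, one high-traffic page.
pages = [
    {"title": "Species 1", "category": "species", "views": 10},
    {"title": "Species 2", "category": "species", "views": 10},
    {"title": "Species 3", "category": "species", "views": 10},
    {"title": "Celebrity 1", "category": "celebrity", "views": 970},
]
table = share_by_qualifier(pages, "category")
```

In this made-up sample, species pages are 75% of pages but only 3% of pageviews, which is the kind of mismatch between page count and impact that the description is concerned with.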

Output mode

In either case, a pivot-table or Superset implementation of the tool would be preferred, so that PMs and analysts can drill down on their own and data analysts can focus on more advanced work.

Related Objects

Status	Assigned	Task
Declined	None	T298924 Superset - Product Analytics Canonical Dashboards, Reports, and Datasets
Open	kzimmerman	T234701 "Content" equivalent of pageviews daily or edits_hourly available to use in Turnilo and Superset
Declined	None	T190113 Relationship between content and traffic by wiki
Open	None	T190174 descriptive metric: traffic distribution