descriptive metric: traffic distribution
Open, Low, Public


We often conflate the number of pages with the size of impact. This is an unfortunate proxy, because value is distributed unevenly across pages. One example is species or genes: there are a great many such pages, but each one gets minimal traffic.

Intuitively, we assume that traffic on Wikipedia follows a power-law distribution across articles, with a small percentage of articles accounting for the vast majority of views. However, the precise breakdown is unknown. Another unfortunate proxy is to look at the top 100 pages (which are mostly celebrities, media, and recent events) and conclude that the majority of pageviews belong to these topic areas. Depending on the shape of the distribution, this might be far from true.

Simple approach: Plot distribution

For each of the filters available for pageviews (platform, referrer, country, etc.), we should establish what the distribution of pageviews is. Do the top 100 pages represent 5% of traffic or 30%? Beyond answering that question, however, this does not help us make many product-specific decisions.
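As a rough illustration of the "top 100 pages" question, here is a minimal Python sketch that computes the share of total views captured by the N most-viewed pages. The function name `top_n_share` and the view counts are made up for illustration; real input would come from whatever pageview dataset is used, after applying the chosen filters.

```python
def top_n_share(views, n):
    """Return the fraction of total views captured by the n most-viewed pages.

    `views` is a list of per-page view counts (hypothetical input format).
    """
    total = sum(views)
    if total == 0:
        return 0.0
    # Sort descending and keep only the n largest counts.
    top = sorted(views, reverse=True)[:n]
    return sum(top) / total


# Made-up per-page view counts, for illustration only.
example_views = [500, 300, 100, 50, 25, 15, 5, 3, 1, 1]

for n in (1, 2, 5):
    print(f"top {n}: {top_n_share(example_views, n):.1%} of views")
```

Running the same function over a range of N values gives the cumulative curve, which would show whether the distribution is as steep as we assume.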

Robust approach: Per-dimension %

A more useful approach might be to provide a table in which one could choose a category or other content qualifier (article class, article length, # of media files, # of inbound/outbound links, network centrality, etc.) and then find out what % of pages and what % of pageviews it represents (again, filterable by the dimensions available for pageviews: platform, referrer, country, etc.).
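The per-dimension table could be computed along these lines. This is only a sketch under assumptions: the input is taken to be (category, views) pairs, one per page, and the function name `share_by_category` and the sample data are hypothetical. Any other content qualifier (article class, length bucket, etc.) could stand in for the category.

```python
from collections import defaultdict


def share_by_category(pages):
    """Map each category to (% of pages, % of pageviews).

    `pages` is an iterable of (category, views) pairs, one per page
    (hypothetical input format).
    """
    page_counts = defaultdict(int)
    view_counts = defaultdict(int)
    for category, views in pages:
        page_counts[category] += 1
        view_counts[category] += views

    total_pages = sum(page_counts.values())
    total_views = sum(view_counts.values())
    return {
        category: (
            100 * page_counts[category] / total_pages,
            100 * view_counts[category] / total_views if total_views else 0.0,
        )
        for category in page_counts
    }


# Made-up data: many low-traffic pages versus one high-traffic page.
example_pages = [
    ("species", 10),
    ("species", 5),
    ("species", 5),
    ("celebrity", 80),
]

for category, (pct_pages, pct_views) in share_by_category(example_pages).items():
    print(f"{category}: {pct_pages:.0f}% of pages, {pct_views:.0f}% of pageviews")
```

In this made-up sample, "species" is 75% of pages but only 20% of pageviews, which is the kind of gap between page count and impact described above.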

Output mode

In either case, a Pivot or Superset implementation of the tool would be preferred, so that PMs and analysts can drill down on their own and data analysts can focus on more advanced work.

Event Timeline

MBinder_WMF triaged this task as Medium priority. Aug 2 2018, 8:23 PM
JKatzWMF lowered the priority of this task from Medium to Low. Sep 27 2018, 8:39 PM
nshahquinn-wmf added a subscriber: MNeisler.