Improve joining mechanism between webrequest data and edit data for i.e. sampling pageviews
Open, Normal, Public

Description

There is now an API to get pageview counts, but another frequently needed task would make a great addition: getting pageview samples (i.e. selecting pages at random, weighted by pageview counts).

This is hard to do in a general way, but with the assumption that people would typically want to sample article-namespace views from all articles of a given wiki, the same design that's behind Special:Random could be used:

  1. assuming we have N rows in the daily (or monthly, etc.) table for a given wiki, with pageview counts p1...pN, store for row k the value (p1 + ... + pk) / (p1 + ... + pN) as field random
  2. index the table on random
  3. on every API request, for each element in the sample, generate a random number in [0..1) and return the row for SELECT ... WHERE random >= [number] ORDER BY random ASC LIMIT 1 (the first row whose cumulative share reaches the number, so row k is returned with probability pk / (p1 + ... + pN))
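
The three steps above can be sketched as follows. This is a minimal illustration only, using an in-memory SQLite table in place of the real pageview store; the table name pageviews, the column name cum (standing in for the field called random above) and both function names are assumptions made for this example. It also assumes every row has a positive view count, since rows with zero views would tie with their predecessor.

```python
import random
import sqlite3


def build_table(conn, counts):
    """Step 1 and 2: store each page with its cumulative pageview share, then index it."""
    conn.execute("CREATE TABLE pageviews (title TEXT, views INTEGER, cum REAL)")
    total = sum(views for _, views in counts)
    running = 0
    for title, views in counts:
        running += views
        # cum for row k is (p1 + ... + pk) / (p1 + ... + pN); the last row gets 1.0
        conn.execute(
            "INSERT INTO pageviews VALUES (?, ?, ?)",
            (title, views, running / total),
        )
    conn.execute("CREATE INDEX pageviews_cum ON pageviews (cum)")


def sample_page(conn, rng=random):
    """Step 3: draw r in [0, 1) and return the first row whose cumulative share is >= r."""
    r = rng.random()
    row = conn.execute(
        "SELECT title FROM pageviews WHERE cum >= ? ORDER BY cum ASC LIMIT 1",
        (r,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_table(conn, [("Main_Page", 90), ("Foo", 9), ("Bar", 1)])
    print([sample_page(conn) for _ in range(5)])
```

With the index on the cumulative column, each sampled element costs a single index lookup, which is the same property that makes Special:Random cheap.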
Tgr created this task. Feb 8 2016, 11:45 PM
Tgr updated the task description.
Tgr raised the priority of this task from to Needs Triage.
Tgr added a project: Analytics.
Tgr added a subscriber: Tgr.
Restricted Application added subscribers: StudiesWorld, Aklapper. Feb 8 2016, 11:45 PM
Milimetric triaged this task as Normal priority. Feb 11 2016, 6:10 PM
Milimetric set Security to None.
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.
Nuria added a subscriber: Nuria. Oct 24 2016, 3:47 PM

Who would be the users of such an API? What is its value proposition?

Also, you can select a random (time-wise) subset of pageviews using Hive commands; it seems that this would be sufficient.

Tgr added a comment. Oct 24 2016, 9:58 PM

Random pages are useful for checking the user impact of a bug or a feature that does not handle some special case. If I sample 1000 pages and 10 of them look broken, that means roughly 1% of our pageviews will be affected. That is a more useful metric than the number of pages affected (if the main page breaks, that is a big deal, even though it is just a single page). For real-life examples see T120504 and T125977.
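
The "10 out of 1000" arithmetic can be made explicit with a small sketch. The function name affected_share is hypothetical, and the interval is an addition for illustration that the comment itself does not claim: it uses the plain normal approximation, which is rough for shares this close to zero (a Wilson interval would be a more careful choice).

```python
import math


def affected_share(sample_size, broken, z=1.96):
    """Return (estimate, lower, upper) for the share of pageviews affected.

    Assumes the sample was drawn weighted by pageview counts, so the share
    of broken pages in the sample estimates the share of affected pageviews.
    """
    p = broken / sample_size
    # Half-width of the normal-approximation interval (z=1.96 is about 95%)
    half = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - half), min(1.0, p + half)


if __name__ == "__main__":
    estimate, lower, upper = affected_share(1000, 10)
    print(f"{estimate:.1%} affected (roughly {lower:.1%} to {upper:.1%})")
```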

Yes, this can be done manually with a hive query, but then that can be said about all Analytics APIs :)

Nuria added a comment. Oct 25 2016, 2:44 PM

> Yes, this can be done manually with a hive query, but then that can be said about all Analytics APIs :)

Not really, the APIs exist to provide aggregated (long-term) data for WMF and the community. In this case you are looking for an aid for operational issues. It seems your use case is WMF-internal and short-lived, and thus better served by an ad-hoc query than by a full-fledged API.

mforns added a subscriber: mforns. Jul 31 2017, 4:07 PM

From the team discussion at the backlog grooming meeting: it looks as if this use case is not suited for a scalable API. It would be better solved by Hive queries.
Now, a better way to join the webrequest data and the edit data would help with that.

mforns renamed this task from Provide API for sampling pageviews to Improve joining mechanism between webrequest data and edit data for i.e. sampling pageviews. Jul 31 2017, 4:08 PM
mforns moved this task from Dashiki to Backlog (Later) on the Analytics board.
fdans added a subscriber: fdans. Oct 9 2017, 4:19 PM

You can join webrequest and edit data using page IDs for desktop and mobile web traffic, but not for app traffic.
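
A join on page IDs could look roughly like the sketch below. All field names (wiki, page_id, title) and both function names are hypothetical, and the real data lives in Hive tables, not Python dicts. The sketch also assumes that requests which cannot be joined simply have no page_id, which is only one way the app-traffic limitation might show up.

```python
from collections import Counter


def views_per_page(requests):
    """Count views per (wiki, page_id); rows without a page id are skipped and tallied."""
    counts = Counter()
    skipped = 0
    for req in requests:
        if req.get("page_id") is None:
            skipped += 1  # assumed shape of traffic that cannot be joined
            continue
        counts[(req["wiki"], req["page_id"])] += 1
    return counts, skipped


def join_with_pages(counts, pages):
    """Inner join of the view counts with page metadata keyed by (wiki, page_id)."""
    return [
        {**pages[key], "views": views}
        for key, views in counts.items()
        if key in pages
    ]


if __name__ == "__main__":
    requests = [
        {"wiki": "enwiki", "page_id": 1},
        {"wiki": "enwiki", "page_id": 1},
        {"wiki": "enwiki", "page_id": None},
    ]
    pages = {("enwiki", 1): {"title": "Main_Page"}}
    counts, skipped = views_per_page(requests)
    print(join_with_pages(counts, pages), "skipped:", skipped)
```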

fdans moved this task from Backlog (Later) to Deprioritized on the Analytics board. Oct 9 2017, 4:20 PM
Tbayer added a subscriber: Tbayer. May 31 2018, 11:43 PM