Publish data on seen page previews
Open, Low, Public

Description

With work on T186728: Record and aggregate page previews nearing completion, we should start considering what data from the resulting internal table wmf.virtualpageview_hourly can be made public, and in what form. This should enable e.g. our editing community and academic researchers to investigate questions such as:

  • How often has a given Wikipedia article been previewed on a given day?
  • What are the most previewed links in a given Wikipedia article?
  • On which Wikipedia articles in a given set is the preview feature used most often, on average?
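
To make the first two questions concrete, here is a minimal sketch of how they could be answered from a published dataset. The record layout (source page, previewed page, day, count) and the function names are assumptions for illustration only; nothing has been decided about the actual public schema.

```python
from collections import Counter

# Hypothetical records: (source_page, previewed_page, day, preview_count).
# This layout is an assumption, not the schema of any existing dataset.
records = [
    ("Cat", "Felidae", "2018-07-01", 5),
    ("Cat", "Mammal", "2018-07-01", 2),
    ("Cat", "Felidae", "2018-07-02", 1),
    ("Dog", "Mammal", "2018-07-01", 4),
]

def previews_on_day(records, page, day):
    """Total previews of `page` on `day`, summed over all source pages."""
    return sum(n for _, target, d, n in records if target == page and d == day)

def top_previewed_links(records, source, limit=10):
    """Most previewed links on the article `source`, summed over all days."""
    totals = Counter()
    for src, target, _, n in records:
        if src == source:
            totals[target] += n
    return totals.most_common(limit)
```

Note that the second question can only be answered if the public data keeps the source page of each preview, which is one of the content decisions still to be made.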

For privacy reasons, we won't be able to make all of the data from wmf.virtualpageview_hourly public. This is similar to the situation for our existing data about normal pageviews, where the public datasets (e.g. pageviews, clickstream) contain less information than e.g. the private pageview_hourly and webrequest tables. In particular, we will probably want to remove all or most of the data derived from IP address and user agent, such as geolocation or browser type.
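
As a rough illustration of what this reduction could look like, the sketch below aggregates hourly rows into daily per-page counts and keeps only an allowlist of fields. All field names (`project`, `page_title`, `day`, `country`, `browser_family`, `view_count`) are hypothetical; the real wmf.virtualpageview_hourly schema may differ, and the set of fields that is safe to publish is exactly what this task needs to decide.

```python
from collections import Counter

# Hypothetical allowlist of fields considered safe to publish.
PUBLIC_KEYS = ("project", "page_title", "day")

def to_public_daily_counts(rows):
    """Sum view counts per PUBLIC_KEYS combination; all other fields are dropped."""
    totals = Counter()
    for row in rows:
        key = tuple(row[k] for k in PUBLIC_KEYS)
        totals[key] += row["view_count"]
    return [dict(zip(PUBLIC_KEYS, key), view_count=n)
            for key, n in sorted(totals.items())]

# Made-up sample rows, including fields derived from IP and user agent.
sample_rows = [
    {"project": "en.wikipedia", "page_title": "Cat", "day": "2018-07-01",
     "hour": 0, "country": "US", "browser_family": "Firefox", "view_count": 3},
    {"project": "en.wikipedia", "page_title": "Cat", "day": "2018-07-01",
     "hour": 1, "country": "DE", "browser_family": "Chrome", "view_count": 2},
    {"project": "en.wikipedia", "page_title": "Dog", "day": "2018-07-01",
     "hour": 0, "country": "US", "browser_family": "Chrome", "view_count": 4},
]

public_rows = to_public_daily_counts(sample_rows)
```

Using an allowlist rather than a blocklist means that any field added to the internal table later stays private by default.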

  • Decide about the content of the public dataset(s)
  • Decide about format and location (e.g. an API, as for pageviews, and/or a dataset published on dumps.wikimedia.org, as for pageviews)
  • Productionize

CCing @Tomayac and @fhoffa who recently asked about this on Twitter (alongside another researcher, Dimitar Dimitrov).

Event Timeline

Vvjjkkii renamed this task from Publish data on seen page previews to nvdaaaaaaa. Jul 1 2018, 1:13 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Tbayer renamed this task from nvdaaaaaaa to Publish data on seen page previews. Jul 1 2018, 6:16 AM
Tbayer lowered the priority of this task from High to Medium.
Tbayer updated the task description.
Tbayer added a subscriber: Aklapper.
mforns lowered the priority of this task from Medium to Low. Apr 22 2019, 3:48 PM