With work on T186728: Record and aggregate page previews nearing completion, we should start considering what data from the resulting internal table wmf.virtualpageview_hourly can be made public, and in what form. This should enable e.g. our editing community and academic researchers to investigate questions such as:
- How often has a given Wikipedia article been previewed on a given day?
- What are the most previewed links in a given Wikipedia article?
- On which Wikipedia articles in a given set is the preview feature used most often, on average?
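To make the first two questions concrete, here is a minimal sketch of how they could be answered from a public hourly dataset. The row layout (project, source page, previewed page, date, hour, count), the sample values, and the function names are all assumptions for illustration; the actual public schema is exactly what this task still has to decide.

```python
from collections import defaultdict

# Hypothetical rows of a public preview dataset (NOT the real schema):
# (project, source_page, previewed_page, date, hour, previews)
ROWS = [
    ("en.wikipedia", "Cat", "Felidae", "2018-03-01", 0, 12),
    ("en.wikipedia", "Cat", "Felidae", "2018-03-01", 1, 8),
    ("en.wikipedia", "Cat", "Dog", "2018-03-01", 0, 5),
    ("en.wikipedia", "Dog", "Felidae", "2018-03-01", 3, 4),
    ("en.wikipedia", "Dog", "Wolf", "2018-03-02", 3, 9),
]


def daily_previews(rows, project, page, date):
    """Total previews of `page` on `date`, summed over hours and source pages."""
    return sum(
        count
        for proj, _source, previewed, day, _hour, count in rows
        if proj == project and previewed == page and day == date
    )


def top_previewed_links(rows, project, source_page, n=10):
    """Links previewed most often from within `source_page`, highest count first."""
    totals = defaultdict(int)
    for proj, source, previewed, _day, _hour, count in rows:
        if proj == project and source == source_page:
            totals[previewed] += count
    # Sort by descending count, then by title so ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    print(daily_previews(ROWS, "en.wikipedia", "Felidae", "2018-03-01"))
    print(top_previewed_links(ROWS, "en.wikipedia", "Cat"))
```

Note that the second question can only be answered if the public data keeps the source page alongside the previewed page, which is one of the content decisions listed below.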
For privacy reasons, we won't be able to make all of the data from wmf.virtualpageview_hourly public. This is similar to the situation for our existing data about normal pageviews, where the public datasets (e.g. pageviews, clickstream) contain less information than e.g. the private pageview_hourly and webrequest tables. In particular, we will probably want to remove all or most of the data derived from IP and user agent, such as geolocation or browser type.
- Decide about the content of the public dataset(s)
- Decide about format and location (e.g. an API like for pageviews, and/or a dataset published on dumps.wikimedia.org like for pageviews)
CCing @Tomayac and @fhoffa who recently asked about this on Twitter (alongside another researcher, Dimitar Dimitrov).