With work on {T186728} nearing completion, we should start considering what data from the resulting internal table `wmf.virtualpageview_hourly` can be made public, and in what form. This should enable e.g. our editing community and academic researchers to investigate questions such as:
- How often has a given Wikipedia article been previewed on a given day?
- What are the most previewed links in a given Wikipedia article?
- On which Wikipedia articles in a given set is the preview feature used most often, on average?
For privacy reasons, we won't be able to make all of the data from `wmf.virtualpageview_hourly` public. This is similar to the situation for our existing data about normal pageviews, where the public datasets (e.g. [[https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews|pageviews]], [[https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream|clickstream]]) contain less information than e.g. the private [[https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly|pageview_hourly]] and webrequest tables. In particular, we will probably want to remove all or most of the data derived from IP and user agent, such as geolocation or browser type.
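As a rough illustration of what removing those fields could look like, here is a minimal sketch that keeps only a small set of dimensions and sums the counts over everything else. All field names (`project`, `page_title`, `hour`, `country`, `browser`, `view_count`) and the sample rows are assumptions made up for this example; they are not taken from the actual schema of `wmf.virtualpageview_hourly`.

```python
from collections import Counter

# Hypothetical dimensions to keep; the real choice is still to be decided.
PUBLIC_DIMENSIONS = ("project", "page_title", "hour")


def aggregate_public(rows):
    """Sum view counts over the public dimensions, dropping all other fields."""
    totals = Counter()
    for row in rows:
        key = tuple(row[d] for d in PUBLIC_DIMENSIONS)
        totals[key] += row["view_count"]
    return [
        dict(zip(PUBLIC_DIMENSIONS, key), view_count=n)
        for key, n in sorted(totals.items())
    ]


# Made-up sample rows, including fields derived from IP and user agent.
rows = [
    {"project": "en.wikipedia", "page_title": "Cat", "hour": "2018-04-18T10",
     "country": "DE", "browser": "Firefox", "view_count": 3},
    {"project": "en.wikipedia", "page_title": "Cat", "hour": "2018-04-18T10",
     "country": "US", "browser": "Chrome", "view_count": 5},
    {"project": "en.wikipedia", "page_title": "Dog", "hour": "2018-04-18T10",
     "country": "US", "browser": "Chrome", "view_count": 2},
]

public_rows = aggregate_public(rows)
```

Which dimensions stay in the public output is exactly what the first checklist item below would need to settle.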
[ ] Decide about the content of the public dataset(s)
[ ] Decide about format and location (e.g. an API [[https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews|like for pageviews]], and/or a dataset published on dumps.wikimedia.org [[https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews|like for pageviews]])
[ ] Productionize
CCing @tomayac and @fhoffa who recently [[https://twitter.com/WikiResearch/status/986558448429969408 |asked about this on Twitter]] (alongside another researcher, Dimitar Dimitrov).