Maniphest T193524

Publish data on seen page previews
Open, Low, Public

Assigned To

None

Authored By

	Tbayer
	May 1 2018, 6:47 PM

Description

With work on T186728: Record and aggregate page previews nearing completion, we should start considering what data from the resulting internal table wmf.virtualpageview_hourly can be made public, and in what form. This should enable e.g. our editing community and academic researchers to investigate questions such as:

How often has a given Wikipedia article been previewed on a given day?
What are the most previewed links in a given Wikipedia article?
On which Wikipedia articles in a given set is the preview feature used most often, on average?
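
As a rough illustration of the first two questions, here is a minimal sketch of how they could be answered from a public dataset. The row layout (`PreviewRow` and all of its field names) is purely hypothetical and is not the schema of wmf.virtualpageview_hourly; what the public data actually contains is exactly what this task still has to decide.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class PreviewRow(NamedTuple):
    # Hypothetical public schema, invented for illustration only.
    project: str        # e.g. "en.wikipedia"
    source_title: str   # article containing the hovered link
    target_title: str   # article shown in the preview
    date: str           # ISO date, e.g. "2018-05-01"
    previews: int       # number of seen previews in this row


def daily_previews(rows: Iterable[PreviewRow], project: str,
                   title: str, date: str) -> int:
    """How often was `title` previewed on `date`, from any source article?"""
    return sum(r.previews for r in rows
               if r.project == project
               and r.target_title == title
               and r.date == date)


def top_previewed_links(rows: Iterable[PreviewRow], project: str,
                        source_title: str, n: int = 3) -> List[Tuple[str, int]]:
    """Which links in `source_title` were previewed most, over all dates?"""
    counts: Counter = Counter()
    for r in rows:
        if r.project == project and r.source_title == source_title:
            counts[r.target_title] += r.previews
    return counts.most_common(n)
```

Both functions need only per-article counts, so they would keep working if all fields derived from IP and user agent were removed.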

For privacy reasons, we won't be able to make all of the data from wmf.virtualpageview_hourly public. This is similar to the situation for our existing data about normal pageviews, where the public datasets (e.g. pageviews, clickstream) contain less information than e.g. the private pageview_hourly and webrequest tables. In particular, we will probably want to remove all or most of the data derived from IP and user agent, such as geolocation or browser type.
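
One possible shape for that removal step, sketched under stated assumptions: every field name below (`country`, `browser_family`, `view_count`, and so on) is a hypothetical stand-in rather than an actual column of wmf.virtualpageview_hourly, and a production job would presumably run as a Hive or Spark aggregation rather than in plain Python. The idea is to drop the sensitive fields and then re-aggregate, so that rows which differed only in those fields collapse into a single count.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical names for fields derived from IP and user agent.
SENSITIVE_FIELDS = {"country", "city", "browser_family", "os_family"}
# Hypothetical name of the count column.
COUNT_FIELD = "view_count"


def to_public_rows(private_rows: Iterable[Dict]) -> List[Dict]:
    """Drop sensitive fields and sum counts over the remaining dimensions."""
    totals: Dict[tuple, int] = defaultdict(int)
    for row in private_rows:
        # Key on every non-sensitive dimension, in a stable order.
        key = tuple(sorted(
            (name, value) for name, value in row.items()
            if name not in SENSITIVE_FIELDS and name != COUNT_FIELD
        ))
        totals[key] += row[COUNT_FIELD]
    # Rebuild one dict per remaining dimension combination.
    return [dict(key, **{COUNT_FIELD: count}) for key, count in totals.items()]
```

Which fields count as sensitive, and whether further safeguards are needed beyond dropping them, is part of the decision this task asks for; the set above is only a placeholder.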

Decide about the content of the public dataset(s)
Decide about format and location (e.g. an API like the one for pageviews, and/or a dataset published on dumps.wikimedia.org, as is also done for pageviews)
Productionize

CCing @Tomayac and @fhoffa who recently asked about this on Twitter (alongside another researcher, Dimitar Dimitrov).

Related Objects

Status	Assigned	Task
Open	None	T193524 Publish data on seen page previews
Resolved	mforns	T186728 Record and aggregate page previews
Resolved	None	T184793 [EPIC] Instrument page interactions
Resolved	ovasileva	T182414 [Spike] How can we measure seen page previews with as high a degree of accuracy as possible?
Resolved	Tbayer	T190188 VirtualPageView schema should not use EventLogging api to send virtual page view events
Resolved	None	T191471 VirtualPageViews should send titles with spaces substituted with underscores
Resolved	mforns	T192305 Index and store page preview agreggates on Druid so they are visible in pivot/superset
Resolved	phuedx	T196904 Some VirtualPageView are too long and fail EventLogging processing
Declined	None	T197243 Change virtualpageview agreggation so it does not use source_url
Resolved	Ottomata	T186833 Include X-Client-IP in EventLogging data and geocode during Hive JSON Refinement
Duplicate	mforns	T188310 Write agreggation job for eventlogging page preview data