
Publish aggregated reading time dataset
Closed, Resolved · Public

Description

I analyzed data from the reading depth plugin in a research project studying differences in Wikimedia readership between the Global South and the Global North. I think the data from the plugin could be very useful to outside researchers. I propose stripping potentially identifying variables from the data and releasing it to the public in aggregated form. Ultimately, it would be wonderful to put the data on PAWS, tools.wmflabs.org, and stats.wikimedia.org, but these would require someone to maintain a data pipeline to keep the data up to date.
So, to start, it would be easier to just generate .csv files and publish them to a public repository like figshare.

I think we should release a dataset with the following variables:

date, page_title, page_id, namespaceId,
mean_totalLength, mean_visibleLength, mean_log_totalLength, mean_log_visibleLength,
median_totalLength, median_visibleLength, var_totalLength, var_visibleLength,
var_log_totalLength, var_log_visibleLength, sample_size
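
To make the proposal concrete, here is a minimal pandas sketch of how these aggregates could be computed from an event-level extract. The column names and example values are hypothetical (they just mirror the variables listed above); the real job would presumably run against the Reading_depth event data on the cluster.

```
import numpy as np
import pandas as pd

# Hypothetical event-level extract; columns mirror the schema fields
# referenced above (lengths in milliseconds).
events = pd.DataFrame({
    "date": ["2019-02-01"] * 4,
    "page_title": ["Example"] * 4,
    "page_id": [123] * 4,
    "namespaceId": [0] * 4,
    "totalLength": [12000, 45000, 8000, 310000],
    "visibleLength": [9000, 30000, 8000, 60000],
})

# Log-transformed versions of both measures.
for col in ("totalLength", "visibleLength"):
    events[f"log_{col}"] = np.log(events[col])

keys = ["date", "page_title", "page_id", "namespaceId"]
agg = events.groupby(keys).agg(
    mean_totalLength=("totalLength", "mean"),
    mean_visibleLength=("visibleLength", "mean"),
    mean_log_totalLength=("log_totalLength", "mean"),
    mean_log_visibleLength=("log_visibleLength", "mean"),
    median_totalLength=("totalLength", "median"),
    median_visibleLength=("visibleLength", "median"),
    var_totalLength=("totalLength", "var"),
    var_visibleLength=("visibleLength", "var"),
    var_log_totalLength=("log_totalLength", "var"),
    var_log_visibleLength=("log_visibleLength", "var"),
    sample_size=("totalLength", "size"),
).reset_index()

# One row per page and date, written out as the proposed .csv release.
agg.to_csv("reading_time_by_page_day.csv", index=False)
```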

I propose including the totalLength values because totalLength is a simpler measure that some analysts may prefer over visibleLength.

I think it is important to provide means, variances, and sample_size so that the dataset supports statistical comparisons of reading times. Since the data are skewed, statistics on the log-transformed data are probably more appropriate for statistical analysis, but including the untransformed statistics as well seems like a good idea to me.

It is important to provide aggregations of the log-transformed data because the skewness of the raw data violates the normality assumptions behind common statistical tests.
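
As an illustration of the kind of comparison the published aggregates would support, here is a small sketch of Welch's t-test computed directly from the released means, variances, and sample sizes of the log-transformed data. The input numbers are made up.

```
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics.

    Intended use: compare mean_log_visibleLength between two pages (or
    two dates) using only the published aggregates.
    """
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical aggregates for two pages on the same day:
t, df = welch_t(mean1=9.1, var1=1.4, n1=500, mean2=8.7, var2=1.6, n2=420)
print(f"t = {t:.2f}, df ≈ {df:.0f}")
```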

It might also be useful to aggregate by logged-in status (the isAnon variable in the schema). If others agree, it would be good to add it as well, but it might make using the data more complicated, since it would mean having more than one row per page and date.
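
Continuing the aggregation sketch above, adding isAnon to the group keys would look like this (the isAnon values here are hypothetical):

```
# Assume the event extract also carries the schema's isAnon flag
# (True for readers who are not logged in); values are made up.
events["isAnon"] = [True, False, False, True]

keys = ["date", "page_title", "page_id", "namespaceId", "isAnon"]
agg_by_anon = events.groupby(keys).agg(
    mean_log_visibleLength=("log_visibleLength", "mean"),
    var_log_visibleLength=("log_visibleLength", "var"),
    sample_size=("log_visibleLength", "size"),
).reset_index()
# agg_by_anon now has up to two rows per (page, date),
# one per isAnon value.
```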

In the paper mentioned above, we used some data based on GeoIP tables that I don't expect can be made public. I think that the aggregations described above pose minimal threats to the privacy of readers.

I do not know the process for releasing data to the public like this, so I'm looking for help with that!

I still have cluster access and am willing to do the work to set up a pipeline to publish the datasets.

Event Timeline

We have several datasets pending release, so it is not likely we can look at this until a couple of quarters from now, Q3 maybe? See https://phabricator.wikimedia.org/T131280 and https://phabricator.wikimedia.org/T208612, which are in the works.
Also, if we want this to be a one-off data release, @Groceryheist can probably handle it. If this is a request for data that gets released on a schedule (every week, month, ...), the reading team (cc @phuedx) would need to agree to own the instrumentation of Reading_depth going forward.

Hi Nuria. I'm proposing to start with a one-off release that I can handle easily. I can also do some work to set up automated scheduled releases, but I don't want to commit to owning it in the long run.

> Hi Nuria. I'm proposing to start with a one-off release that I can handle easily.

Sounds good. A one-off release is just a file with data in a public folder; no pipeline of any sort is needed. Two things need to happen: the data needs to go through a privacy review, and it needs to be thoroughly documented. An example of a similar dataset: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

At this time we are not sure that the Reading_depth instrumentation will be maintained going forward, so I think anything beyond a one-off release is out of scope.

Documenting any caveats with the data is important. For example, in a multi-tab browsing situation, is this data of good quality? If I open two tabs with Wikipedia content, does the data take into account that I can only read one tab at a time, or does the counter for the total time in page continue counting (thus rendering one of the two tabs' measures incorrect)?

Thanks Nuria!

I think that both measures might be useful for different purposes. It might also be interesting to compare the two measures, for example to see if people tend to open tabs to read later.
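
For illustration, one rough way to do that comparison from the aggregates alone (reusing the agg frame from the sketch in the description): the difference of the log means is the log of the ratio of geometric means, which could serve as a crude proxy for time spent in background tabs.

```
# A large gap means pages sit open (totalLength) much longer than they
# are visible (visibleLength), consistent with opening tabs to read later.
agg["log_ratio_total_vs_visible"] = (
    agg["mean_log_totalLength"] - agg["mean_log_visibleLength"]
)
print(agg.sort_values("log_ratio_total_vs_visible", ascending=False).head())
```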

I am also happy to do the documentation, as we already analyzed the data's limitations in the research report I linked to above.

One thing I could use help with is the privacy review process. I don't know how the process works or how to initiate it. If that is documented anywhere, it would be useful if you could share it with me.

Removing the instrumentation without replacing it with something better would be unfortunate. But I agree that it would be a waste of effort to build an automated pipeline if the data source were to disappear.

@Groceryheist the data stream has been deactivated (this means that the instrumentation is not sending data, but the instrumentation is still present).

> One thing I could use help with is the privacy review process. I don't know how the process works or how to initiate it.

The privacy framework to gauge the risks of datasets is still being developed; at this time you just need to open a ticket that discloses what data you wish to release (for which the dataset needs to exist). An example: https://phabricator.wikimedia.org/T217318 (this is a more detailed version of this ticket). Please keep in mind that we are working on several other data releases, so I doubt we can get to this one this quarter.