I analyzed data from the reading depth plugin in a research project studying differences in Wikimedia readership between the Global South and the Global North. I think the data from the plugin could be very useful for outside researchers. I propose stripping potentially identifying variables from the data and releasing it to the public in aggregated form. Ultimately, it would be wonderful to put the data on PAWS, tools.wmflabs.org, and stats.wikimedia.org, but those would require someone to maintain a data pipeline to keep the data up to date.
So, at first, the easiest approach would be to generate .csv files and publish them to a public repository like figshare.
I think we should release a dataset with the following variables:
date | page_title | page_id | namespaceId |
mean_totalLength | mean_visibleLength | mean_log_totalLength | mean_log_visibleLength |
median_totalLength | median_visibleLength | var_totalLength | var_visibleLength |
var_log_totalLength | var_log_visibleLength | sample_size
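To make the proposed aggregation concrete, here is a sketch of how the table above could be produced with pandas. The input data and raw column names here are hypothetical stand-ins (the actual ReadingDepth event schema may use different field names); only the output columns follow the list above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw per-pageview events: one row per pageview, with
# reading times in milliseconds. Column names are assumptions, not
# necessarily the real ReadingDepth schema fields.
events = pd.DataFrame({
    "date": ["2019-06-01"] * 4,
    "page_title": ["Example"] * 4,
    "page_id": [42] * 4,
    "namespaceId": [0] * 4,
    "totalLength": [12000.0, 45000.0, 8000.0, 300000.0],
    "visibleLength": [9000.0, 30000.0, 7000.0, 120000.0],
})

# Log-transform before aggregating, since reading times are skewed.
for col in ["totalLength", "visibleLength"]:
    events[f"log_{col}"] = np.log(events[col])

# One row per page per day, with means, medians, variances, and counts.
agg = (
    events.groupby(["date", "page_title", "page_id", "namespaceId"])
    .agg(
        mean_totalLength=("totalLength", "mean"),
        mean_visibleLength=("visibleLength", "mean"),
        mean_log_totalLength=("log_totalLength", "mean"),
        mean_log_visibleLength=("log_visibleLength", "mean"),
        median_totalLength=("totalLength", "median"),
        median_visibleLength=("visibleLength", "median"),
        var_totalLength=("totalLength", "var"),
        var_visibleLength=("visibleLength", "var"),
        var_log_totalLength=("log_totalLength", "var"),
        var_log_visibleLength=("log_visibleLength", "var"),
        sample_size=("totalLength", "size"),
    )
    .reset_index()
)

# Publishing would then be as simple as writing a .csv file.
agg.to_csv("reading_depth_daily.csv", index=False)
```

In production this would run over the full event logs (e.g. via Spark or Hive on the analytics cluster), but the shape of the output table is the same.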
I propose including the totalLength values because totalLength is a simpler measure that some analysts might reasonably prefer over visibleLength.
I think it is important to provide means, variances, and sample sizes so that the data can support statistical comparisons of reading times. Because the data are skewed, and skewness violates normality assumptions, statistics computed on the log-transformed values are probably more appropriate for analysis, but including the untransformed aggregates as well seems like a good idea to me.
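As an illustration of why means, variances, and sample sizes are enough for comparisons: with just those three summary statistics per group, an analyst can run a Welch's t-test on the log scale without access to the raw events. The numbers below are made up for the example.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
    return t, df

# Hypothetical aggregated rows for two pages: mean and variance of
# log visibleLength, plus sample_size, as in the proposed dataset.
t, df = welch_t(mean1=9.1, var1=1.4, n1=500, mean2=8.7, var2=1.6, n2=450)
```

The same statistics could equally feed confidence intervals or a regression on page-level covariates; the point is that the aggregates preserve enough information for inference.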
It might also be useful to aggregate by logged-in status (the isAnon variable in the schema). If others agree, it would be good to add it, but it might make the data more complicated to use, since it would mean having more than one row per page per date.
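Concretely, adding isAnon to the grouping key would look like this (again with hypothetical input data and a trimmed-down set of output columns):

```python
import pandas as pd

# Hypothetical events carrying the isAnon flag from the schema.
events = pd.DataFrame({
    "date": ["2019-06-01"] * 4,
    "page_id": [42] * 4,
    "isAnon": [True, True, False, False],
    "visibleLength": [9000.0, 30000.0, 7000.0, 120000.0],
})

# With isAnon in the key, each page/date can yield up to two rows:
# one for anonymous readers and one for logged-in readers.
agg = (
    events.groupby(["date", "page_id", "isAnon"])["visibleLength"]
    .agg(mean_visibleLength="mean", sample_size="size")
    .reset_index()
)
```

Consumers who only care about overall behaviour would then need to re-combine the two rows (weighting by sample_size), which is the extra complexity mentioned above.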
In the research project mentioned above, we used some data based on GeoIP tables that I do not expect can be made public. By contrast, I think the aggregations described above pose minimal threats to readers' privacy.
I do not know the process for releasing data like this to the public, so I'm looking for help with that!
I still have cluster access and am willing to do the work to set up a pipeline to publish the datasets.