Contact person
Michele Tizzoni, Ph.D. Research Leader, ISI Foundation (michele.tizzoni@isi.it)
In WMF: @leila (please ping for sample data)
This task provides the information needed to assess the implications of a data release following the completion of the research project “Quantifying the global attention to public health threats through Wikipedia pageview data”.
The project has resulted in a scientific manuscript which is currently under submission and has already been published online in a public repository [1].
Following best practices in scientific publishing, we would like to release all the data underlying the main findings reported in [1] once the manuscript is accepted for publication in an international, peer-reviewed, Open Access scientific journal.
Overall, the main reasons for releasing the data are:
Reproducibility. Releasing the data to the public domain will allow other researchers to replicate the study.
Increase the overall scientific impact of the data generated by the Wikimedia projects, especially in the field of public health. Data from Wikimedia are of high interest for researchers studying health seeking behaviors, diffusion of (mis)information and global attention to health issues. Releasing the data will foster new research on the above topics.
The data whose release needs to be reviewed can be divided into 3 different datasets, each corresponding to a Figure of the manuscript [1].
Dataset 1
This dataset underlies the results of Figure 1 in Ref [1].
The dataset is stored in the home directory of mtizzoni on stat*.
Content of each line of the dataset: day, pageview_count
The dataset contains the daily number of pageviews to 128 different Wikipedia pages related to the Zika virus (summed into a single total) originating in the United States, from January 1st to December 31st, 2016.
Risk/benefit: The dataset makes it possible to measure the dynamics of pageviews that originated in the United States only, but the risks associated with releasing it are minimal. The total volume of pageviews is aggregated at the national scale. There is no information about pageviews to specific pages in specific languages.
Dataset 2
This dataset underlies the results of Figure 2 and Table 1 in Ref [1].
The dataset is stored in the home directory of mtizzoni on stat*.
Content of each line of the dataset: day, pageview_count, state
The dataset contains the daily number of pageviews to 128 different Wikipedia pages related to the Zika virus (summed into a single total) originating in the United States, disaggregated by state, from January 1st to December 31st, 2016.
Risk/benefit: The dataset makes it possible to measure the dynamics of pageviews that originated in the 50 U.S. states, by state. This is very valuable information for understanding spatial differences and how the Zika epidemic was perceived across the country. The risks associated with releasing the dataset are minimal. The total volume of pageviews is aggregated at the state level. There is no information about pageviews to specific pages in specific languages. There is no way to determine where a pageview originated beyond the state level.
Dataset 3
This dataset underlies the results of Figure 3 in Ref [1].
The dataset is stored in the home directory of mtizzoni on stat*.
Content of each line of the dataset: US_city, pageview_count_Zika, pageview_count_total
The dataset contains the total number of pageviews to 128 different Wikipedia pages related to the Zika virus (pageview_count_Zika) originating in 788 cities (US_city) of the United States with a population larger than 40,000 in 2016.
The dataset also contains the total number of pageviews to all Wikipedia pages (all Wikipedia projects, pageview_count_total) originating in the same 788 cities (US_city).
Risk/benefit: The dataset exposes the total number of pageviews related to the Zika epidemic at a relatively high spatial granularity, and it also exposes the total volume of pageviews to all Wikipedia pages at the same spatial resolution. This would be extremely valuable for the scientific community, as this type of data has never been made publicly available.
This information may be deemed sensitive by WMF, although there is no temporal information in this dataset (counts are totals for the year) and each city is fairly large (population above 40,000).
To minimize the security concerns related to this dataset, a number of options are available.
The simplest and most effective would be to limit the release to the largest cities only (population > 100,000). This would significantly reduce the risks associated with the spatial granularity while still allowing the results to be reproduced.
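As an illustration of the population-based option, the filter could look like the minimal Python sketch below. Dataset 3 as described has no population column, so the sketch assumes a separate city-to-population lookup; the function name `filter_large_cities`, the lookup, and the two sample rows are all hypothetical and are not taken from the real data.

```python
def filter_large_cities(rows, population, min_population=100_000):
    """Keep only rows whose city has a population above min_population.

    rows: list of dicts with the Dataset 3 fields
          (US_city, pageview_count_Zika, pageview_count_total).
    population: dict mapping city name -> population. This lookup is an
          assumption; it is not part of Dataset 3 as described.
    Cities missing from the lookup are dropped (treated as population 0).
    """
    return [
        row for row in rows
        if population.get(row["US_city"], 0) > min_population
    ]


# Made-up rows and populations, for illustration only.
rows = [
    {"US_city": "City A", "pageview_count_Zika": 1200, "pageview_count_total": 5000000},
    {"US_city": "City B", "pageview_count_Zika": 40, "pageview_count_total": 90000},
]
population = {"City A": 250000, "City B": 45000}

released = filter_large_cities(rows, population)
```

Dropping cities that are missing from the lookup is the conservative choice here: a city is only released when its population is known to be above the threshold.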
Research's review
We are comfortable with the release of the datasets (with one potential suggestion below). (Thank you to Isaac for reviewing the data.)
Reasons:
- It is historical data and not an ongoing data release so there is no chance of someone deliberately leading someone to visit a page in hopes of identifying them.
- All of the webrequests associated with this data have long since been deleted so no subpoena etc. could result in more data being released around them.
- The pageview data is aggregated over 128 articles and has the added benefit that IP geolocation and user vs. bot determination are inherently noisy processes.
- Lowest counts
- Dataset 1: no concerns -> lowest count for a day is above 1000 so very low privacy risk.
- Dataset 2: many of the daily counts from smaller states are single-digit. I think given the historical nature of the dataset and that it aggregates over 128 articles and an entire state, that this is still fine. If I'm not mistaken, we expose daily page view counts to single articles that are this small through public-facing tools so this is even less concerning than that (for example, July 2: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Orrtanna,_Pennsylvania).
- Dataset 3: the lowest count that is included in the dataset is for Pembroke Pines, FL, which has 2 pageviews to Zika articles (aggregated from 128 different articles) in 2016 and 4018 total pageviews (all articles) from 2016. If security/privacy is concerned about that given the geographic specificity, I would suggest changing any city data that is below 100 Zika pageviews (5 out of 715 cities) to 100 and making a note in the data release about this, but I do not see the necessity given the above points about the aggregated/historical nature of this data. In this case, 100 is an arbitrary cut-off as well but would have almost zero research impact and provides some extra guarantee of privacy.
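If the flooring suggested for Dataset 3 were adopted, it could be applied along the lines of the Python sketch below. This is only an illustration under assumptions: the function name `floor_low_counts` is hypothetical, rows are assumed to be dicts keyed by the Dataset 3 field names, and the second sample row is made up (the first uses the Pembroke Pines figures quoted above).

```python
def floor_low_counts(rows, threshold=100, column="pageview_count_Zika"):
    """Raise every value of `column` that is below `threshold` up to `threshold`.

    Returns (floored_rows, n_changed). The input rows are not modified,
    and n_changed can be reported in the note accompanying the release.
    """
    floored = []
    n_changed = 0
    for row in rows:
        new_row = dict(row)  # copy so the original row stays untouched
        if new_row[column] < threshold:
            new_row[column] = threshold
            n_changed += 1
        floored.append(new_row)
    return floored, n_changed


sample = [
    # Figures for Pembroke Pines, FL as quoted in the review above.
    {"US_city": "Pembroke Pines, FL", "pageview_count_Zika": 2, "pageview_count_total": 4018},
    # Made-up row, for illustration only.
    {"US_city": "City A", "pageview_count_Zika": 1200, "pageview_count_total": 5000000},
]

floored, n_changed = floor_low_counts(sample)
```

Only the Zika count is floored in this sketch; whether pageview_count_total should also be adjusted is a separate decision that the suggestion above does not address.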
References
[1] Tizzoni M, et al. (2018) The impact of news exposure on collective attention in the United States during the 2016 Zika epidemic. https://www.biorxiv.org/content/10.1101/346411v1