
Assess the possibility of data release from a public health related research conducted by WMF and formal collaborators
Closed, Resolved, Public

Description

Contact person

Michele Tizzoni, Ph.D. Research Leader, ISI Foundation (michele.tizzoni@isi.it)
In WMF: @leila (please ping for sample data)

This task provides the relevant information to assess implications of a data release following the completion of the research project “Quantifying the global attention to public health threats through Wikipedia pageview data”.

The project has resulted in a scientific manuscript, which is currently under submission and has already been posted online in a public repository [1].
Following best practices in scientific publishing, we would like to release all the data underlying the main findings reported in [1] once the manuscript is accepted for publication in an international, peer-reviewed, Open Access scientific journal.
Overall, the main reasons for releasing the data are:
  • Reproducibility. Releasing the data into the public domain will allow other researchers to replicate the study.
  • Scientific impact. Releasing the data will increase the overall scientific impact of the data generated by the Wikimedia projects, especially in the field of public health. Wikimedia data are of high interest to researchers studying health-seeking behaviors, the diffusion of (mis)information, and global attention to health issues, and the release will foster new research on these topics.

The data whose release needs to be reviewed can be divided into three datasets, each corresponding to a figure in the manuscript [1].

Dataset 1

This dataset underlies the results of Figure 1 in Ref [1].
The dataset is stored in the home directory of mtizzoni on stat*.
Content of each line of the dataset: day, pageview_count
The dataset contains the daily pageview count, summed over 128 different Wikipedia pages related to the Zika virus, for pageviews originating in the United States from January 1 to December 31, 2016.

Risk/benefit: The dataset makes it possible to measure the dynamics of pageview counts originating in the United States only. The risks associated with releasing it are minimal: the total volume of pageviews is aggregated at the national scale, and there is no information about pageviews to specific pages in specific languages.

Dataset 2

This dataset underlies the results of Figure 2 and Table 1 in Ref [1].
The dataset is stored in the home directory of mtizzoni on stat*.
Content of each line of the dataset: day, pageview_count, state
The dataset contains the daily pageview count, summed over 128 different Wikipedia pages related to the Zika virus, for pageviews originating in the United States, disaggregated by state, from January 1 to December 31, 2016.

Risk/benefit: The dataset makes it possible to measure the dynamics of pageview counts originating in the 50 U.S. states, state by state. This is very valuable information for understanding spatial differences and how the Zika epidemic was perceived across the country. The risks associated with releasing the dataset are minimal: the total volume of pageviews is aggregated at the state level, there is no information about pageviews to specific pages in specific languages, and there is no way to determine where a pageview originated beyond the state level.
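
The aggregation described for Datasets 1 and 2 can be illustrated with a minimal sketch. It assumes a hypothetical per-page input layout of (day, state, page, views); the actual files on stat* may be organized differently, and the sample rows below are invented for illustration only.

```python
from collections import defaultdict

def aggregate_pageviews(rows, by_state=False):
    """Sum pageviews over all pages, per day (and per state if requested).

    rows: iterable of (day, state, page, views) tuples. This layout is an
    assumption made for the example, not the real schema.
    """
    totals = defaultdict(int)
    for day, state, _page, views in rows:
        key = (day, state) if by_state else day
        totals[key] += views  # the page title is discarded at this step
    return dict(totals)

# Invented sample rows, for illustration only.
rows = [
    ("2016-02-01", "FL", "Zika virus", 120),
    ("2016-02-01", "FL", "Zika fever", 30),
    ("2016-02-01", "TX", "Zika virus", 80),
    ("2016-02-02", "FL", "Zika virus", 95),
]
national = aggregate_pageviews(rows)                 # Dataset 1 shape: day, pageview_count
per_state = aggregate_pageviews(rows, by_state=True) # Dataset 2 shape: day, pageview_count, state
```

Because the page title is dropped before anything is written out, the released totals carry no per-page or per-language breakdown.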

Dataset 3

This dataset underlies the results of Figure 3 in Ref [1].
The dataset is stored in the home directory of mtizzoni on stat*.
Content of each line of the dataset: US_city, pageview_count_Zika, pageview_count_total
The dataset contains the total 2016 pageview count, summed over 128 different Wikipedia pages related to the Zika virus (pageview_count_Zika), originating in each of 788 U.S. cities (US_city) with a population larger than 40,000.
It also contains the total 2016 pageview count to all Wikipedia pages across all Wikipedia projects (pageview_count_total) originating in the same 788 cities.

Risk/benefit: The dataset exposes the total number of pageviews related to the Zika epidemic at a relatively high spatial granularity, and it also exposes the total volume of pageviews to all Wikipedia pages at the same spatial resolution. This would be extremely valuable for the scientific community, as this type of data has never been made publicly available.
This information may be deemed sensitive by WMF, although the dataset contains no temporal information (only yearly totals) and each city has a fairly large population (over 40,000).
A number of options are available to minimize the security concerns related to this dataset.
The simplest and most effective would be to limit the release to the largest cities only (population > 100,000). This would significantly reduce the risks associated with the spatial granularity while still allowing the results to be reproduced.
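
As a rough sketch of the population-based option, the filter below keeps only rows for cities above a chosen population threshold. The `population` lookup table and the placeholder city names and numbers are assumptions for illustration; Dataset 3 itself does not include a population column.

```python
def filter_by_population(city_rows, population, min_population=100_000):
    """Keep only rows for cities whose population exceeds min_population.

    city_rows: list of (US_city, pageview_count_Zika, pageview_count_total).
    population: dict mapping city name to population (hypothetical lookup).
    Cities missing from the lookup are dropped.
    """
    return [row for row in city_rows
            if population.get(row[0], 0) > min_population]

# Placeholder values, for illustration only.
city_rows = [("City A", 40, 500000), ("City B", 12, 90000)]
population = {"City A": 250_000, "City B": 45_000}
large = filter_by_population(city_rows, population)
```

Dropping cities that are missing from the lookup is the conservative choice here, since a city of unknown size cannot be shown to meet the threshold.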

Research's review
We are comfortable with the release of the datasets (with one potential suggestion below). (Thank you to Isaac for reviewing the data.)

Reasons:

  • It is historical data and not an ongoing data release so there is no chance of someone deliberately leading someone to visit a page in hopes of identifying them.
  • All of the webrequests associated with this data have long since been deleted so no subpoena etc. could result in more data being released around them.
  • The pageview data is aggregated over 128 articles and has the added benefit that IP geolocation and user vs. bot determination are inherently noisy processes.
  • Lowest counts
    • Dataset 1: no concerns -> lowest count for a day is above 1000 so very low privacy risk.
    • Dataset 2: many of the daily counts from smaller states are single-digit. I think given the historical nature of the dataset and that it aggregates over 128 articles and an entire state, that this is still fine. If I'm not mistaken, we expose daily page view counts to single articles that are this small through public-facing tools so this is even less concerning than that (for example, July 2: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Orrtanna,_Pennsylvania).
    • Dataset 3: the lowest count that is included in the dataset is for Pembroke Pines, FL, which has 2 pageviews to Zika articles (aggregated from 128 different articles) in 2016 and 4018 total pageviews (all articles) from 2016. If security/privacy is concerned about that given the geographic specificity, I would suggest changing any city data that is below 100 Zika pageviews (5 out of 715 cities) to 100 and making a note in the data release about this, but I do not see the necessity given the above points about the aggregated/historical nature of this data. In this case, 100 is an arbitrary cut-off as well but would have almost zero research impact and provides some extra guarantee of privacy.
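
The suggested floor for Dataset 3 could be applied along these lines. This is only a sketch of the idea, not a procedure that was agreed on in this task: the function name is hypothetical, the second sample row is invented, and the threshold of 100 is the arbitrary cut-off mentioned above.

```python
def apply_count_floor(city_rows, floor=100):
    """Raise any Zika pageview count below `floor` to `floor`.

    city_rows: list of (US_city, pageview_count_Zika, pageview_count_total).
    Returns the adjusted rows and the number of rows that were changed,
    so the change can be documented in the data release notes.
    """
    floored = []
    n_changed = 0
    for city, zika, total in city_rows:
        if zika < floor:
            zika = floor
            n_changed += 1
        floored.append((city, zika, total))
    return floored, n_changed

city_rows = [
    ("Pembroke Pines, FL", 2, 4018),  # lowest count cited in the review
    ("City B", 350, 90000),           # invented row for illustration
]
floored, n_changed = apply_count_floor(city_rows)
```

Only the Zika count is adjusted; the total pageview count is left untouched, and the returned count of changed rows supports the note that would accompany the release.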

References
[1] Tizzoni M, et al. (2018). The impact of news exposure on collective attention in the United States during the 2016 Zika epidemic. https://www.biorxiv.org/content/10.1101/346411v1

Event Timeline

Jcross triaged this task as Medium priority. Oct 15 2019, 5:38 PM

@JFishback_WMF can you give us a sense for when this release can be assessed? Also, any feedback you may have so far is greatly appreciated.

Thanks @leila ! I am about to re-submit the related manuscript after the first round of revisions, and I expect the Editor to ask me about data availability. Any feedback would be very helpful at this stage.

@leila and @Michele.tizzoni I'll try to take a look at this in the next couple of days, but it might be early this coming week.

Hello @JFishback_WMF! Apologies for bothering you about this. Did you have time to take a look at it? Thanks so much!

Hey @Michele.tizzoni this got pushed out due to some competing priorities, but it's on my list for this coming week.

all: I have removed Research. I will remain on this task in case my help is needed anywhere.

Hello @Michele.tizzoni - I've been reviewing this task and had a question. Is the list of 128 articles public already? Is the plan to make it public? Or will readers just know that a given page view was to one of 128+ "Zika-related" pages without knowing which specific pages those might be?

Hello! The list of the 128 pages is not public yet. It appears as a Table in the Supplementary Information of the manuscript and we plan to make it public upon publication (the paper is under review). However, we don't want to make public the pageview count disaggregated by page but only the totals. Therefore, readers will just know that the pageview time series represents the sum of total pageview counts of the 128 pages without knowing the proportion by page.

Hello and happy new year! I would just like to inform you that the manuscript associated with the data has been accepted for publication in PLOS Computational Biology. I received the notification yesterday.
Prior to publication, we should make a decision about the data release.
Thank you for your help and support.

Hello @Michele.tizzoni - I've almost completed the risk analysis, except for Dataset 3. I'd like to look at the data but I can't seem to find it - do you know specifically which stat host it resides on?

@Michele.tizzoni congratulations. That's great news! :)

@JFishback_WMF: following up on our IRC conversation, I sent you an email about Dataset 3. Let me know if something is missing. Michele: no action is needed on your end for now unless you hear from JFishback_WMF or me. Thanks!

I completed the Privacy Risk Assessment with the following results:

Risk Assessment
Initial Risk: Low
Mitigations: Aggregation
Residual Risk: Low

The Wikimedia Foundation has developed a process for reviewing datasets prior to release in order to determine a privacy risk level, appropriate mitigations, and a residual risk level. WMF takes privacy very seriously, and seeks to be as transparent as possible while still respecting the privacy of our readers and editors.

Our Privacy Risk Review process first documents the anticipated benefits of releasing a dataset. Because we feel transparency is so crucial to free information, generally WMF takes a release-by-default approach - that is, release unless there is a compelling reason not to. Often, however, there are additional reasons for releasing a particular dataset, such as supporting research. We want to capture those reasons and account for them.

Second, WMF identifies populations that might possibly be impacted by the release of a dataset. We also specifically identify potential impacts to particularly vulnerable populations, such as political dissidents, ethnic minorities, religious minorities, etc.

Next, we catalog potential threat actors, such as organized crime, data aggregators, or other malicious actors that might potentially seek to violate a user’s privacy. We work to identify the potential motivations of these actors and populations they may target.

Finally, we analyze the Opportunity, Ease, and Probability of action by a threat actor against a potential target, along with the Magnitude of privacy harm to arrive at an initial risk score. Once we have identified our initial risks, we develop a mitigation strategy to minimize the risks we can, resulting in a residual (or post-mitigation) risk level.

WMF does not publish this information because we do not want to motivate threat actors or give them additional ideas for potential abuse of data. Unlike a security vulnerability in code, which can be patched after it is published, a publicly released dataset cannot be “patched” - it has already been made public.

Any dataset that contains this notice has been reviewed using this process.

@JFishback_WMF thank you very much for your thorough analysis.

@Michele.tizzoni I approve the release of the three datasets based on Security's assessment. Please take the next steps and, once the data is released, add a link on this task to where it is published. (Check with Miriam if you need help with where to release the data, etc.) I'll resolve this task now.

We will meet with @JFishback_WMF and let you know of next steps.