There have been quite a number of requests for a granular dataset of Wikipedia views per article per country (project-title-country). This ticket is to explore whether releasing this data with a differentially private approach might be possible. "Project" here refers to the Wikipedia in question, e.g. es.wikipedia.org.
Wikimedia already releases a lot of data, and granular counts per project are already available. In many instances project is a good proxy for language, and those counts are very granular (example: the Greenlandic Wikipedia). See these very granular counts, which probably originate in Greenland: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/kl.wikipedia/all-access/2019/10/01
Now, while a lot of information about users' viewing habits might be disclosed on small projects, the bulk of the value of the project-title-country data is actually in articles that exist in a number of projects (which means those projects are of a certain size) but that might have small viewership in a certain country at some point in time.
Example question: "when do the pageviews for covid start to become significant in Italy/San Marino in 2020?"
This dataset has many of the issues that are also outlined for the upcoming release of "top pageviews per country". See: T207171: Have a way to show the most popular pages per country
Namely: data for small countries or small (country, project) pairs discloses a lot of information. Example: users reading the Malay Wikipedia in San Marino.
In the case of T207171: Have a way to show the most popular pages per country, the mitigation will be to not release that data, as it is for the most part not useful for editors. In the case of the project-title-country dataset, the idea is to explore whether there exists a strategy that would allow us to safely release granular counts.
- It might be possible to remove the project dimension entirely and just work with article-country using wikidata as a bridge to translate article titles among different projects.
- Given that granular counts per project are released every hour, any strategy that adds noise on the project dimension is not effective, as that noise can be removed easily.
- The initial assumption is that a k-anonymity solution that would allow us to release the data safely would lose a lot of signal (but that assumption might be wrong). Some of the issues are described here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly/Sanitization
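To make the trade-off in the last bullet concrete, here is a minimal sketch contrasting a k-anonymity-style threshold with a Laplace-noise release of the kind usually used for differentially private counts. It is only an illustration, not a proposed implementation: the function names, the sample counts, the threshold `k` and the `epsilon` value are all made up, and the sensitivity of 1 assumes each user contributes at most one pageview to the whole dataset, which does not hold for real traffic without extra bounding of per-user contributions.

```python
import random


def k_anonymize(counts, k):
    """Suppress every (project, title, country) row whose count is below k."""
    return {key: n for key, n in counts.items() if n >= k}


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials.

    random.expovariate takes a rate, so rate = 1 / scale gives mean = scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_release(counts, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to every count.

    ASSUMPTION: sensitivity=1.0 means one user changes the whole dataset by
    at most one pageview. Rows with a true count of zero are not handled
    here; a real release would also have to deal with them.
    """
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return {key: n + laplace_noise(scale, rng) for key, n in counts.items()}


# Hypothetical counts, invented for illustration only.
example_counts = {
    ("es.wikipedia", "COVID-19", "IT"): 1200,
    ("es.wikipedia", "COVID-19", "SM"): 3,
    ("it.wikipedia", "COVID-19", "SM"): 40,
}

if __name__ == "__main__":
    # Thresholding drops the small-country rows entirely.
    print(k_anonymize(example_counts, k=100))
    # Noise keeps every row but perturbs each count.
    print(noisy_release(example_counts, epsilon=1.0, rng=random.Random(0)))
```

With a threshold of 100, both San Marino rows disappear, which is exactly the signal the example question above is after. The noisy release keeps those rows, but how useful they are depends on how the noise scale compares with counts that small.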
(Please note that @Nuria no longer holds an appointment at WMF and she will be working on this project with Research and Analytics as a volunteer)