
Evaluate a differentially private solution to release Wikipedia's project-title-country data
Closed, ResolvedPublic

Description

There have been quite a number of requests for a more granular dataset of Wikipedia's pageview data: views per article per country (project-title-country). This ticket is to explore whether releasing this data with a differentially private approach might be possible. Project here refers to the Wikipedia in question, e.g. es.wikipedia.org.

Wikimedia already releases a lot of data; granular counts per project are available today. In many instances project is a good proxy for language, and those counts are very granular (example: Greenlandic (?) Wikipedia). See these very granular counts that probably originate in Greenland: https://wikimedia.org/api/rest_v1/metrics/pageviews/top/kl.wikipedia/all-access/2019/10/01

Now, while a lot of information about users' viewing habits might be disclosed on small projects, the bulk of the value of the project-title-country data is actually in articles that exist in a number of projects (which implies those projects are of a certain size) but that might have small viewership in a certain country at some point in time.

Example question: "when do the pageviews for covid start getting to be significant in Italy/San Marino in 2020?"

This dataset has many of the issues that are also outlined in the upcoming release of "top pageviews per country". See: T207171: Have a way to show the most popular pages per country

Namely: data for small countries or small (country, project) pairs discloses a lot of info. Example: users reading Malaysian Wikipedia in San Marino.
In the case of T207171: Have a way to show the most popular pages per country the mitigation will be to not release that data, as it is not useful for editors for the most part. In the case of the project-title-country dataset the idea is to explore whether there exists a strategy that would allow us to safely release granular counts.

Some thoughts:

  • It might be possible to remove the project dimension entirely and just work with article-country, using Wikidata as a bridge to translate article titles among different projects.
  • Given that granular counts per project are released every hour, any strategy that adds noise on the project dimension is not effective, as that noise can easily be removed.

Of interest:
https://meta.wikimedia.org/wiki/List_of_Wikipedias

(Please note that @Nuria no longer holds an appointment at WMF and she will be working on this project with Research and Analytics as a volunteer)

Event Timeline

Hi all, I'm one of the folks behind https://github.com/google/differential-privacy. We'd love to help you with your use case =)

A few questions:

  • what is the schema of the original data you can use as input? Do you have identifiers of some kind?
  • what is the schema of the output data you want to generate? Do I understand correctly that you want to count views per <project,title,country,time>, where time is some temporal granularity?
  • would it be an option for you to vary the granularity of the data depending on quality (e.g. report daily stats when there is enough data, and weekly/monthly otherwise)?

@TedTed Super thanks for chiming in

what is the schema of the original data you can use as input?

Let me know if this one is clear enough:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly

what is the schema of the output data you want to generate? Do I understand correctly that you want to count views per <project,title,country,time>, where time is some temporal granularity?

Yes, that is correct

I think even something like <title, country, time> would work, because Wikidata could provide an identifier that would work for, say, the Covid-19 article in all the different languages in which the article exists.
@Isaac can chime in if this is fundamentally incorrect

would it be an option for you to vary the granularity of the data depending on quality (e.g. report daily stats when there is enough data, and weekly/monthly otherwise)?

I think this would work, yes.

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly

This is super helpful, thank you. I see there is a view_count field there. Do I understand correctly that it means that the data was already aggregated? Say that field has value 5, does it mean that the page had 5 different views, potentially from 5 different users (but all with the same country, user_agent_map, etc.)? If so, what is the typical distribution of values for view_count?

It looks like there are no user identifiers in the data. This makes sense, but it means that you can't easily use DP and get formal guarantees on the contributions of a single user. At a high level, DP hides the presence or absence of something in a dataset, and we call that "something" a privacy unit. In the academic literature, the privacy unit is usually "an individual", or "the contributions of an individual during a fixed period of time", but doing that in practice requires a user ID. There are two alternatives:

  • We can use other fields to "approximate" a user ID. For that, we need fields such that if two records have different values for these fields, then they probably come from different users. In your dataset, this could be e.g. user_agent_map+city. The problem is that if many people share the same values of these fields, this can severely hurt utility.
  • We can use a simpler privacy ID, like a single view. In that case, you can have strong guarantees that you're successfully hiding the existence of a single view. However, users who contribute many, many views have weaker formal privacy guarantees. Their individual views are still hidden, but their total contribution might not be.

That said, I want to emphasize that both options are markedly better than using k-anonymity or some other syntactic definition: the "choice of privacy unit" problem also arises with k-anonymity, and it doesn't provide you with any formal guarantee.

If we use the "simpler privacy ID" option, you can compensate for the weaker guarantee by choosing a smaller privacy parameter. For example, you could look at identified data to find out information like "99% of Wikipedia users are associated with fewer than 5 pageviews per day". Then, if you want to have a parameter ε=1 for 99% of users, with a privacy unit of user-day, you can simply use a smaller epsilon (1/5 = 0.2) for each view. (Depending on the numbers, we might optimize this further, but that's the general idea.)
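A minimal sketch of that budget-splitting arithmetic, using the hypothetical "fewer than 5 pageviews per day" figure from the paragraph above (the numbers are illustrative, not a recommendation):

```python
# Tiny illustration of splitting a user-day privacy budget across views.
target_user_day_epsilon = 1.0   # desired guarantee for a user-day
max_views_per_day = 5           # assumed bound covering e.g. 99% of users

# By basic sequential composition, a user who contributes k views, each
# protected with epsilon_per_view, is covered by k * epsilon_per_view overall.
epsilon_per_view = target_user_day_epsilon / max_views_per_day
print(epsilon_per_view)  # 0.2, as in the example above
```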

The above are all about policy questions — what anonymization strategy you want to adopt. For more practical concerns, it seems that you're using Hadoop with Spark. That's good news, because our main open-source end-to-end tool for large-scale differential privacy pipelines is built on top of Apache Beam, which should work with Spark. (Although full disclosure: we haven't actually tested it on Spark yet.)

Hope I'm not drowning folks in technical jargon — I'm happy to set up a call at some point if you think that'd be helpful.

(Obvious disclaimer: all opinions/advice contained there and in other messages in this thread are mine and don't necessarily represent my employers' view.)

Say that field has value 5, does it mean that the page had 5 different views, potentially from 5 different users (but all with the same country, user_agent_map, etc.)?

Yes, exactly, same country, same (broadly) user agent and same article.

It looks like there are no user identifiers in the data. This makes sense, but it means that you can't easily use DP and get formal guarantees on the contributions of a single user.

Right, this is why, in the Wikimedia ecosystem, where we shy away from unique identifiers (see: https://diff.wikimedia.org/2016/03/30/unique-devices-dataset/), it has been hard to use open-source solutions for DP that assume a single contribution from every user, or even just a way to identify users.

In your dataset, this could be e.g. user_agent_map+city. The problem is that if many people share the same values of these fields, this can severely hurt utility.

For the utility of this dataset, *I think* this might not be the best option, because while there are definitely unique-enough identifiers for *some* of the data, for the majority of the data that is actually not the case.
We did a study about identity reconstruction with this data a while back: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly/Identity_reconstruction_analysis

I will run some new numbers to further quantify this, but measures such as https://9to5google.com/2020/01/14/google-deprecate-chrome-user-agent-string-privacy/ make me think that the "simpler privacy ID" might be a better option.

It seems that the next steps are to do some tryouts exploring the idea of "a single view" as the privacy guarantee and to try the end-to-end tool for DP (thanks for releasing that!). I might not get to this for a month due to personal reasons, but it is on my radar.

Super thanks for elaborating, EVERYONE on this ticket is loving the technical jargon.

In your dataset, this could be e.g. user_agent_map+city. The problem is that if many people share the same values of these fields, this can severely hurt utility.

Open question: Would using a fingerprint user-identifier based on user-agent and IP as the privacy unit work? I can't picture what would be the downside of such an approach.

It seems that the next steps are to do some tryouts exploring the idea of "a single view" as the privacy guarantee and to try the end-to-end tool for DP (thanks for releasing that!). I might not get to this for a month due to personal reasons, but it is on my radar.

Sounds good. As you said, the tool kind of assumes that there are user IDs and that the data isn't pre-aggregated, but you can get around this technical limitation with some pre-processing. Namely:

  • Put the original data in a PCollection
  • With a ParDo function, for each record in the PCollection where view_count=x, emit x pairs <ID_x, <article,country>>, where the ID_x are randomly generated "identifiers" (with sufficiently random bits that there aren't collisions between records)
  • Transform this into a PrivatePCollection, using the ID_x as privacy unit.
  • Use pbeam.Count to generate the differentially private statistics, with MaxPartitionsContributed=1 and MaxValue=1 (each record appears in only one partition and increases the count by at most 1).

This isn't super optimized, but should work.
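For illustration, the expansion step could look like the following plain-Python sketch (hypothetical field names, and not the actual Beam/pbeam API; in the real pipeline this would be a ParDo over a PCollection, followed by the differentially private Count):

```python
# Turn one pre-aggregated pageview_hourly row into view_count single-view
# records, each tagged with a freshly generated random ID used as privacy unit.
import uuid

def expand_row(row):
    """Emit view_count pairs <random ID, <article, country>> for one row."""
    for _ in range(row["view_count"]):
        # uuid4 has enough random bits that collisions between records
        # are effectively impossible.
        yield (str(uuid.uuid4()), (row["page_title"], row["country"]))

aggregated = [
    {"page_title": "COVID-19_pandemic", "country": "IT", "view_count": 3},
    {"page_title": "COVID-19_pandemic", "country": "SM", "view_count": 1},
]

records = [pair for row in aggregated for pair in expand_row(row)]
# Each record now appears in exactly one (article, country) partition and adds
# at most 1 to its count, which is what MaxPartitionsContributed=1 and
# MaxValue=1 encode in the private Count step.
print(len(records))  # 4
```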

Open question: Would using a fingerprint user-identifier based on user-agent and IP as the privacy unit work? I can't picture what would be the downside of such an approach.

My understanding was that there aren't any IP addresses in Pageview_hourly. If there were, that would be an option. This would provide stronger guarantees than using individual views as the privacy unit, although of course a single user can change IP addresses over time, or even over short periods when using mobile data.

Open question: Would using a fingerprint user-identifier based on user-agent and IP as the privacy unit work? I can't picture what would be the downside of such an approach.

My understanding was that there aren't any IP addresses in Pageview_hourly. If there were, that would be an option. This would provide stronger guarantees than using individual views as the privacy unit, although of course a single user can change IP addresses over time, or even over short periods when using mobile data.

We have IPs in a temporary dataset, called pageview_actor, that feeds into pageview_hourly, so that's where we'd get the fingerprint Joseph is talking about. We could insert two steps in between these datasets, like:

  • 1. synthetic or real fingerprints
    • for old data, retrofit a synthetic fingerprint as best as we can, along the lines of @TedTed's advice above
    • for new data, use a real fingerprint (user_agent, IP, ...)
  • 2. apply DP using a simpler version of the process described in T267283#6608103 (basically, because ID_x would be provided by step 1. above)

If we do it this way, pageview_hourly would have different guarantees and utility for historical vs. new data, but that feels like a tradeoff we can live with. And we can adjust how we retrofit that synthetic fingerprint to see how close we can get. Either way, as @TedTed points out, whatever approach we take (except to leave the data as is) would limit the utility of the historical pageview_hourly data even more.

We have IPs in a temporary dataset, called pageview_actor, that feeds into pageview_hourly, so that's where we'd get the fingerprint Joseph is talking about. We could insert two steps in between these datasets,

There are a few reasons why I think that is far from a solution we want to pursue, and why I said the schema of the original data should be pageview_hourly (long retention) rather than webrequest (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest, 90-day retention).

Let me try to explain:

  • It would prevent us from releasing any past data, which is the ultimate goal: being able to release 5 years of data. What @Milimetric calls a synthetic fingerprint just does not work. From my prior work on this, which, granted, was a couple of years back, the long-retained data is just too "coarse" to compute signatures, and that is a good thing, because otherwise we would fundamentally be going against the Wikimedia data retention guidelines, which explicitly prohibit keeping browsing sessions for longer than 90 days. This is quite an important point. Please see: https://meta.wikimedia.org/wiki/Data_retention_guidelines#How_long_do_we_retain_non-public_data?

And more importantly:

  • Data releases could not be rerun after 90 days, as signatures can only be computed for 90 days after data intake and are discarded afterwards. This is, I think, a strong downside.
  • And a minor but still significant point: actor signatures represent pseudo-sessions, not users, so it would be questionable to claim privacy guarantees based on a "user" entity that does not exist in any of Wikimedia's pageview datasets.

With this in mind, I think exploring alternatives that are less obvious, such as centering our privacy guarantees around views rather than fingerprinting, seems much more in line with Wikimedia's philosophy around privacy and also potentially might allow the release of data from years back with the ability to do re-runs.

Thanks all for this fascinating (and hopefully productive) conversation!

With this in mind, I think exploring alternatives that are less obvious, such as centering our privacy guarantees around views rather than fingerprinting, seems much more in line with Wikimedia's philosophy around privacy and also potentially might allow the release of data from years back with the ability to do re-runs.

I lean towards agreeing with this but am not fully decided:

  • I agree for both the reasons mentioned by @Nuria and others: while using the fingerprints in theory allows for more fine-grained but still privacy-preserving data, it introduces a fair bit of extra assumptions around the stability of these fingerprints that aren't particularly well understood and vary across geographies and time. I'd rather see us start with the easiest-to-understand assumptions about the data and then debate what the right epsilon value is to @TedTed's point about average pageviews per user-day. I think it'll be challenging enough to choose the right value without also having to debate how good our fingerprint is. Maybe my mind will change when we see the difference between a system implemented on fingerprints and one implemented on views, but I probably default to views as well.
  • On the other hand, we do have some very big outliers and not just because of shared IP but users who actually view hundreds of pages per day (some of which are easily identifiable because of edit logs). Maybe we can't really design a system that is flexible to them but we should certainly remember to evaluate whatever comes out of this with them in mind. Being able to have different policies depending on whether a user is logged-in or not (i.e. more privacy when logged-in) would be another possible solution but I suspect that this is out of reach because logged_in isn't currently part of pageview_hourly and shouldn't be stored permanently there (in my opinion).

Other thoughts:

  • Is there the ability to have the privacy threshold evolve over time? For example, use a tighter threshold in the days/weeks after the data is produced and slowly back that off so that, e.g., a year later, the amount of randomness introduced is less and the dataset has more research value. The thinking being that the privacy risks will decrease over time, though maybe that's not a legitimate assumption to make.
  • I would love to get an early understanding of what this data would look like so I'm doing more than guessing at the impact of different choices. For instance, how close is a differentially-private top-100-most-viewed articles list to the raw data by country (T207171)?
  • How do we choose a starting epsilon value? It sounds like it's easy to interpret changes to epsilon (e.g., 0.2 for a user with 5 pageviews is the same privacy level as 1 for a user with a single pageview) but it's not clear to me what an epsilon of 1 means to begin with.

potentially might allow the release of data from years back with the ability to do re-runs

It seems worthwhile to point out that when using differential privacy, every release you do counts towards the total privacy parameter ε (and δ, if applicable), making the privacy guarantees weaker as you release more data. To be clear, this might sound scary (as the guarantees provided by ε weaken exponentially fast as ε increases), but this is still significantly better than other privacy definitions that do not provide any guarantee whatsoever when doing multiple releases.

Is there the ability to have the privacy threshold evolve over time?

With differential privacy, there's a threshold only if you don't have an a priori list of all possible partitions. In your case, I think you do have such a list, since the list of countries and articles is public information. So you wouldn't need a threshold at all to obtain strong privacy guarantees, only noise. This doesn't prevent you, of course, from adding a threshold for quality reasons (if you add noise with σ=10, you might decide to drop all partitions whose count is lower than 50, because the noise makes these statistics too unreliable), but that doesn't have an impact on the privacy guarantees.

It is possible to come up with a scheme that reduces the privacy guarantees over time. For example, you could make an initial release differentially private using Gaussian noise, then recompute the exact same data after a certain time has passed, and publish the average between the two releases. Since the average of two Gaussian distributions is a Gaussian distribution, it would be relatively straightforward to calibrate the privacy guarantees of such a scheme.
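A toy numpy sketch of that two-release averaging idea, with made-up noise scales (the actual privacy calibration would come from the Gaussian mechanism's (ε, δ) accounting):

```python
# Averaging two Gaussian releases of the same count: the combined estimate is
# more accurate, which corresponds to the intended weakening of guarantees.
import numpy as np

rng = np.random.default_rng(42)
true_count = 1000
sigma = 40.0  # standard deviation of the Gaussian noise in each release

release_1 = true_count + rng.normal(0, sigma)   # initial release
release_2 = true_count + rng.normal(0, sigma)   # same data, recomputed later

# The average of two independent N(0, sigma^2) noise draws is N(0, sigma^2/2).
combined = (release_1 + release_2) / 2
print(release_1, release_2, combined)
```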

I would love to get an early understanding of what this data would look like so I'm doing more than guessing at the impact of different choices. For instance, how close is a differentially-private top-100-most-viewed articles list to the raw data by country (T207171)?

This is a good question; we have a tool to quickly visualize the impact of differential privacy on data quality, but it's not open-source nor available externally (yet). To give some intuition, if the privacy unit is a single view, the standard deviation of the noise using the Laplace mechanism (a standard choice) is sqrt(2)/ε, and the 95% confidence interval is [ln(0.05)/ε, -ln(0.05)/ε]. Say you choose a small ε of 0.1; this confidence interval is approximately [-30,30]. If you consider a 3% relative error at 95% confidence level acceptable, this means that, if I understand this analysis correctly, you could publish the full top 100 for Belgium pretty accurately.
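Those accuracy figures can be reproduced directly from the Laplace distribution (sensitivity 1, i.e. a single view as the privacy unit):

```python
# Laplace mechanism accuracy for sensitivity 1 and epsilon = 0.1.
import math

eps = 0.1
scale = 1 / eps                  # Laplace scale b = sensitivity / epsilon
stddev = math.sqrt(2) / eps      # standard deviation of Laplace noise = sqrt(2) * b
ci_95 = -math.log(0.05) / eps    # 95% of noise draws fall within +/- this value

print(round(stddev, 1))          # ~14.1
print(round(ci_95, 1))           # ~30.0, i.e. roughly the [-30, 30] interval above
```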

How do we choose a starting epsilon value? It sounds like it's easy to interpret changes to epsilon (e.g., 0.2 for a user with 5 pageviews is the same privacy level as 1 for a user with a single pageview) but it's not clear to me what an epsilon of 1 means to begin with.

I'll try to stay away from pushing you towards one particular policy decision or another — it's not my place, since I haven't been involved in Wikimedia projects so far, and I don't know about the context/precedent on your privacy posture. What I can give you is a link to a (hopefully) intuitive explanation of the meaning of ε, and to examples of ε/δ values in publicized data releases using central differential privacy.

(Edited: this comment used to contain a list of DP releases and their ε values, this got replaced by a link to a more detailed and up-to-date list on my blog.)

Thanks @TedTed for all these pointers. On my end, I need to digest all this info before I can get back to you; others here might have more questions.

I have issues picturing how ε/δ changes will affect us :)
If feasible, I'd love to see some tests and data analysis with various values of those, as well as with different privacy-units (single-view and user-fingerprint at least).

fdans triaged this task as Medium priority. Nov 9 2020, 4:34 PM
fdans moved this task from Incoming to Datasets on the Analytics board.

@JAllemandou Given that user fingerprinting on pageview_hourly data is not effective (and if it were to be, it would be a problem), I *think* I am going to center my efforts - when, ahem, I can get to this - on other privacy 'units'.

@JAllemandou Given that user fingerprinting on pageview_hourly data is not effective (and if it were to be, it would be a problem), I *think* I am going to center my efforts - when, ahem, I can get to this - on other privacy 'units'.

Ack! Thanks for letting me know and for continuing to work with us @Nuria :)

Aklapper subscribed.

Resetting assignee (inactive account)

@Aklapper I assigned this to myself again after my account was re-activated

Hey all -- happy new year!! I had differential privacy on the mind and my feeling is that we were stalled on how to choose the parameters that determine how much "privacy" is being assured. So I made an attempt to set up a simple interface to help us see the impact of different choices of parameters for differential privacy on potential result lists. I know that we should make our decisions about privacy separate from how it affects the perceived utility of the results but I found this really useful for thinking about the different approaches / parameters and what they mean. I mostly based the tool and my corresponding thoughts on the links that @TedTed shared to differential privacy blogposts, Facebook's mobility data, and Google's search trends data (thank you!).

You can play with the tool here [1]: https://diff-privacy.toolforge.org/

Details

  • Data: because the project-title-country data that we want to release under a differential privacy framework isn't public, I used the public daily top-viewed-articles-by-language data as a proxy [2].
  • Differential Privacy Approach: my tool uses the Laplace mechanism where, e.g., if the actual data is 100 pageviews for a given page, then you'd instead release 100 + noise pageviews, where noise is a number drawn from the Laplace distribution (with scale based on the parameters below) [3]. Our data is a pretty standard use-case for differential privacy so I don't think it requires anything particularly fancy [4]. In the tool, I show both the actual count (e.g., 100) and the count after adding the noise so we can begin to develop an intuition for how different privacy parameters affect the results (a minimal sketch of this noise step appears after this list).
  • Privacy Parameters (the things we control and that affect how "private" the data is -- the tool lets you experiment with these):
    • privacy unit: what are we seeking to keep private? My default assumption is that we're going for a user-day -- i.e. someone shouldn't be able to figure out whether a given user is in a daily release of project-country-title. This is a pretty conservative approach that also guarantees that you couldn't figure out whether a given user viewed a given title, and so I think it is where we should start [5].
    • sensitivity: quoting from the excellent Dwork and Roth (2014): "[sensitivity] captures the magnitude by which a single individual’s data can change the function in the worst case, and therefore, intuitively, the uncertainty in the response that we must introduce in order to hide the participation of a single individual." In our case, how many pageviews can a single person contribute to the data release? As our prior discussion has made clear, this is the central challenge for Wikimedia even though it's quite easy for pretty much every other platform. While platforms like Google or Facebook can reasonably say and track that one phone = one person or one account = one person, we purposefully don't have consistent identifiers for who generated what data and instead use a concatenation of a device's user-agent and IP address for this purpose. This is great because we have the built-in privacy of ourselves not being able to easily link pageview data back to individuals but it makes it really hard to guarantee that a given user/person will only contribute at most sensitivity pageviews to our data. However, for differential privacy to work, we have to cap the total number of data points that a single individual can contribute across all the pages in a given daily release of data. I see two potential options but there's no good way around having to rely on our (flawed) UA+IP fingerprints to implement this upper bound on contribution if we want user-level privacy [6]:
      • Simple but noisy: if we only report pageviews from the user agent_type (no spiders / automated), then a single UA+IP can contribute up to 800 pageviews per day. Thus, without any additional processing, we can set the sensitivity to 800. This is kinda unfortunate, though, because it is quite rare that a single individual actually views 800 pages in a day, yet such a big sensitivity introduces a lot of noise into the results, especially when pageview counts are in the low 100s.
      • Complex but less noisy: we do some additional processing to cap contributions by a given UA+IP. We probably limit the number of pageviews a UA+IP can contribute to a given page to 1 and then choose a max number of pageviews they can contribute (e.g., 10). If they view more than 10 articles in a day, we randomly select 10 and discard their other pageviews. This is a pretty common tactic (example from Google) and allows us to then set our sensitivity to e.g., 10 and inject far less noise into the data. In practice, we'd probably choose the max so that it covers e.g., 99% of readers or something along those lines.
    • epsilon: this is a judgment call. The default in the field seems to be setting this to 1, which corresponds to the following (based on this): if someone is unsure of whether an individual viewed a page in the dataset (i.e. they think 50% chance they did), they would be at most 73% sure the individual did or did not view the page after seeing the data we release (assuming we get the sensitivity correct). Lowering epsilon increases the privacy (by increasing the noise added). The tool I built will tell you what interpretation there is of the epsilon value provided when you hit Submit (code here).
  • Data Quality Parameters: once we've set the privacy parameters, we still might want to do things to reduce the likelihood of releasing highly unreliable data into the world. While we largely can't prevent the odd outlier that claims a page viewed once per day spiked to 10,000 pageviews, we can ensure that most of the data that we release is more signal than noise. I found two approaches for this:
    • Threshold based on actual data / top-k approaches: this actually isn't a great approach because if you e.g., say we'll only release data for pages that receive at least 200 pageviews, then the fact that a page is missing from the dataset could reveal a lot of information. We could consider only releasing data for certain Wikipedia projects or e.g., only the top-1000 articles per project, but to me this kinda defeats the purpose of using differential privacy to try to release this data that previously we couldn't guarantee privacy for.
    • Threshold after noise is added: this is much better from a privacy perspective (no privacy cost to doing this). You still get weird outliers because you don't distinguish between pageview counts that are high because many people viewed the article and pageview counts that are high because a lot of positive noise was added. But it provides some better guarantees that the resulting dataset is mostly useful information (otherwise for rare country-language pairs, you're largely just releasing a list of random numbers). I apply a flexible threshold in the tool based on an approach from Google where you calculate the likelihood that the real datapoint is within X% of the noisy data point and threshold based on that. So in the tool, noisy data points that are calculated to have less than a 50% chance of being within 25% of the actual value are greyed out. The 50%/25% parameters can be adjusted but are a reasonable starting place.
    • Aggregate temporally: this builds on the thresholding post-noise above but instead of removing the data point, we make a decision to withhold the full dataset for a country-language pair for a week or a month and then release a dataset based on that longer time period. This has no privacy cost and waiting longer means higher pageview counts which means better data. In practice, you probably determine daily vs. weekly vs. monthly based on historical data and maybe update that choice annually. Certain language+country pairs probably never reach useful thresholds even after a year -- e.g., San Marino + Quechua -- and for these we can also just exclude them (again, reevaluating annually).
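A minimal sketch of the noise-adding step and the epsilon interpretation from the bullets above (the parameter values are just the examples used in this comment, and numpy's Laplace sampler stands in for a production-grade secure noise generator):

```python
# Add Laplace noise to a count, then round and floor at 0 as in the demo tool.
import math
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count, sensitivity, epsilon):
    """Release true_count + Laplace(sensitivity/epsilon) noise, rounded to an
    integer and floored at 0 (see footnote [3] below)."""
    return max(0, round(true_count + rng.laplace(0, sensitivity / epsilon)))

print(noisy_count(100, sensitivity=5, epsilon=1))

# Interpretation of epsilon: a 50% prior that someone is in the data can move
# to at most exp(eps) / (1 + exp(eps)) posterior confidence after the release.
eps = 1.0
print(round(math.exp(eps) / (1 + math.exp(eps)), 2))  # 0.73, i.e. the 73% above
```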

My Proposal

In order to have a good balance between privacy and data quality, I think we need to cap pageviews per UA+IP to something like 5. I actually like that this limit is low because it should also reduce the outsized influence of power readers so the data better represents the broader reader population. To me, an epsilon value of 1 would be acceptable (based on what I've read and what other orgs have done as highlighted in T267283#6610013). However, to account for our difficulty with actually capping user contributions to the dataset, I think we lower our epsilon value to something like 0.5 (essentially saying we aim to achieve epsilon=1 differential privacy even for people with 2 UA+IPs per day). And because this choice (sensitivity=5; epsilon=0.5) means that a page has to be getting ~30 pageviews in a country to really have good data, we probably want to allow for either the daily or weekly/monthly cadence depending on the language+country.

Footnotes

[1] You can also set the form fields via URL parameters if you want to share an example (though keep in mind the noise is random so rerunning the same query will give you different results and the underlying data is from "yesterday's" pageview data, so it will also change daily): https://diff-privacy.toolforge.org/?lang=es&eps=0.5&sensitivity=800#diff-privacy
[2] I show the top-50 articles in the tool to give a sense of how the data changes with different parameterizations of differential privacy. While I think the goal is to not just release e.g., the top-1000-viewed articles per project-country but all article view data, choosing projects like Ewe Wikipedia gives very low pageview counts so you can see how different approaches would affect the long-tail of articles.
[3] I also round the noisy pageview results to an integer to make it more "realistic" and set a lower bound of 0. This is just a single line of code. This post-processing should add a small amount of additional noise too and is okay per this. For statistical reasons, however, I maybe shouldn't be setting the floor to 0 (see Warning 2 on page 2 of a Facebook data release explaining this). There are also concerns with floating point errors / lack of randomness in differential privacy, but rounding to integers likely solves much of this.
[4] My understanding is we're talking about global differential privacy (Wikimedia first collects the complete dataset and then applies DP), data from a given day will only be released a single time (so we don't have to worry about how many times someone might requery the database and track that), and we're dealing with pretty simple count data that is largely independent -- i.e. it's relatively fair to assume that knowing that a user viewed one page doesn't tell you much about the other pages they might have viewed. This final assumption of independence isn't fully true (people tend to follow links) but I think it's a fair assumption and the vast majority of people do read a single article and then leave the site during a reading session.
[5] We could use a different privacy unit. Specifically, we could follow Facebook's example here and seek to provide privacy guarantees at the project-country-title level but not the project-country level for each user -- i.e. guarantee that the question of whether a user viewed a given page in the dataset is unanswerable but allow that it would be possible to determine if a user who viewed many many pages in a given day was present (though again not which pages they viewed). This seems to be slightly more complicated -- it uses Gaussian noise instead of the Laplace mechanism and we'd have to decide an additional parameter to capture how strong of a protection we want to provide for heavy users. The benefit is that we can likely add far less noise to each individual project-country-title datapoint and still achieve very good protection. Facebook went with it for URL click data and essentially provided excellent privacy guarantees to all but the top 1% most-active users while keeping the data pretty accurate. And again, for those top 1% of Facebook users, it wasn't possible to tell whether they clicked on a given URL, just whether or not they were included in the data at all. Caring about that difference might seem silly but e.g., for us it would be the difference between someone being able to tell what country they were being geolocated to or whether they were being identified as a bot/automated (and filtered from the dataset).
[6] I think our UA+IP fingerprint is a reasonable enough proxy though there are many reasons to believe it will degrade over time unless we come up with other privacy-preserving methods. However, while we won't ever really be able to fully protect someone who views hundreds of pages per day from a wide range of UA+IPs (unless we e.g., drop all pageviews from signed-in accounts and direct people who are concerned about privacy to login), I think it's fair to acknowledge this but ensure that we do protect the "normal" person who views a page at most a few times in a day possibly from a few different UA+IPs. Additionally, people with e.g., VPNs that are changing their UA+IP will hopefully get filtered out as "automated" or get geolocated to many different countries which will provide some additional privacy as their pageviews will be scattered across different data points and have more noise added as a result.

This looks awesome @Isaac! Can't wait to try it out.

This looks awesome @Isaac! Can't wait to try it out.

Thanks! If there's anything that doesn't make sense, let me know and I'll try to explain. Or additional features that would be useful and I can see if they'd be easy to incorporate.

Parking some thoughts from my conversation with @Isaac after his good work this past couple weeks.

In Wikimedia's environment, due to the massive amounts of data that we publish, we need to assess whether the noise introduced in the dataset is truly indistinguishable from "actual" records. In a "closed" environment like FB or Google this challenge doesn't exist, so the DP theoretical guarantees are also "real" guarantees. I think (but @TedTed can correct me if this is wrong) that the public release of this data with noise is also potentially vulnerable to a linkage attack (in a similar way that a k-anonymization scheme would be), and we need some kind of adversarial testing that proves that noise and real records cannot be distinguished in the to-be-released dataset.

@Isaac, that's an amazing demo! I love this, thanks for your work and thoughtful analysis =)

A few minor points:

  • Rounding to the nearest integer gets rid of one obvious problem with insecure noise generation, but I'm not sure it'd get rid of all possible problems. To be on the safe side, I'd recommend using an open-source library that uses a secure noise generation method.
  • We've found that clamping results to 0 is typically better for data exploration purposes, dashboards, etc., while leaving negative results is better for scientific purposes, since it enables unbiased statistical analyses.
  • The "whether to clamp to 0" question essentially goes away if you use confidence intervals to remove unreliable results, so it's a good compromise. If you choose to do that, you can always keep the original noisy data (before filtering/clamping) somewhere, in case researchers have a specific analysis in mind and ask you for the full data later.
  • The temporal aggregation suggestion can be done automatically, we've done something like this for our Search Trends Symptoms Dataset release: https://arxiv.org/abs/2009.01265 (end of Section 4).

@Nuria, I'm not sure I understand your concern: DP guarantees are still present even if some (or even almost all) of the data is known by the attacker, and linkage attacks don't change that. I'm also not sure I understand the kind of adversarial testing you refer to — in a setting like the one you describe, what would the attacker know, and what would they be trying to find out?

in a setting like the one you describe, what would the attacker know, and what would they be trying to find out?

Sorry, let me clarify: what would be known to an attacker is the exact pageviews per project per article, see: https://dumps.wikimedia.org/other/pageview_complete/readme.html
An attacker might try to remove the noise in order to find the pageviews per article, per country.

Hopefully this makes sense.

Ah, I see. Yeah, no anonymization notion can change the fact that some data might be available without any guarantee because of some auxiliary release.

However, it does guarantee that the new differentially private release does not reveal significantly more information than what was previously available. The scenario you describe, where an auxiliary data release is used to estimate the true counts better than if an attacker only had access to DP counts, only results in catastrophic privacy loss if this was already feasible without the new DP counts. The Bayesian guarantees offered by DP bound the increase in information that the attacker can figure out with vs. without the new data release, no matter what the attacker knew before. This is called "resistance to auxiliary information" in the literature.

Maybe a closer look at a previous data release could suggest that using strong anonymization techniques on this new release might be partly pointless, because the information is available in some other way. But then, imho, the right course of action is still to ensure that this new release does not increase risk, which DP can guarantee; and as a second step look at past data releases and maybe change their strategy going forward (possibly with DP too), to increase the privacy properties of the releases over time.

that's an amazing demo! I love this, thanks for your work and thoughtful analysis =)

Thanks!

Rounding to the nearest integer gets rid of one obvious problem with insecure noise generation, but I'm not sure it'd get rid of all possible problems. To be on the safe side, I'd recommend using an open-source library that uses a secure noise generation method.

Makes sense. I personally will sleep better knowing that we're not using my own implementation too :) I think our default would be to try to use the Apache Beam implementation (I believe this one: https://opensource.googleblog.com/2020/06/expanding-our-differential-privacy.html) but that's outside my expertise.

We've found that clamping results to 0 is typically better for data exploration purposes, dashboards, etc., while leaving negative results is better for scientific purposes, since it enables unbiased statistical analyses.

Yeah, makes sense. At some point, I'm hoping to find some time or collaborators to look at the research side of this. Right now focused on getting a dataset that works for more qualitative uses (e.g., editors looking up top-edited pages in their region so they can improve them) but we're already brainstorming some simple tests to run on the data that would let us determine what sorts of research questions the data is still useful for and which it isn't. For example, correlation between # of pageviews and # of edits before and after applying DP.

The "whether to clamp to 0" question essentially goes away if you use confidence intervals to remove unreliable results, so it's a good compromise. If you choose to do that, you can always keep the original noisy data (before filtering/clamping) somewhere, in case researchers have a specific analysis in mind and ask you for the full data later.

Good point that any "prettification" of the data should probably be done on an application-specific basis as opposed to upfront when we store the noisy counts. Won't impact privacy at all even if the unfiltered, non-rounded, non-clamped values are shared.

The temporal aggregation suggestion can be done automatically, we've done something like this for our Search Trends Symptoms Dataset release: https://arxiv.org/abs/2009.01265 (end of Section 4).

Yep, that paper has been super helpful. My thought was that instead of using other languages/regions, we could just look back in the data to make the decision. So say we decide we'll release this dataset starting February 1st, then we'd first build the dataset for e.g., the final two weeks in January and if a given region+language has at least k articles that meet the release threshold on 7 of the days, we'd go with daily for that region+language. Otherwise, weekly (or monthly). Some caveats in that we'll probably also determine what languages+region pairs we never release because we don't expect them to have enough data (saves us some compute). We'd still do this all based on just the noisy data so no privacy budget would be consumed. I assume k would be somewhere in the neighborhood of 20ish to be deemed "useful" for editors.
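A hedged sketch of that cadence rule (hypothetical helper and inputs: per-day counts of articles that clear the release threshold for one region+language pair over a two-week trial window; k of roughly 20 per the comment above):

```python
# Decide daily vs. coarser cadence based on a trial window of noisy data.
def choose_cadence(articles_over_threshold_per_day, k=20, min_good_days=7):
    """Return 'daily' if enough trial days had at least k releasable articles,
    otherwise fall back to a coarser cadence."""
    good_days = sum(1 for n in articles_over_threshold_per_day if n >= k)
    return "daily" if good_days >= min_good_days else "weekly (or monthly)"

# e.g. a small region+language pair that only occasionally clears the threshold:
print(choose_cadence([25, 3, 0, 18, 40, 2, 1, 0, 22, 5, 30, 0, 19, 21]))
```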

But then, imho, the right course of action is still to ensure that this new release does not increase risk, which DP can guarantee; and as a second step look at past data releases and maybe change their strategy going forward (possibly with DP too), to increase the privacy properties of the releases over time.

Makes sense. I guess the point being that e.g., for Greenlandic Wikipedia, the DP pageview data wouldn't tell you much more than what you already could infer from just looking at the raw pageview counts to Greenlandic Wikipedia. Our thresholds/noise will be high enough that you'd never be able to look at a difference between all pageviews to a Greenlandic Wikipedia article and DP pageviews to that article in Greenland and say with any confidence what caused that difference (noise vs. someone reading the page from any given country).

One thing we should be aware of: right now, we don't need to bother adding noise to articles that received 0 pageviews across all regions to see if they clear the release threshold because there is nothing to be learned about the pageview counts to these articles (it's already known they received 0 pageviews in all regions). If we ever switch the raw daily public pageview counts to DP pageview counts, we'd have to revisit that and it would potentially add a fair bit of computation to this job because only 30% of articles on Wikipedia receive pageviews on any given day. Fun fact: out of the large wikis, English Wikipedia and Japanese Wikipedia are at the top with 67% and 74% of pages receiving pageviews daily while Cebuano/Swedish/Egyptian Arabic/Vietnamese/Waray Wikipedias are at the bottom with <10% of pages receiving pageviews daily. Cebuano is at 0.1% of pages receiving pageviews daily...

I also wanted to make an update to my proposal about release thresholds. I initially said let's just go with only releasing data that has at least a 50% chance of being within 25% of the actual value (what Google had used), which was 28 pageviews after adding noise for an epsilon of 0.5 and sensitivity of 5. For their data release, however, they had a relatively constrained number of geographic areas they could release data for. For us, we're talking about Wikipedia articles, and English alone has >6M articles. I looked at what this 50/25 threshold would mean and, because there are so many pages that get just a few pageviews and even an article with 0 pageviews would have a 3% chance of clearing 28 pageviews after noise, the resulting dataset is dominated by pageview counts that are not accurate. So I want to tighten my proposed threshold. For example, if we raise it to a 78% chance that the noisy value is within 10% of its true value, that becomes a threshold of 150 noisy pageviews to be released, and ~97% of the data that is released is accurate. We can play with the right thresholds, but the gist is that release thresholds for other datasets need to be greatly tightened for this dataset given the long-tail of articles with very few pageviews.
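Those thresholds follow from the Laplace tail bound; a small sketch reproducing the 28 and ~150 figures under the sensitivity=5, epsilon=0.5 assumption above:

```python
# With Laplace noise of scale b = sensitivity / epsilon, the chance that the
# true count lies within a fraction `rel` of the noisy count v is roughly
# 1 - exp(-rel * v / b), so requiring confidence q gives a minimum noisy count
# of -b * ln(1 - q) / rel.
import math

def release_threshold(sensitivity, epsilon, rel_error, confidence):
    b = sensitivity / epsilon
    return -b * math.log(1 - confidence) / rel_error

# 50% chance of being within 25% of the true value (the original 50/25 rule):
print(round(release_threshold(5, 0.5, 0.25, 0.50)))  # ~28
# 78% chance of being within 10% of the true value (the tightened proposal):
print(round(release_threshold(5, 0.5, 0.10, 0.78)))  # ~151, i.e. roughly 150
```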

Had more time to review this and appreciate the coolness of this: "I apply a flexible threshold in the tool based on an approach from Google where you calculate the likelihood that the real datapoint is within X% of the noisy data point and threshold based on that. So in the tool, noisy data points that are calculated to have less than a 50% chance of being within 25% of the actual value are greyed out. The 50%/25% parameters can be adjusted but are a reasonable starting place"

There are a lot of valuable insights in @Isaac's work. I think next we could get a Jupyter notebook going (if we can install Apache Beam via pip) and compare @Isaac's results/thresholds with the ones Beam would provide with a private collection.

Not sure if we can invoke Beam's Java code through Python bindings; if not in Python, it seems it would be possible in a Scala notebook?

This data was released. Due to various technical factors, there are three distinct datasets:
https://analytics.wikimedia.org/published/datasets/

Following up on this as the primary person working on this project for the past 18 months with some details of how this dataset is different from the existing API data:

  1. The threshold on total global pageviews for a page to be included in the dataset is 150 (vs. 1000 in the API)
  2. This data is split by country and project and page (vs. country and page [across multiple projects] or project and page [with multiple countries])
  3. This dataset uses differential privacy to add a small amount of random noise, in order to hide the contribution of an individual to the dataset and prevent dataset linkage/reidentification attacks — this is the meaning of the "noise type" and "noise scale" parts of the README (vs. no noise added to the API)
  4. We have a minimum release threshold of:
    1. 90 pageviews per country-project-page (from 6 Feb 2023 to present)
    2. 450 pageviews per country-project-page (from 9 Feb 2017 to 5 Feb 2023)
    3. 3500 pageviews per country-project-page (from 1 July 2015 to 8 Feb 2017)