Privacy pageview threshold for map report
Closed, Resolved | Public | 13 Estimated Story Points

Description

We're adding a map component in Wikistats 2 that will visualise pageviews by country for either a given project or all Wikimedia projects. We need to define a pageview threshold below which we won't report a country, for privacy reasons.

@ezachte I'd love to know the criteria you used regarding this on WiViVi.
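
As a starting point for discussion, the kind of threshold described above could be sketched as follows. This is only an illustration: the function name, the sample figures, and the `min_views=100` value are hypothetical, and the actual threshold is exactly what this task still has to decide.

```python
from typing import Dict


def suppress_small_countries(views_by_country: Dict[str, int], min_views: int) -> Dict[str, int]:
    """Return only the countries whose pageview count is at least min_views.

    Countries below the threshold are dropped entirely rather than reported
    with a small number. min_views is a hypothetical parameter.
    """
    return {country: views for country, views in views_by_country.items() if views >= min_views}


# Made-up sample data, not real traffic figures.
sample = {"US": 1_200_000, "RW": 850, "ER": 40}
print(suppress_small_countries(sample, min_views=100))
```

Whether a country exactly at the threshold is reported (`>=` here) is a design choice the sketch makes arbitrarily.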

Event Timeline

@fdans WiViVi doesn't use a threshold because the aggregation level is so high that individuals don't stand out from the crowd, except for fringe cases (and even then, how serious is that?).
WiViVi reports monthly request counts, broken down by originating country and target wiki.
Yes, one person can account for all page requests for a very small wiki from a very small country (fringe case).
I have been working on the premise this is not a privacy hazard.

When we had extensive discussions on privacy and page views/edits long ago, as I recall, the discussion revolved around

  • page edits (alas those are no longer collected), editing sensitive articles is way more privacy sensitive than reading them (no WMF wiki only contains sensitive content)
  • reporting per article (WiViVi doesn't)
  • specific lat/long coordinates (where one could pinpoint a location, say to the level of a city). I used lat/long and timestamp, both with obfuscation (random error), for the viz. at https://stats.wikimedia.org/wikimedia/animations/requests/ (countries played no role in that viz.)
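
To make the obfuscation idea in the last bullet concrete, here is a minimal sketch of adding random error to a coordinate and a timestamp. The thread does not say which noise distribution or magnitude was used for the animation, so the uniform noise, the `max_deg` and `max_seconds` parameters, and the function name are all assumptions.

```python
import random
from typing import Optional, Tuple


def obfuscate(lat: float, lon: float, timestamp: float,
              max_deg: float = 0.5, max_seconds: float = 1800.0,
              rng: Optional[random.Random] = None) -> Tuple[float, float, float]:
    """Add bounded uniform random error to a location and a Unix timestamp.

    max_deg and max_seconds are hypothetical defaults chosen for illustration.
    """
    rng = rng or random.Random()
    # Clamp latitude to the valid range after adding noise.
    noisy_lat = max(-90.0, min(90.0, lat + rng.uniform(-max_deg, max_deg)))
    # Wrap longitude back into [-180, 180).
    noisy_lon = ((lon + rng.uniform(-max_deg, max_deg) + 180.0) % 360.0) - 180.0
    noisy_ts = timestamp + rng.uniform(-max_seconds, max_seconds)
    return noisy_lat, noisy_lon, noisy_ts


print(obfuscate(52.0, 4.0, 1_500_000_000.0))
```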

@ezachte the trouble comes if there is so little reading traffic that it coincides with editing traffic. If that's the case, any pageview data could be used in combination with the public edit history. And if that pageview data includes geographic location, the potential danger increases.

But we have been talking about this in a theoretical way for a while. We're going to run the numbers and see what the reality is at a monthly/weekly aggregation. We'll post our findings, as I think this could unblock a lot of good work in this space.

@Erik_Zachte is this a fair summary of the restrictions in WiViVi?

  • For a Wikipedia to be shown, it has to have a minimum of 0.1% of all traffic in pageviews.
  • Data in the choropleth is bucketed.
  • Only exact data for the 23 top countries is shown for a given Wikipedia.
  • But there are Wikipedias, for example Kinyarwanda Wikipedia, that report 6 countries. Is this because there were only 6 countries visiting that Wikipedia, or because a country must have a minimum number of views to appear in the list?

I also notice that no country shows up that has fewer than 1.0K views. Is that another limit along with the 0.1%?

  • For a Wikipedia to be shown, it has to have a minimum of 0.1% of all traffic in pageviews.

Yes. That happens in the Perl job that prepares the csv files: TrafficAnalysisGeo.pl
Only languages that receive > 0.1% for some country are listed for that country.

Also, when a wiki doesn't score 0.1% of pageviews for any country, it is not in the list of languages at all.

Also, there is a limit of 200 languages to report on, ranked by overall page views (I can't remember why, perhaps performance, but it seems a safe limit given the other restrictions).
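
The rules above can be sketched in Python as follows. This is not the logic of TrafficAnalysisGeo.pl itself, which is not shown in this thread. The input shape, the function name, the order in which the two limits are applied, and the use of each country's total pageviews as the denominator for the 0.1% share are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def languages_per_country(views: Dict[str, Dict[str, int]],
                          min_share: float = 0.001,
                          max_languages: int = 200) -> Dict[str, List[str]]:
    """views maps country -> language -> pageviews (hypothetical input shape)."""
    # Rank languages by overall pageviews and keep at most max_languages.
    totals_by_language: Dict[str, int] = defaultdict(int)
    for per_language in views.values():
        for language, count in per_language.items():
            totals_by_language[language] += count
    ranked = sorted(totals_by_language, key=totals_by_language.get, reverse=True)
    top_languages = set(ranked[:max_languages])

    listed: Dict[str, List[str]] = {}
    for country, per_language in views.items():
        country_total = sum(per_language.values())
        # A language is listed for a country only if it exceeds min_share there.
        kept = [language for language, count in per_language.items()
                if country_total > 0
                and language in top_languages
                and count / country_total > min_share]
        listed[country] = sorted(kept, key=per_language.get, reverse=True)
    return listed
```

A language that never exceeds the share in any country ends up in no list at all, which matches the second rule without needing a separate pass.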

  • Data in the choropleth is bucketed.

Not sure what you mean by that. If you mean there can only be so many entries per country in the csv file, there is a hard limit of 100 languages per country. Other than that, cutting off the number of languages (which meet the 0.1% criterion) in a list happens in JavaScript.

  • Only exact data for the 23 top countries is shown for a given Wikipedia.

You mean in ' Pageviews to …, split by country' ?
For English Wikipedia I see 40 rows. But you may see fewer. The dialog adapts to window height, and the JavaScript is a bit cautious here as it doesn't factor in font height, which can differ per user.

  • But there are Wikipedias, for example Kinyarwanda Wikipedia, that report 6 countries. Is this because there were only 6 countries visiting that Wikipedia, or because a country must have a minimum number of views to appear in the list?

Yes, the cut-off point is 0.1% of overall views for that wiki.
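
As an illustration of that per-wiki cut-off, a minimal sketch in Python. The function name and the sample numbers are made up, and whether the comparison is strict (`>`) or inclusive is an assumption.

```python
from typing import Dict


def countries_for_wiki(views_by_country: Dict[str, int], min_share: float = 0.001) -> Dict[str, int]:
    """Keep countries whose share of this wiki's overall views exceeds min_share."""
    total = sum(views_by_country.values())
    if total == 0:
        return {}
    return {country: views for country, views in views_by_country.items()
            if views / total > min_share}


# Made-up figures for a small wiki: ER contributes 0.05% and is dropped.
sample = {"RW": 60_000, "US": 30_000, "BE": 9_950, "ER": 50}
print(countries_for_wiki(sample))
```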

  • No country shows up that has fewer than 1.0K views. Is that another limit along with the 0.1%?

The lowest is Eritrea with 1k views, but that is actually an artifact, as the csv file contains numbers rounded to 1k, ready for display. I'm working on a json equivalent for all 3 csv files. In json I will use raw numbers, not rounded ones. But the viz. will keep using the csv files, as raw numbers would increase the size of files which are already large.
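
To show why rounding produces an apparent 1k floor, here is a small sketch. The thread only says the numbers are rounded to 1k. The round-half-up rule and the function name below are assumptions.

```python
def round_to_thousands(views: int) -> int:
    """Round a raw pageview count to the nearest thousand (half rounds up).

    Integer arithmetic is used on purpose: Python's built-in round()
    rounds halves to the nearest even number.
    """
    return (views + 500) // 1000


for raw in (400, 600, 1499, 1500):
    print(raw, "->", f"{round_to_thousands(raw)}k")
```

Under this assumption any raw count from 500 to 1,499 is displayed as 1k, so a 1k entry says little about the true value.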

mforns set the point value for this task to 13. (Dec 15 2017, 4:02 PM)
mforns removed the point value for this task.
mforns set the point value for this task to 13.

I added some documentation for the concerns we discussed and the decisions we took:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews/Pageviews_by_country