
Make aggregate data on editors per country per wiki publicly available
Closed, Resolved · Public · 0 Estimated Story Points

Assigned To
Authored By
Nuria
Mar 30 2016, 7:04 PM

Description

The requestor of this data is Asaf on behalf of Emerging Wikimedia Communities

  • bucketed active and very active counts per country, plus per region, plus global north vs global south -- all currently generated by the 'geowiki' code in the private area.
  • archive of these by month

Ability to link to reports permanently so "current (bucketed) active editors for Ghana" can be made a permanent link that would lead to the latest numbers.

The bucket size requested is 10 for editing data. I think the Analytics team did some work in the past suggesting that this bucket size is too coarse to be released to the public. Asaf mentioned that Legal had signed off on this data request, but while Legal can determine whether it infringes the privacy policy, it is a lot harder for them to assess the security risk that the release of a new dataset poses, as attacks depend on what other datasets are available.

A link to the WIP data access guidelines being worked on now: https://office.wikimedia.org/wiki/User:Csteipp/Data_access_guidelines

Related Objects

Event Timeline


As T188859 was merged into this task several months ago, is there any progress on this topic? I think it would be really interesting to have this in the next Wikistats 2.

We are now in the process of clarifying with the Legal team where the privacy threshold for this data lies and whether it is OK, in their view, to disclose the geolocation of an editor, likely a contentious topic in our community.

Is there a particular reason the draft data access guidelines linked above need to be specifically confidential? If not, could they be posted to a public wiki?

Somewhat related: T207171: Have a way to show the most popular pages per country.

And I join the people who are asking to make this information available.

Information about countries where the numbers are so small that they can generate privacy problems should be available, too, but it can be restricted to trusted people. ("Trusted" doesn't necessarily mean "WMF staff"; it can be anyone with good intentions, understanding of privacy concerns, and proof of need-to-know.)

@Yair_rand The guidelines are not confidential; they apply to a set of systems that only people working at WMF (and research collaborators) have access to, so they are posted to the WMF internal wiki.

@Nuria The WMF internal wiki isn't publicly accessible, making its contents unavailable to the community. Unless there's particular reason the guidelines need to be withheld, I would think that reasonable transparency would require that they not be hidden inside the WMF's private wiki.

@Yair_rand the public guidelines as to data retention are public in the privacy policy: https://foundation.wikimedia.org/wiki/Privacy_policy

FYI @Amire80 https://phabricator.wikimedia.org/T207171 has no privacy issues for the most part as the data will be shown only for the highest buckets.

@Yair_rand the public guidelines as to data retention are public in the privacy policy: https://foundation.wikimedia.org/wiki/Privacy_policy

Yes, but the guidelines @Yair_rand mentioned are significantly more detailed about the actual practices people with access to data should follow, and people outside the Foundation have a legitimate interest in knowing what they are, whether to assess the trustworthiness of our data analysis practices or (for volunteers with access to private data) to actually follow them.

I've just filed T219542 to work further on this.

Milimetric moved this task from Done to Paused on the Analytics-Kanban board.

Some recent work on this.
@JFishback_WMF is working on a risk assessment framework with Legal that we can apply to data releases such as this one. I took a look at the data harvested after the major refactor of the data harvesting jobs. As far as I can see, on the daily edits tally of eswiki and arwiki, about half of daily edits (aggregated for all countries) come from anonymous editors. Once we aggregate the data monthly this ratio changes dramatically: there are about 5% authenticated editors and 95% anonymous editors on the monthly tally (again, aggregated per country). Pinging @Asaf: in the per-country releases requested, are you also thinking about anonymous editors?

Note: I am talking here about "editors", not "edits"; the number of "edits" by authenticated users is much greater.

Yes, anonymous editors matter too. Though I am mostly interested in the old "active" (>5/month) and "very active" (>100/month) definitions, and have next to no interest in the >1/month group. Presumably a large section of the anonymous editors make only a single edit, or fewer than 5 anyhow.

@Asaf : that is correct, only about 5% of the anonymous "entities" have more than 5 edits.

The bucket size requested is 10 for editing data.

This might be counterintuitive, but the bucket size for a country has little to do with the privacy risk associated with being able to geo-locate an editor.

Example:
If there are 10,000 edits for eswiki across 10 countries, and one of those countries, say Argentina, has 25 people in the 100+ edit bucket, you can easily narrow down who those 25 people might be by looking at our public edit data (every single edit is public) and listing the editors who did 100+ edits on eswiki that month. If there are 50 editors with 100+ edits for eswiki that month and we have reported buckets like (Argentina, 25), (Brazil, 15), (Other, 10), the probability of any one editor selected at random being in Argentina is 25/50, or 0.5, which is pretty high.

So, in this case, privacy is a function of the total number of editors in the 100+ bucket plus the distribution of those editors across countries. A distribution of (Argentina, 40), (Other, 10) in the example above would tell you that 4/5 of editors with 100+ edits this month were located in Argentina, so while the country bucket is larger in the second example, the probability of localizing an editor to the actual country has not decreased but increased.
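The arithmetic above can be sketched in a few lines; the country counts are the hypothetical figures from the example, and the point is that the probability depends only on the distribution, not on any bucket size:

```python
def p_country(counts, country):
    """Probability that a randomly chosen reported editor is in `country`."""
    total = sum(counts.values())
    return counts[country] / total

# First example: 50 editors with 100+ edits, reported per country
example_1 = {"Argentina": 25, "Brazil": 15, "Other": 10}
print(p_country(example_1, "Argentina"))  # 0.5

# Second example: larger Argentina bucket, higher localization probability
example_2 = {"Argentina": 40, "Other": 10}
print(p_country(example_2, "Argentina"))  # 0.8
```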

An idea would be to report editors (without buckets, just the ones with 5+ edits on content pages) only where the probability of identifying an editor's country is less than a threshold. If Legal thinks there is no issue with precisely identifying an editor's country this distinction does not matter, but regardless, the requested "bucket size of 10" does not add any "privacy budget" as far as I can see.

Proposal:

  • let's release data for editors with 5+ edits per country (regardless of size of bucket) per wiki, and let's not release the 5+ and 100+ buckets separately
  • some countries in which surveillance is prevalent will be blacklisted and no data will be released. See: https://dash.harvard.edu/bitstream/handle/1/32741922/Wikipedia_Censorship_final.pdf
  • let's not release data for countries whose population is below a threshold, regardless of size of bucket.

Per @ezachte's criteria for "live" Wikipedias, this data should not include dead/un-editable Wikipedias. Erik's words on this below:

Leila and I looked into how many wikis have at least a modicum of activity.
Even when we set the threshold for activity quite low, at 3 or more editors doing 5 or more edits a month, still the majority of our wikis don't qualify.
In July 2016 only 344 out of 865 wikis qualified.
For the Wikipedias specifically, about 160 meet that very low target, a number which has been stable since 2008.
Other projects: https://stats.wikimedia.org/EN/ProjectTrendsActiveWikis.html
BTW we feel the extreme lower limit of 1 or more active editors is, well too extreme.
Wikis only start to come to life for real, when editors collaborate or at least potentially vet each others work.
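The "live wiki" criterion Erik describes can be sketched as a simple filter; the per-editor edit counts below are hypothetical:

```python
def is_live(edit_counts, min_editors=3, min_edits=5):
    """edit_counts: mapping of editor -> edits on one wiki in one month.

    A wiki counts as live if at least `min_editors` editors each made
    `min_edits` or more edits that month.
    """
    qualified = sum(1 for n in edit_counts.values() if n >= min_edits)
    return qualified >= min_editors

print(is_live({"a": 12, "b": 7, "c": 5, "d": 1}))  # True: three 5+ editors
print(is_live({"a": 12, "b": 7, "d": 1}))          # False: only two
```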

BTW, my notebook on this is Test_geoeditors_pyspark.ipynb

Quick status update. I am currently evaluating ways to release this data. This is just while we wait for our privacy framework to be finished. As soon as that's done, we can evaluate our possible solutions here and execute the release fairly quickly.

Following up on @Nuria's proposal, @Ijon stated above that he needs the 100+ bucket. We should abide by this request, unless he states otherwise. The only possible compromise I can think of is releasing only whether or not there are 100+ editors, like so:

wiki     country     5+ editors   100+ editors exist
eswiki   Argentina   30-40        True
eswiki   Other       10-20        False

I also think the privacy analysis in T131280#5322897 should be looking at buckets as follows. In the example where the set of countries of eswiki editors is (Argentina, Other), if the raw data is (Argentina: 36, Other: 12), then the bucketed output would be: (Argentina: 30-40, Other: 10-20). An attacker trying to determine the probability that an editor is in Argentina would have to look at the range from: (Argentina: 30, Other: 19) to: (Argentina: 39, Other: 11). This would make P(editor-in-Argentina) range from 61% to 78%. This is a fairly wide margin, introducing a bit of uncertainty, and it's the most degenerate example. Most active wikis will have five or more countries, so for example if we have:

wiki     country     5+ editors
eswiki   Argentina   10-20
eswiki   Spain       10-20
eswiki   Mexico      30-40
eswiki   Colombia    10-20
eswiki   USA         10-20

Then P(editor-in-Mexico) would range from (Mexico: 30, Argentina: 19, Spain: 19, Colombia: 19, USA: 19) to (Mexico: 39, Argentina: 10, Spain: 10, Colombia: 10, USA: 10) which is 28% to 49%. This is not a rigorous way to look at the data release, and that's one of the things I'm working on now. But it does show some justification for the bucketing approach.
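The range arithmetic in both examples can be sketched as follows; the exact bucket endpoints used (30..39, 11..19) are an assumption chosen to match the numbers in the walkthrough above:

```python
def p_bounds(target_lo, target_hi, others_lo, others_hi):
    """Bound P(random editor is in the target country) given bucketed counts."""
    p_min = target_lo / (target_lo + others_hi)  # worst case for the attacker
    p_max = target_hi / (target_hi + others_lo)  # best case for the attacker
    return p_min, p_max

# eswiki example: Argentina 30-40, Other 10-20
print([round(p, 2) for p in p_bounds(30, 39, 11, 19)])  # [0.61, 0.78]

# five-country example: Mexico 30-40, four other countries at 10-20 each
print([round(p, 2) for p in p_bounds(30, 39, 4 * 10, 4 * 19)])  # [0.28, 0.49]
```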

Thanks for making progress on this!

I'm afraid a boolean on the 100+ is next to useless, though: practically all wikis have at least that one person hacking away at a 100+ level. It's important for us to be able to determine whether there are, say, at least 10 100+ editors in a country. Being a more robust number than the easily fluctuating 5+ count, it is also the best measure for a local group/affiliate to assess the "community size" (as it, at some point, moves from bucket to bucket).

I don't want to re-hash the conversation we've had several times by now about how we never committed to not revealing a user's country, and how, while ideally it would be impossible to determine, mitigating the risk of it being determined should not *obviously* be prioritized over the very immediate benefit to our communities.

Instead, if it would help get this feature out the door at long last, perhaps we can blacklist a few countries where we have concrete reasons to fear the risk of identifying a contributor's country (e.g. some central Asian dictatorships), and provide bucketed results for the rest?

Thanks very much. I also would much prefer not to rehash the conversations we've already had. We're ready to release this data, and the work we're doing to find the best way to release it is just preparation until the privacy framework is ready. We just want to be able to justify our decisions. Bucketing and blacklisting seem like they fit into the privacy framework drafts I've seen so far, and I'll take some time during paternity leave to fill out that rationale if needed. So, here's where we are so far:

  • bucketing and blacklisting seem like the best approach we've come up with so far (me and others are looking into alternative approaches)
  • we will release 5+ counts and 100+ counts per wiki per country
  • we will do all of this as soon as the privacy framework is available

Unless I'm very much mistaken, the described system will make it possible to determine the country of specific editors, in some situations.

(Is one not supposed to mention specific possible privacy-related attacks on public tasks? If so, how would one bring those up?)

@Yair_rand, that's what we're trying to prevent, yes. The value of the data is great, and the risk will be minimized as much as possible. As Asaf points out above, we have had this conversation for a very long time. Our legal and security teams have thought about the potential danger of this dataset and signed off on us publishing it. Nevertheless, I personally would like to protect this dataset as much as possible and that's why I'm looking into how to make it harder to determine the country of specific editors. Does that make sense? Do you have additional concerns?

I could use a collaboration on the list of countries to blacklist. The paper that Nuria mentions includes: China, Cuba, Egypt, Indonesia, Iran, Kazakhstan, Pakistan, Russia, Saudi Arabia, South Korea, Syria, Thailand, Turkey, Uzbekistan, Vietnam. But the reason for censorship is pretty different in each country, and they don't all seem like they need a blacklist. I tried to guess at a first draft of the blacklist but honestly I'm not sure. The governments in not just those countries but those regions seem pretty troubling to me. And I don't have enough knowledge to know when something goes from troubling to dangerous.

Change 530878 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] [WIP] draft of outputting druid geoeditor queries

https://gerrit.wikimedia.org/r/530878

My latest patchset on that change above is just a draft implementing some of the thoughts so far. It implements the following so that we have a place to start from when we finalize our thoughts on privacy here:

  • respects a blacklist of countries (actual countries belonging in the list TBD, perhaps Trust-and-Safety have a list of countries we can start our blacklist from?)
  • only reports 5 to 99 and 100 or more activity levels
  • only reports numbers for Wikipedia projects or central projects like commons
  • only considers wikis with 3 or more active editors overall (not per country)
  • buckets output to obscure exact numbers and add uncertainty in a probabilistic attack

See the insert script for more details
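As a rough illustration only (not the actual refinery query), the filters listed above could be sketched like this; the function names, the two-country blacklist, and the row layout are all hypothetical:

```python
BLACKLIST = {"China", "Iran"}  # placeholder; the actual list is TBD in this task

def bucket(count, size=10):
    """Obscure an exact count as a range label, e.g. 36 -> '30 to 39'."""
    lo = (count // size) * size
    return f"{lo} to {lo + size - 1}"

def release_rows(rows, min_active_editors=3):
    """rows: (wiki, country, editors_5plus, editors_100plus, wiki_active_total)."""
    for wiki, country, e5, e100, total in rows:
        if country in BLACKLIST:
            continue  # skip blacklisted countries entirely
        if total < min_active_editors:
            continue  # skip wikis below the 3-active-editor threshold (overall)
        yield wiki, country, bucket(e5), bucket(e100)
```

For example, a row with 36 editors at 5+ and 12 at 100+ would be released as "30 to 39" and "10 to 19", obscuring the exact numbers.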

Unless I'm very much mistaken, the described system will make it possible to determine the country of specific editors, in some situations.

yes, to be clear that is a risk that we can mitigate, not eliminate

(Is one not supposed to mention specific possible privacy-related attacks on public tasks? If so, how would one bring those up?)

on the contrary we welcome those

@Milimetric I think for our first release we can remove all countries mentioned on the surveillance report, we can work on adding them later if we think that to be a safe measure.

Several examples of ways to find out a user's country: (Note: A lot of this depends on how frequently this report will be published.)

  • Data is published, some time passes, during which precisely one new editor has shown up who wasn't there previously. The counter for one country goes up a step from, say, 20-30 to 30-40. This user's country has been revealed.
  • A very small wiki that normally wouldn't have the data published has exactly one editor. A malicious user trying to determine that editor's country creates nine "active" accounts in a particular country. If the counter shows 10+, the country has been revealed. The malicious user can repeat, and also test many countries at once to narrow it down.
  • Combination of the above tactics: A malicious user tries to keep several countries' counters at the boundary, so that either a new user will cause the count to increment, or a leaving user will cause it to decrement. If all users joining in the same duration but one can be ruled out by time-zone data or similar, the remaining user's country has been revealed.
  • There are ten active editors on a project. One of the countries shows 10-20. The country of all of the editors has been revealed.
  • There are 16 active editors on a project. A malicious user creates four active accounts in a country. The counter shows 20-30. The country of all of the editors has been revealed.

@Yair_rand your examples show ingenuity, yet they also seem somewhat contrived. Suppose some malicious, geeky, and rather obsessed user would go to such lengths to 'exploit a weakness' in the privacy protection, and they learn the country of a Wikimedian who doesn't want to reveal it: how much damage could be done? Say China, with its enormous resources, finds out that 16 active editors on a small wiki all edit from Taiwan. How much would they have learned? Taiwan has a population of 23+ million. That geeky detective could probably also learn from text analysis (English isn't spoken the same in different countries), from analysis of edit times (where waking hours are a proxy for time zone), or from edits being spaced wider apart in countries with low bandwidth. I admit these are all contrived examples as well, only effective in combination, and in the hands of a very geeky and obsessed malicious user with infinite resources. It's probably easier for such a geek to infiltrate our security by social engineering, placing a mole, and what have you.

As I mentioned in my prior post, I do not think it is such a great idea to release 5+ and 100+ counts separately. Also, to correct what @Milimetric says above: per @Erik_Zachte's definition we consider an active Wikipedia one with 3+ editors with 5+ edits per month.

Ideas to lower the number of potential problems:

  • Make the numbers less precise, somehow? Something like making each unique editor have a 10% chance of being skipped in the count and 10% chance of being counted twice, or some such. (This would need to persist per editor between reports, I think. Maybe even do the same thing to the changes in numbers when someone leaves.)
  • Don't generate a new report when only one editor has joined or left since the previous report. Perhaps even per-country, don't update unless there's a change of at least several counted users since the last report.
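The first idea resembles a persistent randomized count. A hypothetical sketch: hashing a stable editor identifier with a fixed salt makes each editor's skip/double-count decision deterministic across reports, so the noise doesn't average away between publications. The identifier scheme and salt here are made up for illustration:

```python
import hashlib

def noisy_count(editor_ids, salt="monthly-geo-release"):
    """Count editors, but skip each with 10% probability and double-count
    with 10% probability; decisions are stable per editor across reports."""
    total = 0
    for eid in editor_ids:
        h = hashlib.sha256(f"{salt}:{eid}".encode()).digest()
        r = h[0] / 255  # stable pseudo-random value in [0, 1] per editor
        if r < 0.10:
            continue       # ~10%: skip this editor
        elif r < 0.20:
            total += 2     # ~10%: count this editor twice
        else:
            total += 1     # ~80%: count normally
    return total
```

The expected value stays at the true count (0.1·0 + 0.1·2 + 0.8·1 = 1 per editor), but an attacker can no longer rely on a counter moving by exactly one when one editor joins or leaves.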

(Probably stupid question: Aren't there people who actually specialize in this kind of thing, with established methods for preventing leaking data on particular individuals?)

(Probably stupid question: Aren't there people who actually specialize in this kind of thing, with established methods for preventing leaking data on particular individuals?)

Of course, there are many statistical methods to mask data. Now, the base issue is that the requesters of this data (@Asaf and others) think that leaking data on particular users is not an issue and you are contending the opposite. This request is for data on a very granular scale.

@Ijon I'm working on a blacklist, and wanted to check with you to see how it would impact the usefulness of the dataset. I'll write more details but basically on the advice of folks more familiar with censorship of Wikipedia I'm using scores from Reporters Without Borders [1] and Freedom on the Net [2]. It makes more sense to blacklist the top wikis used in each of these countries, but that is often English Wikipedia and things get confusing. So sticking with the simpler approach of just blacklisting the countries, here's a list of the worst offenders according to those two sources:

  • Libya, Egypt, Somalia, Equatorial Guinea, Azerbaijan, Bahrain, Yemen, Cuba, Iran, Laos, Saudi Arabia, Djibouti, Syria, Sudan, Vietnam, China, Eritrea, North Korea, and Turkmenistan [1]
  • Kazakhstan, Myanmar, Belarus, Thailand, Sudan, Venezuela, Turkey, Russia, United Arab Emirates, Bahrain, Egypt, Saudi Arabia, Pakistan, Uzbekistan, Vietnam, Cuba, Syria, Ethiopia, Iran, and China [2]

The union of the two is: Azerbaijan, Bahrain, Belarus, China, Cuba, Djibouti, Egypt, Equatorial Guinea, Eritrea, Ethiopia, Iran, Kazakhstan, Laos, Libya, Myanmar, North Korea, Pakistan, Russia, Saudi Arabia, Somalia, Sudan, Syria, Thailand, Turkey, Turkmenistan, United Arab Emirates, Uzbekistan, Venezuela, Vietnam, Yemen

This does not include a lot of countries that seem fairly hostile to journalism according to the sources, but including those seemed to render the dataset almost useless. Let us know what you think soon so we can refine our approach if needed.

[1] Reporters Without Borders: https://rsf.org/en/ranking
[2] Freedom on the Net: https://freedomhouse.org/report/freedom-net/freedom-net-2018/rise-digital-authoritarianism
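For reference, the stated union can be reproduced mechanically from the two lists as transcribed above:

```python
rsf = {  # worst offenders per Reporters Without Borders [1]
    "Libya", "Egypt", "Somalia", "Equatorial Guinea", "Azerbaijan",
    "Bahrain", "Yemen", "Cuba", "Iran", "Laos", "Saudi Arabia", "Djibouti",
    "Syria", "Sudan", "Vietnam", "China", "Eritrea", "North Korea",
    "Turkmenistan",
}
fotn = {  # worst offenders per Freedom on the Net [2]
    "Kazakhstan", "Myanmar", "Belarus", "Thailand", "Sudan", "Venezuela",
    "Turkey", "Russia", "United Arab Emirates", "Bahrain", "Egypt",
    "Saudi Arabia", "Pakistan", "Uzbekistan", "Vietnam", "Cuba", "Syria",
    "Ethiopia", "Iran", "China",
}
blacklist = sorted(rsf | fotn)
print(len(blacklist))  # 30 countries in the union
```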

@Milimetric let's go ahead and implement code with the union of the two and refine once we hear from @Ijon

@Ijon I'm working on an explanation to the community for why we feel this is a low risk data set to release in spite of the ability to potentially identify an editor's country. My understanding is that this data set (or something like it) was previously published but then pulled down due to community response - do you have any links to previous community concerns? I'd like to address the issues specifically, if possible, or at least understand the specific issues raised. Thanks in advance.

No, I don't think that's what happened at all. This has been held back for several years now (I have been requesting this since 2014) by objections made by engineers in (or formerly in) the Analytics team. I have not seen negative community responses, except the couple of comments upthread, made following the concerns Analytics had raised.

To my mind, the tremendous, concrete, immediate value of these public metrics -- as the single most robust metric of the strength (and growth/decline over time) of an editor base in a particular *country*, so relevant to all country-level groups -- far outweighs the potential exposure of a user's country to a determined attacker. (We have been tolerating much greater risk to all "anonymous" editors, whom we in fact expose much more drastically by posting their IP addresses with precise timestamps.)

@Ijon Sorry - I was under the impression that there had been some community push back at some point in the past. I must have misunderstood. I think we're all sold on the benefits of releasing this data, I was just seeking to specifically try to address community concerns that I thought had been raised.

Very well. I look forward to the release, thanks.

@Milimetric thanks for bringing this to completion

I see from https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors that anonymous editors are included as well.
Question:
Do we have any estimate of how many ip addresses are not tied to one person?
Either because a provider assigns a new address per session from a pool (-> overcount).
Or because one ip address is shared by a pool of users, e.g. in schools, libraries and cybercafes (-> undercount).

Change 530878 merged by Milimetric:
[analytics/refinery@master] Publish monthly geoeditor numbers

https://gerrit.wikimedia.org/r/530878

Change 545982 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[operations/puppet@production] Sync geoeditors data to dumps and add links

https://gerrit.wikimedia.org/r/545982

review from @Ottomata is appreciated. And btw, why is the mediawiki_history fetch disabled? Is there some (problem/decision to be made) with the (rsync/labs servers) that would affect this as well?

@Milimetric the mediawiki history is quite a large fileset that requires a hadoop client on the dump servers to rsync in a timely manner; this dataset is a lot smaller (in terms of number of files and dataset size), so it should not have that problem

Change 545982 merged by Ottomata:
[operations/puppet@production] Sync geoeditors data to dumps and add links

https://gerrit.wikimedia.org/r/545982

This task has been marked as resolved, so now what about T188859? Will this information be available on https://stats.wikimedia.org/v2/#/all-projects?

@Pamputt at some point, yes. That update will not happen in the next couple of quarters, though, so for now the data will only exist in public files.

I have reopened your task as it was not really a duplicate of the original one but rather a later phase of the project.