
Make aggregate data on editors per country per wiki publicly available
Open, Normal, Public, 0 Story Points

Tokens
"Love" token, awarded by Pamputt."Love" token, awarded by Amire80."Love" token, awarded by Neil_P._Quinn_WMF."Love" token, awarded by MelodyKramer."Love" token, awarded by Jane023."Love" token, awarded by leila.
Assigned To
Authored By
Nuria, Mar 30 2016

Description

The requestor of this data is Asaf on behalf of Emerging Wikimedia Communities

  • bucketed counts of active and very active editors per country, plus per region, plus global north vs. global south -- all currently generated by the 'geowiki' code in the private area.
  • archive of these by month

The ability to link to reports permanently, so that "current (bucketed) active editors for Ghana" can be a permanent link that always leads to the latest numbers.

The bucket size requested is 10 for editing data. I think the Analytics team did some work in the past that showed this bucket size is too coarse to be released to the public. Asaf mentioned that Legal had signed off on this data request, but while Legal can determine whether it infringes the privacy policy, it is much harder for them to assess the security risk that releasing a new dataset poses, since attacks depend on what other datasets are available.

A link to the WIP data access guidelines being worked on now: https://office.wikimedia.org/wiki/User:Csteipp/Data_access_guidelines

Event Timeline


Additional notes and use-cases, from a discussion in IRC:

  • Journalists regularly seek this kind of information, and want to know general statistics for their articles, e.g. the rough numbers in https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerLanguageBreakdown.htm#Portuguese are perfect for them (but that page was last updated in 2013)
  • User groups and documentation writers seek this kind of information while trying to define the background needed to understand the needs specific to their communities.
    • Knowing the current breakdown, together with relative populations and internet penetration, can give a good idea of each country's potential, because some face very different challenges from others.
  • It would be very useful to have an easy way to see any trends over time, so that the impact of things like outreach programs could perhaps be seen. (Either as a table like https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryTrends.htm or some sort of graph/visualization)
Nuria moved this task from Wikistats Production to Dashiki on the Analytics board.Apr 24 2017, 2:57 PM
Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board.Apr 26 2017, 3:26 PM
Milimetric moved this task from Backlog (Later) to Dashiki on the Analytics board.
Nuria moved this task from Dashiki to Wikistats Production on the Analytics board.Jul 17 2017, 4:17 PM
Ijon added a comment.Jul 23 2017, 9:10 AM

Can someone offer a comment on why it was decided to move it from Q4 of last year to Q3 of this year? As a stakeholder, I wish I were included in the conversation that resulted in deprioritizing this.

Nuria added a comment.EditedJul 23 2017, 3:24 PM

@Ijon: we do not have enough resources to tackle this project this quarter or next, sorry about that. We think this project can benefit from our recent work on the edit data lake, but we cannot tackle it any sooner.

Items we need to close before tackling this one are the backend for Wikistats 2.0 and improvements to the edit data lake such as: https://phabricator.wikimedia.org/T161147

Ijon added a comment.Sep 27 2017, 4:23 PM

Thank you, @Nuria, for the clear update. Do we have a solid solution for the concerns your team has had about leaking a user's country due to insufficient bucketing? I'd like some assurance that come Q3, the feature can finally be implemented, rather than someone discovering then that the privacy concern is still there, and us descending into yet another spiral on that question.
Forgive me if you hear a bitter overtone in the above question, but, really, we've been waiting for this data for over four years now, and it's been more than two years since Legal declared bucketing sufficient, so I am wary of additional delays and want to ensure that only truly unavoidable ones remain.

@Ijon: I think that after our implementation of editing metrics, which we hope to launch at the beginning of Q2 as part of the new Wikistats 2.0, we will have more options for computing this data with a level of fuzziness that is still helpful (cc @Milimetric). Give us some space to get the alpha of the new backend ready, and then we will be ready to experiment with this on the new datasets in the data store; thus far our raw data for editing analytics does not have geolocation (cc @JAllemandou to keep me honest).

Ijon added a comment.Sep 27 2017, 4:53 PM

Thank you for the quick response, @Nuria. It sounds as though there is a worthwhile discussion to be had on whether and how we can achieve the statistics requested in this ticket. Unless I misunderstood what you just said, it sounds as though even in Q3, the geo location would not, in fact, be part of the available data, unless Someone Does Something. So I am asking that Someone Do Something. If you need me to escalate this yet again before Someone can Do Something, please let me know and I shall do so.

It looks like this is now getting priority, so I'd like to get involved and set up a hopefully useful approach. Here's what I'm thinking:

  • Finish the subtask that Nuria just created, which is necessary for us to be able to work with this data. The legacy code has been a major part of the slowdown here, nobody knows it intimately and it's been handed over three times since it was originally written. Moving this to our standard tools and making it a first-class citizen will win us better monitoring and reliability. It will also give a couple of us the context we need to take the next step.
  • For the next step, I propose we don't argue too much about the privacy implications, but implement the simple bucketing suggested here, while keeping the privacy question in the back of our minds.
  • Once we have the simple bucketing, let's set up a meeting with legal and @Ijon to present it and show any possible attacks on this data that we think of. Hopefully there are none that we can think of, but if there are, this approach should de-mystify the problem for everyone and allow us to talk about it objectively.

Does that sound fair?

Ijon added a comment.Oct 2 2017, 6:11 PM

Thank you, @Milimetric. That would be great.

Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric set the point value for this task to 0.
Neil_P._Quinn_WMF renamed this task from Making geowiki data public to Make aggregate data on editors per country per wiki publicly available.Mar 5 2018, 7:54 AM
Nuria edited projects, added Analytics; removed Analytics-Kanban.Mar 8 2018, 6:38 PM
Nuria moved this task from Wikistats Production to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria moved this task from Next Up to Parent Tasks on the Analytics-Kanban board.Mar 21 2018, 5:40 AM
Neil_P._Quinn_WMF rescinded a token.
Neil_P._Quinn_WMF awarded a token.

Since T188859 was merged into this task several months ago, is there any progress on this topic? I think it would be really interesting to have this for the new Wikistats 2.

Nuria added a comment.Sep 30 2018, 4:04 AM

We are now in the process of clarifying with the Legal team where the privacy threshold for this data lies and whether, in their view, it is OK to disclose the geolocation of an editor, likely a contentious topic in our community.

Is there a particular reason the draft data access guidelines linked above need to be specifically confidential? If not, could they be posted to a public wiki?

Tbayer removed a subscriber: HaeB.Feb 13 2019, 12:53 AM

Somewhat related: T207171: Have a way to show the most popular pages per country.

And I join the people who are asking to make this information available.

Information about countries where the numbers are so small that they can generate privacy problems should be available, too, but it can be restricted to trusted people. ("Trusted" doesn't necessarily mean "WMF staff"; it can be anyone with good intentions, understanding of privacy concerns, and proof of need-to-know.)

Nuria added a comment.Mar 22 2019, 3:00 PM

@Yair_rand The guidelines are not confidential; they apply to a set of systems that only people working at the WMF (and research collaborators) have access to, and as such they are posted to the WMF internal wiki.

@Nuria The WMF internal wiki isn't publicly accessible, making its contents unavailable to the community. Unless there's a particular reason the guidelines need to be withheld, I would think that reasonable transparency would require that they not be hidden inside the WMF's private wiki.

Nuria added a comment.Mar 25 2019, 2:56 PM

@Yair_rand the public guidelines as to data retention are public in the privacy policy: https://foundation.wikimedia.org/wiki/Privacy_policy

Nuria added a comment.EditedMar 25 2019, 11:10 PM

FYI @Amire80 https://phabricator.wikimedia.org/T207171 has no privacy issues for the most part as the data will be shown only for the highest buckets.

@Yair_rand the public guidelines as to data retention are public in the privacy policy: https://foundation.wikimedia.org/wiki/Privacy_policy

Yes, but the guidelines @Yair_rand mentioned are significantly more detailed about the actual practices people with access to data should follow, and people outside the Foundation have a legitimate interest in knowing what they are, whether to assess the trustworthiness of our data analysis practices or (for volunteers with access to private data) to actually follow them.

I've just filed T219542 to work further on this.

Milimetric moved this task from Paused to Done on the Analytics-Kanban board.Apr 2 2019, 4:06 PM
Milimetric moved this task from Done to Paused on the Analytics-Kanban board.
Nuria added a subscriber: Asaf.Jul 5 2019, 9:25 PM

Some recent work on this.
@JFishback_WMF is working on a risk assessment framework with Legal that we can apply to data releases such as this one. I took a look at the data harvested after the major refactor of the data-harvesting jobs. As far as I can see, on the daily edit tally for eswiki and arwiki, about half of daily edits (aggregated across all countries) come from anonymous editors. Once we aggregate the data monthly, this ratio changes dramatically: there are about 5% authenticated editors and 95% anonymous editors in the monthly tally (again, aggregated per country). Pinging @Asaf: in the per-country releases requested, are you also thinking about anonymous editors?

Note: I am talking here about "editors", not "edits"; the number of "edits" by authenticated users is much greater.

Ijon added a comment.Jul 5 2019, 10:05 PM

Yes, anonymous editors matter too. Though I am mostly interested in the old "active" (>5/month) and "very active" (>100/month) definitions, and have next to no interest in the >1/month group. Presumably a large section of the anonymous editors make only a single edit, or fewer than 5 anyhow.

Nuria added a comment.EditedJul 6 2019, 4:37 AM

@Asaf : that is correct, only about 5% of the anonymous "entities" have more than 5 edits.

Nuria added a comment.EditedJul 10 2019, 11:33 PM

The bucket size requested is 10 for editing data.

This might be counterintuitive, but the bucket size for a country has little to do with the privacy risk of being able to geolocate an editor.

Example:
If there are 10,000 edits for eswiki across 10 countries, and one of those countries, say Argentina, has 25 people in the 100+ edit bucket, you can easily find who those 25 people are by looking at our public edit data (every single edit is public) and counting how many editors on eswiki made 100+ edits that month. If there are 50 editors with 100+ edits for eswiki that month and we have reported buckets like (Argentina, 25), (Brazil, 15), (Other, 10), the probability of any one editor selected at random being in Argentina is 25/50, i.e. 0.5, which is pretty high.

So, in this case, privacy is a function of the total number of editors in the 100+ bucket plus the distribution of those editors across countries. A distribution of (Argentina, 40), (Other, 10) in the example above would tell you that 4/5 of editors with 100+ edits this month were located in Argentina, so while the country bucket is larger in the second example, the probability of localizing an editor to their actual country has not decreased but increased.

An idea would be to report editors (without buckets, just the ones with 5+ edits on content pages) only where the probability of identifying an editor's country is below a threshold. If Legal thinks there is no issue with precisely identifying editors' countries, this distinction does not matter, but regardless, the requested "bucket size of 10" does not add any "privacy budget" as far as I can see.
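To make the arithmetic above concrete, here is a minimal Python sketch (illustrative only; the counts are the hypothetical eswiki numbers from this comment, not real data):

  # Illustrative only: probability that a randomly selected editor in the
  # 100+ bucket is located in a given country, given a reported distribution.

  def p_country(distribution, country):
      """distribution: dict mapping country -> reported editor count."""
      return distribution[country] / sum(distribution.values())

  # First example: (Argentina, 25), (Brazil, 15), (Other, 10)
  print(p_country({"Argentina": 25, "Brazil": 15, "Other": 10}, "Argentina"))  # 0.5

  # Second example: a larger Argentina bucket, and a higher localization risk
  print(p_country({"Argentina": 40, "Other": 10}, "Argentina"))  # 0.8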

Nuria added a comment.Jul 11 2019, 4:09 AM

Proposal:

  • let's release data for editors with 5+ edits per country (regardless of the size of the bucket) per wiki, and let's not release the 5+ and 100+ buckets as separate counts
  • some countries in which surveillance is prevalent will be blacklisted and no data will be released. See: https://dash.harvard.edu/bitstream/handle/1/32741922/Wikipedia_Censorship_final.pdf
  • let's not release data for countries whose population is below a threshold, regardless of size of bucket.
Nuria added a comment.Jul 11 2019, 6:09 PM

Per @ezachte's criteria for "live" Wikipedias, this data should not include dead/un-editable Wikipedias. Erik's words on this are below:

Leila and I looked into how many wikis have at least a modicum of activity.
Even when we set the threshold for activity quite low, at 3 or more editors doing 5 or more edits a month, the majority of our wikis still don't qualify.
In July 2016 only 344 out of 865 wikis qualified.
For the Wikipedias specifically, about 160 meet that very low target, a number that has been stable since 2008.
Other projects: https://stats.wikimedia.org/EN/ProjectTrendsActiveWikis.html
BTW, we feel the extreme lower limit of 1 or more active editors is, well, too extreme.
Wikis only start to come to life for real when editors collaborate, or at least potentially vet each other's work.

Nuria reassigned this task from Milimetric to mforns.Jul 23 2019, 4:22 PM
Milimetric added a subscriber: mforns.
Nuria added a comment.Jul 25 2019, 4:59 PM

BTW, my notebook on this is Test_geoeditors_pyspark.ipynb

Quick status update. I am currently evaluating ways to release this data. This is just while we wait for our privacy framework to be finished. As soon as that's done, we can evaluate our possible solutions here and execute the release fairly quickly.

Following up on @Nuria's proposal, @Ijon stated above that he needs the 100+ bucket. We should abide by this request, unless he states otherwise. The only possible compromise I can think of is releasing only whether or not there are 100+ editors, like so:

wiki     country      5+ editors    100+ editors exist
eswiki   Argentina    30-40         True
eswiki   Other        10-20         False

I also think the privacy analysis in T131280#5322897 should be looking at buckets as follows. In the example where the set of countries of eswiki editors is (Argentina, Other), if the raw data is (Argentina: 36, Other: 12), then the bucketed output would be: (Argentina: 30-40, Other: 10-20). An attacker trying to determine the probability that an editor is in Argentina would have to look at the range from: (Argentina: 30, Other: 19) to: (Argentina: 39, Other: 11). This would make P(editor-in-Argentina) range from 61% to 78%. This is a fairly wide margin, introducing a bit of uncertainty, and it's the most degenerate example. Most active wikis will have five or more countries, so for example if we have:

wiki     country      5+ editors
eswiki   Argentina    10-20
eswiki   Spain        10-20
eswiki   Mexico       30-40
eswiki   Colombia     10-20
eswiki   USA          10-20

Then P(editor-in-Mexico) would range from (Mexico: 30, Argentina: 19, Spain: 19, Colombia: 19, USA: 19) to (Mexico: 39, Argentina: 10, Spain: 10, Colombia: 10, USA: 10) which is 28% to 49%. This is not a rigorous way to look at the data release, and that's one of the things I'm working on now. But it does show some justification for the bucketing approach.
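As a cross-check on the numbers above, here is a minimal Python sketch (illustrative only; the bucket endpoints mirror the ranges used in this comment) that bounds an attacker's probability estimate when only bucket ranges are published:

  # Illustrative only: bound P(editor is in a target country) when each
  # country's 5+ editor count is published as an inclusive (low, high) range.

  def p_bounds(buckets, target):
      lo_t, hi_t = buckets[target]
      others_lo = sum(lo for c, (lo, hi) in buckets.items() if c != target)
      others_hi = sum(hi for c, (lo, hi) in buckets.items() if c != target)
      # The probability is lowest when the target is at its low end and the
      # other countries are at their high ends, and highest in the reverse case.
      return lo_t / (lo_t + others_hi), hi_t / (hi_t + others_lo)

  # Two-country example: roughly 61% to 78%
  print(p_bounds({"Argentina": (30, 39), "Other": (11, 19)}, "Argentina"))

  # Five-country example: roughly 28% to 49% for Mexico
  buckets = {"Argentina": (10, 19), "Spain": (10, 19), "Mexico": (30, 39),
             "Colombia": (10, 19), "USA": (10, 19)}
  print(p_bounds(buckets, "Mexico"))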

Ijon added a comment.Aug 17 2019, 7:58 AM

Thanks for making progress on this!

I'm afraid a boolean on the 100+ bucket is next to useless, though: practically all wikis have at least that one person hacking away at a 100+ level. It's important for us to be able to determine whether there are, say, at least 10 editors with 100+ edits in a country. Being a more robust number than the easily fluctuating 5+ count, it is also the best measure for a local group/affiliate to assess its "community size" (as it moves, at some point, from bucket to bucket).

I don't want to re-hash the conversation we've had several times by now about how we never committed to not revealing a user's country, and how, while ideally it would be impossible to determine, mitigating the risk of it being determined should not *obviously* be prioritized over the very immediate benefit to our communities.

Instead, if it would help get this feature out the door at long last, perhaps we can blacklist a few countries where we have concrete reasons to fear the risk of identifying a contributor's country (e.g. some central Asian dictatorships), and provide bucketed results for the rest?

Thanks very much. I also would much prefer not to rehash the conversations we've already had. We're ready to release this data, and the work we're doing to find the best way to release it is just preparation until the privacy framework is ready. We just want to be able to justify our decisions. Bucketing and blacklisting seem like they fit into the privacy framework drafts I've seen so far, and I'll take some time during paternity leave to fill out that rationale if needed. So, here's where we are so far:

  • bucketing and blacklisting seem like the best approach we've come up with so far (others and I are looking into alternative approaches)
  • we will release 5+ counts and 100+ counts per wiki per country
  • we will do all of this as soon as the privacy framework is available

Unless I'm very much mistaken, the described system will make it possible to determine the country of specific editors, in some situations.

(Is one not supposed to mention specific possible privacy-related attacks on public tasks? If so, how would one bring those up?)

@Yair_rand, that's what we're trying to prevent, yes. The value of the data is great, and the risk will be minimized as much as possible. As Asaf points out above, we have had this conversation for a very long time. Our legal and security teams have thought about the potential danger of this dataset and signed off on us publishing it. Nevertheless, I personally would like to protect this dataset as much as possible and that's why I'm looking into how to make it harder to determine the country of specific editors. Does that make sense? Do you have additional concerns?

I could use some collaboration on the list of countries to blacklist. The paper that Nuria mentions includes: China, Cuba, Egypt, Indonesia, Iran, Kazakhstan, Pakistan, Russia, Saudi Arabia, South Korea, Syria, Thailand, Turkey, Uzbekistan, Vietnam. But the reasons for censorship are quite different in each country, and not all of them obviously need to be blacklisted. I tried to guess at a first draft of the blacklist, but honestly I'm not sure. The governments in not just those countries but those regions seem pretty troubling to me, and I don't have enough knowledge to know when something goes from troubling to dangerous.

Change 530878 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] [WIP] draft of outputting druid geoeditor queries

https://gerrit.wikimedia.org/r/530878

My latest patchset on that change above is just a draft implementing some of the thoughts so far. It implements the following, so that we have a place to start from when we finalize our thoughts on privacy here (a rough illustrative sketch of these rules follows the list below):

  • respects a blacklist of countries (actual countries belonging in the list TBD, perhaps Trust-and-Safety have a list of countries we can start our blacklist from?)
  • only reports 5 to 99 and 100 or more activity levels
  • only reports numbers for Wikipedia projects or central projects like commons
  • only considers wikis with 3 or more active editors overall (not per country)
  • buckets output to obscure exact numbers and add uncertainty in a probabilistic attack

See the insert script for more details
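For readers who don't want to dig into the patch, here is a rough, hypothetical Python sketch of the rules listed above. This is not the actual refinery job; the field names, bucket boundaries, project list, and placeholder blacklist are all assumptions for illustration only:

  # Hypothetical sketch of the filtering/bucketing rules; the real logic lives
  # in the refinery insert script referenced above. All names here are made up.

  COUNTRY_BLACKLIST = {"placeholder-country-1", "placeholder-country-2"}  # actual list TBD
  ALLOWED_FAMILIES = {"wikipedia", "commons"}  # Wikipedias plus central projects

  def bucket(count):
      """Obscure an exact editor count as a coarse range label (e.g. 30-40)."""
      if count < 5:
          return None                      # too small to report at all
      if count < 10:
          return "5-10"
      low = (count // 10) * 10
      return f"{low}-{low + 10}"

  def publish(rows):
      """rows: dicts with wiki, project_family, country, activity_level
      ('5 to 99' or '100 or more'), and editors (an exact count)."""
      # Only consider wikis with 3 or more active editors overall (not per country).
      totals = {}
      for r in rows:
          totals[r["wiki"]] = totals.get(r["wiki"], 0) + r["editors"]
      out = []
      for r in rows:
          if r["country"] in COUNTRY_BLACKLIST:
              continue
          if r["project_family"] not in ALLOWED_FAMILIES:
              continue
          if totals[r["wiki"]] < 3:
              continue
          bucketed = bucket(r["editors"])
          if bucketed is not None:
              out.append({**r, "editors": bucketed})
      return out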

Unless I'm very much mistaken, the described system will make it possible to determine the country of specific editors, in some situations.

Yes, to be clear, that is a risk we can mitigate, not eliminate.

(Is one not supposed to mention specific possible privacy-related attacks on public tasks? If so, how would one bring those up?)

On the contrary, we welcome those.

@Milimetric I think for our first release we can remove all countries mentioned in the surveillance report; we can work on adding them later if we think that would be a safe measure.

Several examples of ways to find out a user's country: (Note: A lot of this depends on how frequently this report will be published.)

  • Data is published, some time passes, during which precisely one new editor has shown up who wasn't there previously. The counter for one country goes up a step from, say, 20-30 to 30-40. This user's country has been revealed.
  • A very small wiki that normally wouldn't have the data published has exactly one editor. A malicious user trying to determine that editor's country creates nine "active" accounts in a particular country. If the counter shows 10+, the country has been revealed. The malicious user can repeat, and also test many countries at once to narrow it down.
  • Combination of the above tactics: A malicious user tries to keep several countries' counters at a bucket boundary, so that either a new user will cause the count to increment, or a departing user will cause it to decrement. If all but one of the users who joined in the same period can be ruled out by time-zone data or similar, the remaining user's country has been revealed.
  • There are ten active editors on a project. One of the countries shows 10-20. The country of all of the editors has been revealed.
  • There are 16 active editors on a project. A malicious user creates four active accounts in a country. The counter shows 20-30. The country of all of the editors has been revealed.
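To make the first (differencing) scenario concrete, here is a tiny hypothetical Python sketch; the country names and bucket labels are made up:

  # Hypothetical illustration of the differencing attack described above:
  # if exactly one editor joins between two published reports, comparing the
  # reports pinpoints that editor's country.

  report_previous = {"Ghana": "20-30", "Nigeria": "10-20"}
  report_current  = {"Ghana": "30-40", "Nigeria": "10-20"}  # one bucket moved up

  changed = [c for c in report_current if report_current[c] != report_previous.get(c)]
  if len(changed) == 1:
      print(f"The single new editor is almost certainly in {changed[0]}")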

@Yair_rand your examples show ingenuity, yet they also seem somewhat contrived. Suppose some malicious, geeky, and rather obsessed user went to such lengths to 'exploit a weakness' in the privacy protection, and they learned the country of a Wikimedian who doesn't want to reveal it: how much damage could be done? Say China, with its enormous resources, finds out that 16 active editors on a small wiki all edit from Taiwan. How much would they have learned then? Taiwan has a population of 23+ million. That geeky detective could probably also learn from text analysis (English isn't spoken the same in different countries), from analysis of edit times (where waking hours are a proxy for time zone), and from edits being spaced wider apart in countries with low bandwidth. I admit these are contrived examples as well, only effective in combination, and only in the hands of a very geeky and obsessed malicious user with infinite resources. It's probably easier for such a geek to infiltrate our security by social engineering, placing a mole, and what have you.

Nuria added a comment.Aug 21 2019, 5:06 PM

As I mentioned in my prior post, I do not think it is such a great idea to release the 5+ and 100+ counts separately. Also, to correct what @Milimetric says above: per @Erik_Zachte's definition, we consider an active Wikipedia to be one with 3+ editors with 5+ edits per month.

mforns claimed this task.Aug 23 2019, 3:51 PM

Ideas to lower the number of potential problems:

  • Make the numbers less precise, somehow? Something like making each unique editor have a 10% chance of being skipped in the count and 10% chance of being counted twice, or some such. (This would need to persist per editor between reports, I think. Maybe even do the same thing to the changes in numbers when someone leaves.)
  • Don't generate a new report when only one editor has joined or left since the previous report. Perhaps even per-country, don't update unless there's a change of at least several counted users since the last report.

(Probably stupid question: Aren't there people who actually specialize in this kind of thing, with established methods for preventing leaking data on particular individuals?)
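For what it's worth, the persistent per-editor noise idea above could look roughly like the following hypothetical sketch; the salt, rates, and hashing scheme are assumptions for illustration, not an actual proposal:

  # Hypothetical sketch of the persistent per-editor noise idea: each editor is
  # deterministically skipped or double-counted based on a keyed hash, so the
  # decision stays stable across reports.
  import hashlib

  SECRET_SALT = "replace-with-a-private-salt"   # assumed to be kept server-side

  def noisy_weight(editor_id, skip_rate=0.1, double_rate=0.1):
      digest = hashlib.sha256(f"{SECRET_SALT}:{editor_id}".encode()).hexdigest()
      u = int(digest, 16) / 16 ** 64            # stable pseudo-uniform value in [0, 1)
      if u < skip_rate:
          return 0                              # skipped in every report
      if u < skip_rate + double_rate:
          return 2                              # counted twice in every report
      return 1

  def noisy_count(editor_ids):
      return sum(noisy_weight(e) for e in editor_ids)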

Nuria added a comment.Mon, Sep 9, 2:58 PM

(Probably stupid question: Aren't there people who actually specialize in this kind of thing, with established methods for preventing leaking data on particular individuals?)

Of course, there are many statistical methods for masking data. Now, the base issue is that the requesters of this data (@Asaf and others) think that leaking data on particular users is not an issue, and you are contending the opposite. This request is for data at a very granular scale.