Make aggregate data on editors per country per wiki publicly available
Open, Normal · Public · 0 Story Points

Tokens
"Love" tokens awarded by Pamputt, Amire80, Neil_P._Quinn_WMF, MelodyKramer, Jane023, and leila.
Assigned To
Authored By
Nuria, Mar 30 2016

Description

The requestor of this data is Asaf on behalf of Emerging Wikimedia Communities

  • bucketed active and very active counts per country, plus per region, plus global north vs global south -- all currently generated by the 'geowiki' code in the private area.
  • archive of these by month

Ability to link to reports permanently so "current (bucketed) active editors for Ghana" can be made a permanent link that would lead to the latest numbers.

The bucket size requested is 10 for editing data. I think the Analytics team did some work in the past that proved that this data size is too coarse to be released to the public. Asaf mentioned that Legal had signed off on this data request, but while Legal can determine whether it infringes the privacy policy, it is a lot harder for them to know the security risk that the release of a new dataset poses, as attacks depend on what other datasets are available.
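
As a rough illustration of what a "bucket size of 10" could mean, here is a minimal sketch. It assumes that a count is reported as the 10-wide range it falls into; the task does not specify the rounding rule, and the function name `bucket_count` and the sample counts are hypothetical.

```python
def bucket_count(count: int, bucket_size: int = 10) -> tuple[int, int]:
    """Return the inclusive (low, high) bounds of the bucket containing `count`.

    Assumption: buckets are [0, 9], [10, 19], ... for bucket_size=10.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    low = (count // bucket_size) * bucket_size
    return low, low + bucket_size - 1


# Hypothetical active-editor counts per country (not real data).
sample = {"Ghana": 37, "Austria": 412, "Fiji": 4}
for country, count in sample.items():
    low, high = bucket_count(count)
    print(f"{country}: {low}-{high} active editors")
```

With this rule, a reported bucket hides the exact count but not its order of magnitude.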

A link to the WIP data access guidelines being worked on now: https://office.wikimedia.org/wiki/User:Csteipp/Data_access_guidelines

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 30 2016, 7:04 PM
Nuria updated the task description. Mar 30 2016, 7:05 PM
Nuria added a subscriber: QChris.
Milimetric triaged this task as Normal priority. Apr 4 2016, 4:40 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.
Nuria added a comment. (Edited) Apr 4 2016, 4:44 PM

The bucket size requested is 10 for editing data. I think the Analytics team did some work in the past that proved that this data size is too coarse to be released to the public.
Asaf mentioned that Legal had signed off on this data request, but while Legal can determine whether it infringes the privacy policy, it is a lot harder for them
to know the security risk that the release of a new dataset poses, as attacks depend on what other datasets are available.

We would like to know: what is this data used for, and what questions does it need to answer? Let's focus on the requirements rather than on what the end product is.

Milimetric updated the task description. Apr 4 2016, 10:10 PM
Nuria added a comment. Apr 12 2016, 5:08 PM

Notes from meeting on 2016/04/12. Attending: Dan, Asaf, Neil and Nuria

We focused on use cases for this data, and these are the main ones:

  1. The Emerging Communities team needs to know the number of editors per country as part of its criteria for funding one project or another; this data is used internally. The justification for a grant depends on which (geographic) community the grant serves.
  2. The wiki community should have access to the data used to decide on the granting of funds, in order to verify the criteria. The data should be public, but it does not necessarily have to have the same shape as the internal data.
  3. The grantees themselves, and non-grantee chapters, need to track their community sizes (chapters are geographic).

More context:

Funds are granted on a per-country basis, thus it is crucial to have data on a per-country and per-language basis. For example, they might fund projects in Brazil geared towards increasing edits on Brazilian Wikipedia. In Brazil, 30% of edits might be done on English Wikipedia, but the Emerging Communities team needs to know data pertaining to Brazilian Wikipedia, so both geography and language are important when delivering editor numbers. Another example: the Austrian chapter might need to know the number of German editors based in Austria to assess the progress of initiatives geared towards increasing this number. The growth of German editors in Austria and the growth of German editors overall are two different pieces of information we care about.

Due to privacy concerns we talked about what could be good ways to release this data externally:

  • Can we come up with a good proxy to measure editors in a country? Like "edits" per country? Or rates of growth of editors? Releasing rates of growth is problematic if the rates do not change at all, as you then have no estimate of the absolute number. Also, if the variability of the rates of growth is huge, the data is of little value, as it might represent too small a number of editors. Still, it could be that a dataset with rates of growth provides quite a lot of useful info.
  • Could we ask editors to opt in when it comes to sharing their location?
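
To make the growth-rate idea more concrete, here is a minimal sketch with made-up numbers. It shows that two series with very different absolute sizes can yield the same month-over-month rates, and that a very small series yields large swings. The function name `growth_rates` and all counts are hypothetical.

```python
def growth_rates(counts: list[int]) -> list[float]:
    """Month-over-month relative growth, e.g. 0.10 means +10%.

    Assumes every count except the last is non-zero.
    """
    return [(cur - prev) / prev for prev, cur in zip(counts, counts[1:])]


# Hypothetical monthly active-editor counts (not real data).
large_country = [10000, 11000, 12100]
small_country = [100, 110, 121]
tiny_country = [3, 5, 2]

print(growth_rates(large_country))  # same rates as small_country
print(growth_rates(small_country))
print(growth_rates(tiny_country))   # large swings from a handful of editors
```

The rates alone do not distinguish the large series from the small one, which is why they say little about absolute community size.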
aude added a subscriber: aude. Jul 23 2016, 9:44 PM
Ijon added a comment. Jan 10 2017, 5:11 AM

I believe we discussed this verbally a few months ago, but for the record, let me record answers to Nuria's questions here as well:

  1. No measure I can think of is as robust a measure of the kind of impact we are looking for as the active editor count. "Edits per country", you will find, is a statistic we already have in public, and have had for a decade now. The rate of growth is an interesting metric too, but is not as useful to either determine the situation at a given point in time or to compare two distant points in time (i.e. how are we doing today versus 2013 in country X?), in my opinion. Again, giving our volunteers access to active editor counts (even bucketed, as long as the granularity is enough to notice less than gargantuan change) is the single most helpful analytics tool we can give affiliates and program leaders. It ought to be no less of a priority for us than to be careful about users' privacy.
  2. That would make the metric far, far weaker, but I suppose we could. If we do that and make the wiki invite people, not just upon registration but (at least once) for veteran editors too, to disclose their location, I suppose it could generate a useful metric in 2-3 years. I invite you to explore that possibility with the community and with whoever would own such a feature, product-wise, but I don't consider it a viable solution for our immediate need.

Thank you, @Milimetric, for moving it to Q4. This gives me hope we will finally accomplish this.

leila awarded a token. Jan 24 2017, 7:05 PM
Jane023 added a subscriber: Jane023.

Additional notes and use-cases, from a discussion in IRC:

  • Journalists regularly seek this kind of information, and want to know general statistics for their articles, e.g. the rough numbers in https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerLanguageBreakdown.htm#Portuguese are perfect for them (but that page was last updated in 2013)
  • User groups and documentation writers seek this kind of information, whilst trying to define the necessary background information to understand the needs specific to their communities.
    • knowing the current breakdown and the relative populations and internet penetration can give a good idea of the potential of each country, because some have very different challenges from the others.
  • It would be very useful to have an easy way to see any trends over time, so that the impact of things like outreach programs could perhaps be seen. (Either as a table like https://stats.wikimedia.org/wikimedia/squids/SquidReportPageEditsPerCountryTrends.htm or some sort of graph/visualization)
Nuria moved this task from Wikistats Production to Dashiki on the Analytics board. Apr 24 2017, 2:57 PM
Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board. Apr 26 2017, 3:26 PM
Milimetric moved this task from Backlog (Later) to Dashiki on the Analytics board.
Nuria moved this task from Dashiki to Wikistats Production on the Analytics board. Jul 17 2017, 4:17 PM
Ijon added a comment. Jul 23 2017, 9:10 AM

Can someone offer a comment on why it was decided to move it from Q4 of last year to Q3 of this year? As a stakeholder, I wish I were included in the conversation that resulted in deprioritizing this.

Nuria added a comment. (Edited) Jul 23 2017, 3:24 PM

@Ijon: we do not have enough resources to tackle this in this quarter or the next, sorry about that. We think this project can benefit from our recent work on the edit data lake, but we cannot tackle it any sooner.

The items we need to close before tackling this one are the backend for Wikistats 2.0 and improvements to the edit data lake such as: https://phabricator.wikimedia.org/T161147

Ijon added a comment. Sep 27 2017, 4:23 PM

Thank you, @Nuria, for the clear update. Do we have a solid solution for the concerns your team has had about leaking a user's country due to insufficient bucketing? I'd like some assurance that come Q3, the feature can finally be implemented, rather than someone discovering then that the privacy concern is still there, and us descending into yet another spiral on that question.
Forgive me if you hear a bitter overtone in the above question, but, really, we've been waiting for this data for over four years now, and it's been more than two years since Legal declared bucketing sufficient, so I am wary of additional delays and am seeking to ensure only unavoidable delays continue to delay us.

@Ijon I think that after our implementation of editing metrics, which we hope to launch at the beginning of Q2 as part of the new Wikistats 2.0, we are going to have more options as to how to compute this data with a level of fuzziness that is still helpful (cc @Milimetric). Give us some space to get the alpha for the new backend ready, and then we will be ready to experiment with this on the new datasets in the data store. Thus far our raw data for editing analytics does not have geolocation (cc @JAllemandou to keep me honest).

Ijon added a comment. Sep 27 2017, 4:53 PM

Thank you for the quick response, @Nuria. It sounds as though there is a worthwhile discussion to be had on whether and how we can achieve the statistics requested in this ticket. Unless I misunderstood what you just said, it sounds as though even in Q3, the geo location would not, in fact, be part of the available data, unless Someone Does Something. So I am asking that Someone Do Something. If you need me to escalate this yet again before Someone can Do Something, please let me know and I shall do so.

It looks like this is now getting priority, so I'd like to get involved and set up a hopefully useful approach. Here's what I'm thinking:

  • Finish the subtask that Nuria just created, which is necessary for us to be able to work with this data. The legacy code has been a major part of the slowdown here: nobody knows it intimately and it's been handed over three times since it was originally written. Moving this to our standard tools and making it a first-class citizen will win us better monitoring and reliability. It will also give a couple of us the context we need to take the next step.
  • For the next step, I propose we don't argue too much about the privacy implications, but implement the simple bucketing that is suggested here, while in the back of our minds thinking about the issue of privacy
  • Once we have the simple bucketing, let's set up a meeting with legal and @Ijon to present it and show any possible attacks on this data that we think of. Hopefully there are none that we can think of, but if there are, this approach should de-mystify the problem for everyone and allow us to talk about it objectively.

Does that sound fair?

Ijon added a comment. Oct 2 2017, 6:11 PM

Thank you, @Milimetric. That would be great.

Milimetric edited projects, added Analytics-Kanban; removed Analytics.
Milimetric set the point value for this task to 0.
Neil_P._Quinn_WMF renamed this task from Making geowiki data public to Make aggregate data on editors per country per wiki publicly available. Mar 5 2018, 7:54 AM
Nuria edited projects, added Analytics; removed Analytics-Kanban. Mar 8 2018, 6:38 PM
Nuria moved this task from Wikistats Production to Incoming on the Analytics board.
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria moved this task from Next Up to Parent Tasks on the Analytics-Kanban board. Mar 21 2018, 5:40 AM
Neil_P._Quinn_WMF rescinded a token.
Neil_P._Quinn_WMF awarded a token.

As T188859 was merged into this task several months ago, is there any progress on this topic? I think it would be really interesting to have this for the next Wikistats 2.

Nuria added a comment. Sep 30 2018, 4:04 AM

We are now in the process of clarifying with the Legal team where the privacy threshold for this data lies and whether it is OK, in their view, to disclose the geolocation of an editor, which is likely a contentious topic in our community.

Is there a particular reason the draft data access guidelines linked above need to be specifically confidential? If not, could they be posted to a public wiki?

Tbayer removed a subscriber: HaeB. Feb 13 2019, 12:53 AM

Somewhat related: T207171: Have a way to show the most popular pages per country.

And I join the people who are asking to make this information available.

Information about countries where the numbers are so small that they can generate privacy problems should be available, too, but it can be restricted to trusted people. ("Trusted" doesn't necessarily mean "WMF staff"; it can be anyone with good intentions, understanding of privacy concerns, and proof of need-to-know.)

Nuria added a comment. Mar 22 2019, 3:00 PM

@Yair_rand The guidelines are not confidential; they apply to a set of systems that only people working at WMF (and research collaborators) have access to, and as such they are posted to the WMF internal wiki.

@Nuria The WMF internal wiki isn't publicly accessible, making its contents unavailable to the community. Unless there's particular reason the guidelines need to be withheld, I would think that reasonable transparency would require that they not be hidden inside the WMF's private wiki.

Nuria added a comment. Mar 25 2019, 2:56 PM

@Yair_rand the public guidelines as to data retention are public in the privacy policy: https://foundation.wikimedia.org/wiki/Privacy_policy

Nuria added a comment. (Edited) Mar 25 2019, 11:10 PM

FYI @Amire80 https://phabricator.wikimedia.org/T207171 has no privacy issues for the most part as the data will be shown only for the highest buckets.

@Yair_rand the public guidelines as to data retention are public in the privacy policy: https://foundation.wikimedia.org/wiki/Privacy_policy

Yes, but the guidelines @Yair_rand mentioned are significantly more detailed about the actual practices people with access to data should follow, and people outside the Foundation have a legitimate interest in knowing what they are, whether to assess the trustworthiness of our data analysis practices or (for volunteers with access to private data) actually follow them.

I've just filed T219542 to work further on this.

Milimetric moved this task from Paused to Done on the Analytics-Kanban board. Apr 2 2019, 4:06 PM
Milimetric moved this task from Done to Paused on the Analytics-Kanban board.
Nuria added a subscriber: Asaf. Fri, Jul 5, 9:25 PM

Some recent work on this.
@JFishback_WMF is working with Legal on a risk assessment framework that we can apply to data releases such as this one. I took a look at the data harvested after the major refactor of the data harvesting jobs. As far as I can see from the daily edit tallies of eswiki and arwiki, about half of daily edits (aggregated across all countries) come from anonymous editors. Once we aggregate the data monthly, this ratio changes dramatically: there are about 5% authenticated editors and 95% anonymous editors in the monthly tally (again, aggregated per country). Pinging @Asaf: in the per-country releases requested, are you also thinking about anonymous editors?

Note: I am talking here about "editors", not "edits"; the number of "edits" by authenticated users is much greater.

Ijon added a comment. Fri, Jul 5, 10:05 PM

Yes, anonymous editors matter too. Though I am mostly interested in the old "active" (>5/month) and "very active" (>100/month) definitions, and have next to no interest in the >1/month group. Presumably a large section of the anonymous editors make only a single edit, or fewer than 5 anyhow.
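
For reference, the two activity levels mentioned here can be sketched as a simple tally over monthly edit counts. This is only an illustration: the thresholds are read as "5 or more" and "100 or more" edits per month (the thread uses both ">5" and "5+"), very active editors are assumed to also count as active, and the names and sample data are hypothetical.

```python
ACTIVE_MIN_EDITS = 5         # assumption: "active" read as 5 or more edits in the month
VERY_ACTIVE_MIN_EDITS = 100  # assumption: "very active" read as 100 or more


def tally_activity(edits_per_editor: dict[str, int]) -> dict[str, int]:
    """Count editors at or above each monthly threshold.

    Assumption: very active editors are also counted as active.
    """
    counts = list(edits_per_editor.values())
    active = sum(1 for n in counts if n >= ACTIVE_MIN_EDITS)
    very_active = sum(1 for n in counts if n >= VERY_ACTIVE_MIN_EDITS)
    return {"active": active, "very_active": very_active}


# Hypothetical monthly edit counts for one wiki and one country (not real data).
month = {"editor_a": 1, "editor_b": 7, "editor_c": 150, "editor_d": 4}
print(tally_activity(month))
```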

Nuria added a comment. (Edited) Sat, Jul 6, 4:37 AM

@Asaf : that is correct, only about 5% of the anonymous "entities" have more than 5 edits.

Nuria added a comment. (Edited) Wed, Jul 10, 11:33 PM

The bucket size requested is 10 for editing data.

This might be counterintuitive, but the bucket size of the country has little to do with the privacy associated with being able to geolocate an editor.

Example:
If there are 10,000 edits for eswiki across 10 countries, and one of those countries, say Argentina, has 25 people in the 100+ edit bucket, you can easily narrow down who those 25 people are by looking at our public edit data (every single edit is public) and checking how many eswiki editors made 100+ edits that month. If there are 50 editors with 100+ edits for eswiki that month and we have reported buckets like (Argentina, 25), (Brazil, 15), (Other, 10), the probability that any one of those editors, selected at random, is in Argentina is 25/50 = 0.5, which is pretty high.

So, in this case, privacy is a function of the total number of editors in the 100+ bucket plus the distribution of those editors across countries. A distribution of (Argentina, 40), (Other, 10) in the example above would tell you that 4/5 of editors with 100+ edits this month were located in Argentina, so while the country bucket is larger in the second example, the probability of localizing an editor to the actual country has not decreased but increased.
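
The arithmetic in the two examples can be checked with a small sketch. It simply divides each reported country count by the total number of editors in the bucket (which anyone can derive from public edit data); the function name `country_probabilities` is hypothetical.

```python
def country_probabilities(reported: dict[str, int]) -> dict[str, float]:
    """Probability that a randomly chosen editor in the bucket is in each country."""
    total = sum(reported.values())
    if total == 0:
        raise ValueError("no editors in this bucket")
    return {country: count / total for country, count in reported.items()}


# The two distributions from the example: 50 eswiki editors with 100+ edits.
first = country_probabilities({"Argentina": 25, "Brazil": 15, "Other": 10})
second = country_probabilities({"Argentina": 40, "Other": 10})

print(first["Argentina"])   # 0.5
print(second["Argentina"])  # 0.8
```

The larger country count in the second distribution raises, rather than lowers, the chance of placing an editor correctly.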

An idea would be to report editors (without buckets, just the ones with 5+ edits on content pages) only where the probability of correctly identifying an editor's country is below a threshold. If Legal thinks that there is no issue with precisely assigning editors to a country, this distinction does not matter, but regardless, the requested "bucket size of 10" does not add any "privacy budget" as far as I can see.
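
A minimal sketch of that idea, under stated assumptions: a country's count is reported only when its share of the wiki's qualifying editors is below a chosen threshold. The threshold value (0.4 here), the function name `releasable_counts`, and the sample counts are all hypothetical, and this is not a complete privacy analysis.

```python
def releasable_counts(counts: dict[str, int], max_probability: float) -> dict[str, int]:
    """Keep only countries whose share of the wiki's editors is below the threshold."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        country: count
        for country, count in counts.items()
        if count / total < max_probability
    }


# Hypothetical counts of editors with 5+ edits on one wiki (not real data).
per_country = {"Argentina": 25, "Brazil": 15, "Other": 10}
print(releasable_counts(per_country, max_probability=0.4))
```

In this example Argentina (share 0.5) would be withheld, while Brazil (0.3) and Other (0.2) would be reported.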

Nuria added a comment. Thu, Jul 11, 4:09 AM

Proposal:

  • let's release data for editors with 5+ edits per country (regardless of size of bucket) per wiki; let's not release the 5+ and 100+ buckets separately
  • some countries in which surveillance is prevalent will be blacklisted and no data will be released. See: https://dash.harvard.edu/bitstream/handle/1/32741922/Wikipedia_Censorship_final.pdf
  • let's not release data for countries whose population is below a threshold, regardless of size of bucket.
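
The three rules in this proposal could be sketched as a filter over per-wiki, per-country rows. Everything concrete below is an assumption for illustration: the `Row` type, the excluded-country list, the population figures, and the population threshold are hypothetical and not taken from any real dataset or decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    wiki: str
    country: str
    editors_5plus: int  # editors with 5+ edits that month (single bucket)


def filter_release(
    rows: list[Row],
    excluded_countries: set[str],
    population: dict[str, int],
    min_population: int,
) -> list[Row]:
    """Drop rows for excluded countries and for countries below the population threshold.

    Countries with unknown population are dropped as well (a conservative assumption).
    """
    return [
        row
        for row in rows
        if row.country not in excluded_countries
        and population.get(row.country, 0) >= min_population
    ]


# Hypothetical inputs (placeholder country names, not real data).
rows = [
    Row("eswiki", "CountryA", 120),
    Row("eswiki", "CountryB", 8),
    Row("eswiki", "CountryC", 40),
]
released = filter_release(
    rows,
    excluded_countries={"CountryC"},
    population={"CountryA": 45_000_000, "CountryB": 90_000, "CountryC": 30_000_000},
    min_population=1_000_000,
)
print(released)
```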
Nuria added a comment. Thu, Jul 11, 6:09 PM

Per @ezachte's criteria for "live" Wikipedias, this data should not include dead/un-editable Wikipedias. Erik's words in this regard are below:

Leila and I looked into how many wikis have at least a modicum of activity.
Even when we set the threshold for activity quite low, at 3 or more editors doing 5 or more edits a month, still the majority of our wikis don't qualify.
In July 2016 only 344 out of 865 wikis qualified.
For the Wikipedias specifically, about 160 meet that very low target, which has been stable since 2008.
Other projects: https://stats.wikimedia.org/EN/ProjectTrendsActiveWikis.html
BTW we feel the extreme lower limit of 1 or more active editors is, well, too extreme.
Wikis only start to come to life for real when editors collaborate or at least potentially vet each other's work.
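
The "live wiki" criterion quoted above (3 or more editors doing 5 or more edits a month) can be written as a short check. This is a sketch only: the function name `is_live_wiki` and the sample data are hypothetical, and the real computation may differ in what it counts as an edit.

```python
def is_live_wiki(
    edits_per_editor: dict[str, int],
    min_editors: int = 3,
    min_edits: int = 5,
) -> bool:
    """True if at least `min_editors` editors made `min_edits` or more edits in the month."""
    qualifying = sum(1 for n in edits_per_editor.values() if n >= min_edits)
    return qualifying >= min_editors


# Hypothetical monthly edit counts for two wikis (not real data).
quiet_wiki = {"editor_a": 12, "editor_b": 2, "editor_c": 1}
busy_wiki = {"editor_a": 12, "editor_b": 9, "editor_c": 5, "editor_d": 1}

print(is_live_wiki(quiet_wiki))  # False: only one editor reaches 5 edits
print(is_live_wiki(busy_wiki))   # True: three editors reach 5 edits
```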