
Privacy Policy Review for Global South Wikidata edits and active editors datasets
Closed, Resolved (Public)

Description

For WMDE's Global South Contributions to Wikidata work, we need a review of the data with respect to our privacy policy before our Wikidata Analytics PM @Manuel can share the data with third parties. Please see the ticket description for Global South Contributions to Wikidata.

The data are .csv files that are found on our stat1005 Analytics Client in /home/goransm/Analytics/adhoc/WD_GlobalSouth_202109/_analytics/.

The code used to produce the data is here.

We have used the Analytics/Data Lake/Edits/Geoeditors table from the WMF Data Lake to produce the datasets.

The descriptions of the datasets are found here and here.

We can be reached here or via e-mail:

  • goran.milovanovic_ext@wikimedia.de (Goran S. Milovanović, Data Scientist for Wikidata, WMDE)
  • manuel.merz@wikimedia.de (Manuel Merz, Wikidata Analytics PM, WMDE).

Thanks. @Manuel is hoping that this data can be shared in relation to the Reimagining Wikidata from the margins process by the end of this week.

Event Timeline

We are forwarding this request to Privacy Engineering but please don't hesitate to loop us back in if you have questions.

JFishback_WMF triaged this task as Medium priority.
JFishback_WMF moved this task from Incoming to Backlog on the Privacy Engineering board.

Hi @GoranSMilovanovic and @Manuel! My name is Hal — I'm a privacy engineer on the Privacy Engineering team. There is some precedent for releasing data of this variety, but I still have a couple of questions:

  • Which nations encompass the "Global South"? Will they be aggregated together in their entirety (e.g. "In July 2021, W editors in the Global North made X edits and Y editors in the Global South made Z edits") or split up by country?
    • This is a key question because one of the major mitigations within the publicly-available geoeditors dataset is to suppress the release of data from several countries (some of which may be a part of the Global South) that are potentially dangerous for journalists or internet freedom. In the case that you want to release country-month-editor-edit data, I would recommend the removal of these countries.
  • Do you want this data to be released as exact counts? One other major mitigation from geoeditors_public is to release ranges rather than exact numbers (e.g. 14 edits in Algeria in January —> 10-20 edits in Algeria in January), and to not release data below a certain threshold. This gets less important the more aggregated data is, but is particularly important when small edit counts/editor communities could be released.
  • Who exactly are the "third parties" mentioned above that this data will be released to?

@Htriedman

I think @Manuel as a Wikidata Analytics PM is the right person to ask.
He is currently on leave, but I expect he will respond as soon as he gets back.

Thank you and stay in touch.

Hi @Htriedman! Thank you for your answer!

Who exactly are the "third parties" mentioned above that this data will be released to?

This information was requested by the organizers of Reimagining Wikidata from the margins.

Do you want this data to be released as exact counts?

Exact counts are not needed.

Which nations encompass the "Global South"? Will they be aggregated together or split up by country?

There is no fixed definition for Global South. Aggregated data will be split up by country.

To be on the safe side, we will make sure that we remove all information from countries on the Country Protection List when showing information outside of the NDA group.

Just for my understanding: Assuming that individuals were well protected by aggregation alone, would you still recommend the removal of countries from the protection list?

Hi @Manuel — so sorry for the late response; my phabricator account was misconfigured and I didn't get a notification email. Thanks so much for getting back to me with all this information.

Unfortunately, even with aggregation that partially obscures data, the privacy team still would mandate the removal of countries from the protection list. Data that is aggregated/anonymized in this way is vulnerable to re-identification attacks, and it is hard to meaningfully define "well protected". More modern methods to release this data are in the works now, but still at a proof-of-concept/beta stage.

Regardless, I'll have this privacy review done by EOD today, and be sure to share it with you. Thanks again.

@Htriedman

Maybe I am missing something:

... even with aggregation that partially obscures data, the privacy team still would mandate the removal of countries from the protection list.

but I think that @Manuel has already explained that we would do that anyway in T291186#7374112:

To be on the safe side, we will make sure that we remove all information from countries on the Country Protection List when showing information outside of the NDA group.

Please clarify, and thank you very much for your review.

Hi @GoranSMilovanovic, apologies for the confusion. I understand that you intend to remove all information from countries on the Country Protection List. I was trying to respond to @Manuel's follow-up question, just for the sake of learning:

Just for my understanding: Assuming that individuals were well protected by aggregation alone, would you still recommend the removal of countries from the protection list?

As a general rule, sensitive raw data that is aggregated into buckets (like in this instance) is susceptible to re-identification/linkage attacks, regardless of the size of the bucket. Because that is the case, it is hard to formally define what "well protected by aggregation" means for an editor in one of the countries on the protection list.
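
One way to picture why bucket size alone does not settle the question is a differencing attack. The sketch below is only an illustration with made-up numbers; the function name and the scenario (an observer who already knows every other editor's contribution to an exactly reported total) are assumptions, not something taken from the datasets in this task.

```python
def recover_target(published_total, known_contributions):
    """Subtract what the observer already knows from an exact published total.

    Hypothetical scenario: the observer knows the edit counts of every
    editor in the bucket except one, so the remainder is that editor's count.
    """
    return published_total - sum(known_contributions)


# Made-up example: an exact total of 1000 edits is published for one
# country-month, and the observer knows three of the four editors' counts.
remaining = recover_target(1000, [400, 350, 238])
print(remaining)  # 12 edits attributable to the one unknown editor
```

The total here is large, yet the remaining editor's activity is still recovered exactly, which is why a large bucket by itself is not a guarantee.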

@Htriedman

OK, so let me check that I understand this correctly: we will be able to share the files as they are now in /home/goransm/Analytics/adhoc/WD_GlobalSouth_202109/_analytics/ on our stat1005 Analytics Client once the countries on the Country Protection List have been removed, or is there something else that I need to do with the datasets for us to comply with our Privacy Policy?

@GoranSMilovanovic

Just took a look. Those files are alright to share with people who have signed an NDA with the Foundation. They are not OK to share publicly, since they contain exact counts of editors and edits rather than bucketed ranges of counts (11-20 editors instead of 14, 100-200 edits instead of 151, etc.).

Following along with the schema established when releasing geoeditors/public, the data released publicly might look like this:

country code  month    editors  editors (ns0)  edits    edits (ns0)
FR            01-2021  51-60    41-50          401-500  401-500
ES            06-2021  31-40    31-40          201-300  201-300
AR            03-2021  1-10     1-10           5-100    5-100

Rows with edit counts under 5 should also not be included in the released tabular data, as they are easily reidentifiable.
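
A minimal Python sketch of these mitigations, for illustration only. The column names, the bucket widths (10 for editors, 100 for edits, inferred from the example rows above), and the placeholder country codes are all assumptions; the actual Country Protection List is not reproduced here, and the ns0 columns would be handled the same way as their non-ns0 counterparts.

```python
# Hypothetical placeholder codes: the real Country Protection List is not reproduced here.
PROTECTED_COUNTRIES = {"XX", "YY"}
MIN_EDITS = 5        # rows with fewer edits are dropped
EDITOR_BUCKET = 10   # assumed bucket width for editor counts
EDIT_BUCKET = 100    # assumed bucket width for edit counts


def bucket(count, width, floor=1):
    """Return a 'low-high' label for the width-sized bucket containing count."""
    if count < 1:
        raise ValueError("count must be at least 1")
    upper = -(-count // width) * width      # round up to a multiple of width
    lower = max(upper - width + 1, floor)   # the lowest bucket starts at floor
    return f"{lower}-{upper}"


def make_public(rows):
    """Drop protected and low-count rows, then replace exact counts with ranges."""
    public = []
    for row in rows:
        if row["country_code"] in PROTECTED_COUNTRIES:
            continue
        if row["edits"] < MIN_EDITS:
            continue
        public.append({
            "country_code": row["country_code"],
            "month": row["month"],
            "editors": bucket(row["editors"], EDITOR_BUCKET),
            "edits": bucket(row["edits"], EDIT_BUCKET, floor=MIN_EDITS),
        })
    return public
```

With these assumed widths, `bucket(14, 10)` gives `"11-20"` and `bucket(7, 100, floor=5)` gives `"5-100"`, matching the shape of the example rows.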

@Htriedman

Thank you for being so precise, it really matters in this case!

Ok - I am getting back in touch here as soon as I have the rework of the datasets in place.

@Htriedman @Manuel

  • Files from /home/goransm/Analytics/adhoc/WD_GlobalSouth_202109/_analytics/ are now shared with @Manuel who has an NDA signed with the WMF.
  • Now producing the datasets as described in T291186#7385786 which could then be shared with third parties.

@Htriedman

Please check gs_edits_PUBLIC.csv and gs_active_editors_PUBLIC.csv in /home/goransm/Analytics/adhoc/WD_GlobalSouth_202109/_analytics/: the datasets should now be in compliance with what is described in T291186#7385786. Thank you!

@Manuel As soon as we have a confirmation from @Htriedman I am sharing the files mentioned here with you; in my understanding, you will be able to share them outside of the NDA group.

@GoranSMilovanovic @Manuel

Just checked the datasets that are going to be made available for public release. Everything is in order and you're all set to share them outside of the NDA group. With the mitigations taken, the residual risk level that this data poses to editors is low.

More details on the privacy review and risk assessment here.

@Htriedman Hal, thank you very much for your help and guidance around this issue.
@Manuel I will share the datasets with you via e-mail.

Closing the ticket as resolved.