
Purging of sensitive data for WDQS research
Closed, Resolved · Public

Description

The goal is to update the Oozie job (created per T146064) so it automatically purges data or aggregates/anonymizes it before the 90-day period is over.

  • Review the code and identify what fields need to be purged/updated beyond the 90-day period.
  • The researchers have indicated that they don't need raw IPs. We should update the code to hash IPs, with a salt that we can change every n days (n to be determined); see the sketch after this list.
  • User agent: the research is still at a stage where access to user-agent information is helpful. We will most likely have to purge this field for data beyond 90 days, but we can also consider hashing it if it's not too unique from one request to the next. (We should work with Markus and Alex, cc-ed, to arrive at a solution here.)
  • Identify any other field that needs to be purged/aggregated.
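
For the IP-hashing item above, here is a minimal sketch (not the actual Oozie job) of what hashing with a salt that rotates every n days could look like. The rotation period, the secret, and the helper names are all illustrative assumptions.

```python
# A minimal sketch, not the actual job: hashes an IP with a salt that rotates
# every n days, so hashes are only comparable within one salt window.
import hashlib
import hmac
from datetime import date

SALT_ROTATION_DAYS = 7                      # placeholder for "n" (undecided)
SECRET = b"replace-with-a-managed-secret"   # would need real secret management

def salt_for(day: date) -> bytes:
    """Return a salt that stays constant within each n-day window."""
    window = day.toordinal() // SALT_ROTATION_DAYS
    return SECRET + str(window).encode()

def hash_ip(ip: str, day: date) -> str:
    """HMAC the IP with the window's salt so the raw IP is never stored."""
    return hmac.new(salt_for(day), ip.encode(), hashlib.sha256).hexdigest()

print(hash_ip("203.0.113.7", date(2017, 2, 1)))
```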

Event Timeline

@leila can you give a short status update on this and if we're approaching any urgent deadline in terms of data retention/removal? Thanks!

@DarTar we don't have PII beyond 90 days. Those are purged. I'm working with Nathaniel to update the Oozie job to hash the IP addresses before storing them (given that Markus et al. don't need the raw IPs) and also to look into the anonymization/aggregation needed if we are to keep the data beyond 90 days. In any case, until a usable/agreed-upon aggregation strategy is found and implemented, the researchers are manually purging data as it approaches the 90-day threshold.

leila added a subscriber: AlexKrauseTUD.

@schana per our IRC conversation, I've assigned this task to you. I'll update the task description with more details now.

@Nuria, is this an appropriate example to follow for how to hash and only keep the past 90 days of data? Also, is there an established method for adding a salt to the hash?

Lastly, are there any other sensitive fields that need to be purged?

@schana: no, sorry, that example has little to do with data retention. Hashing can be done with the Hive HASH function. In this case we are not doing it for privacy (it adds none) but rather to avoid cut-and-paste errors with raw IPs. Hashing with a salt is more complicated, as you need a way to manage the salt. Could you do without IPs completely? That would make your life a lot easier.

Both UAs and IPs need to be purged after 90 days. If you need to preserve other data, you are going to need a new job altogether that moves non-PII fields into other tables, plus code to drop older data. See the code here that drops older partitions: https://github.com/wikimedia/analytics-refinery/tree/master/bin

Hopefully this makes sense. Ping us if it doesn't.
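
For reference, a rough sketch of the drop-older-partitions idea pointed to above. This is not the refinery script linked from analytics-refinery/bin; the partition layout, lookback window, and use of the hive CLI are assumptions.

```python
# Illustrative only: drop wmf.wdqs_extract partitions older than 90 days.
from datetime import datetime, timedelta
import subprocess

TABLE = "wmf.wdqs_extract"
RETENTION_DAYS = 90
LOOKBACK_DAYS = 365  # how far back to look for partitions to drop

def drop_old_partitions(now: datetime) -> None:
    threshold = now - timedelta(days=RETENTION_DAYS)
    statements = []
    for offset in range(1, LOOKBACK_DAYS + 1):
        d = threshold - timedelta(days=offset)
        # A partial partition spec drops every hourly sub-partition of that day.
        statements.append(
            f"ALTER TABLE {TABLE} DROP IF EXISTS "
            f"PARTITION (year={d.year}, month={d.month}, day={d.day})"
        )
    subprocess.run(["hive", "-e", ";\n".join(statements) + ";"], check=True)

drop_old_partitions(datetime.utcnow())
```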

> @schana: no, sorry, that example has little to do with data retention. Hashing can be done with the Hive HASH function. In this case we are not doing it for privacy (it adds none) but rather to avoid cut-and-paste errors with raw IPs. Hashing with a salt is more complicated, as you need a way to manage the salt. Could you do without IPs completely? That would make your life a lot easier.

@Nuria The research is still in its early days. We have learned that we don't need raw IPs by now, but at the moment, the researchers are using IPs and UAs as two options for studying buckets of requests. If we had a unique ID for these requests, we wouldn't need to keep (hashed) IP and UA, so if you have suggestions there, let's consider those.

@leila: Regarding data retention, there is no difference when it comes to hashing. Data would need to be deleted at the 90-day mark just the same: hashing only prevents human cut-and-paste errors; it does not increase the privacy of the data.

To compute an ID to use as an identifier of the request while the data is live (up to the 90-day mark), you can use hash(ip, user_agent, accept_language, uri_host); this can be your ID for a session. Again, whether we use tokens or not does not matter for data retention; granular data would need to be deleted.
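
A small sketch of the session identifier described above, with SHA-256 standing in for Hive's hash() and the field names assumed from a webrequest-style schema:

```python
# Sketch: derive a session/request identifier from client fingerprint fields
# so researchers can bucket requests without looking at the raw IP.
import hashlib

def session_id(ip: str, user_agent: str, accept_language: str, uri_host: str) -> str:
    key = "\t".join([ip, user_agent, accept_language, uri_host])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Requests with the same fingerprint fields get the same identifier.
print(session_id("203.0.113.7", "Mozilla/5.0", "en-US", "query.wikidata.org"))
```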

> @leila: Regarding data retention, there is no difference when it comes to hashing. Data would need to be deleted at the 90-day mark just the same: hashing only prevents human cut-and-paste errors; it does not increase the privacy of the data.

Is this the case even if the salt changes before the end of the 90-day deadline? If the salt changes, isn't the assumption that the raw IP cannot be retrieved anyway? Or is that not reliable enough?

> To compute an ID to use as an identifier of the request while the data is live (up to the 90-day mark), you can use hash(ip, user_agent, accept_language, uri_host); this can be your ID for a session. Again, whether we use tokens or not does not matter for data retention; granular data would need to be deleted.

yup. agreed.

> Is this the case even if the salt changes before the end of the 90-day deadline? If the salt changes, isn't the assumption that the raw IP cannot be retrieved anyway? Or is that not reliable enough?

We do not have such a system (a rotating salt) in place, but yes, even in that case. A 90-day salt means that you are still assigning hashes to requests, some of which might be constant for all of those 90 days, which prevents us from retaining those records long term. The IP cannot be retrieved with a proper HMAC, but the browser session (all records attached to a hash) could very well be.
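
A minimal illustration of this point: while the salt stays constant within a window, a proper HMAC hides the raw IP, but the same client still maps to the same token, so its records remain linkable. The salt value and IP below are made up.

```python
# Within one salt window, the same client always yields the same token,
# so all of its records stay grouped even though the raw IP is hidden.
import hashlib
import hmac

salt = b"salt-for-this-90-day-window"   # constant for the whole window

def token(ip: str) -> str:
    return hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()

first = token("203.0.113.7")   # a request early in the window
later = token("203.0.113.7")   # another request weeks later, same window
print(first == later)          # True: this client's records remain linkable
```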

Yeah. I see what you say, @Nuria. Thanks!

@mkroetzsch for the current state of the research, can you live with IPs and UAs being purged after 90 days? If this is a constraint, let's set up a call for me to understand the use case better and see if we can find another way to store useful information for you that is not PII. If we assess that there is no way to do the research properly without IP and UA over a period longer than 90 days, we will have to request an exception for this research from Legal.

@leila Can you address whether non-PII data needs to be retained past 90 days?

@Nuria How is the code to drop older partitions invoked?

@schana let's keep all non-PII data past 90 days.

Expected ETA: end of current week (depending on availability on Analytics' end).

Wait, is there anything for us to do here? My impression was that @schana was handling the changes; let us know if that is not the case.

> Wait, is there anything for us to do here? My impression was that @schana was handling the changes; let us know if that is not the case.

I think review will be needed by Analytics before code is merged.

Change 335211 had a related patch set uploaded (by Nschaaf):
(in progress) Store anonymized and purge sensitive data for WDQS

https://gerrit.wikimedia.org/r/335211

I've pushed an initial attempt, but need some guidance on how to invoke the purge script as well as testing the job (I didn't have access last time, but do now).

Change 335437 had a related patch set uploaded (by Nschaaf):
(in progress) Drop wdqs_extract partitions older than 90 days

https://gerrit.wikimedia.org/r/335437

Change 335211 abandoned by Nschaaf:
(in progress) Store sanitized data for WDQS

Reason:
Per the researchers, long term retention of data is unneeded.

https://gerrit.wikimedia.org/r/335211

The puppet change to add the cron for deleting partitions older than 90 days from the wdqs_extract table is ready for review. https://gerrit.wikimedia.org/r/#/c/335437/

@Nuria @Ottomata Please let me know if anything else is required before the puppet patch can be reviewed/merged.

Ah, @schana thanks, will look at this today.

Change 335437 merged by Ottomata:
Drop wdqs_extract partitions older than 90 days

https://gerrit.wikimedia.org/r/335437

Change 337638 had a related patch set uploaded (by Ottomata):
Add --partition-type option to refinery-drop-hourly-partitions script

https://gerrit.wikimedia.org/r/337638

Change 337638 merged by Ottomata:
Add --partition-type option to refinery-drop-hourly-partitions script

https://gerrit.wikimedia.org/r/337638

Change 337639 had a related patch set uploaded (by Ottomata):
Use --partition-type hive for refinery-drop-wdqs-extract-partitions job

https://gerrit.wikimedia.org/r/337639

Change 337639 merged by Ottomata:
Use --partition-type hive for refinery-drop-wdqs-extract-partitions job

https://gerrit.wikimedia.org/r/337639

Did a manual first run:

2017-02-14T20:00:07 INFO   Dropping 1845 partitions from table wmf.wdqs_extract

But! Ah, this script did not work for hive-style partitions. I submitted a couple of changes to make sure it also deleted the directories from HDFS. And:

2017-02-14T20:39:36 INFO   Removing 1845 partition directories for table wmf.wdqs_extract from /wmf/data/wmf/wdqs_extract.

Great!
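
As a footnote to the hive-style partition fix above, a hedged sketch of the extra cleanup step: for a layout like .../year=2017/month=1/day=10/hour=3, dropping the Hive partition alone can leave the data directory behind, so the job removes the matching directories as well (as the log line above shows). The hourly partition layout and the hdfs CLI invocation are assumptions; the base path matches the log line.

```python
# Sketch only: remove the HDFS directory for one hive-style partition
# of wmf.wdqs_extract.
import subprocess

BASE = "/wmf/data/wmf/wdqs_extract"

def remove_partition_dir(year: int, month: int, day: int, hour: int) -> None:
    path = f"{BASE}/year={year}/month={month}/day={day}/hour={hour}"
    # Recursively delete the partition directory, bypassing the HDFS trash.
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)

remove_partition_dir(2016, 11, 1, 0)
```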