
Purging of sensitive data for WDQS research
Closed, Resolved · Public

Description

The goal is to update the Oozie job (created per T146064) so it automatically purges data or aggregates/anonymizes it before the 90-day period is over.

  • Review the code and identify what fields need to be purged/updated beyond the 90-day period.
  • The researchers have indicated that they don't need raw IPs. We should update the code to hash IPs, with a salt that we can change every n days (n to be determined); see the sketch after this list.
  • User agent: the research is still at a stage where access to user-agent information is helpful. We will most likely have to purge this field for data beyond 90 days, but we can also consider hashing it if it's not too unique from one request to the next. (We should work with Markus and Alex, cc-ed, to arrive at a solution here.)
  • Identify any other field that needs to be purged/aggregated.
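
For the IP-hashing item above, here is a minimal sketch (not the actual Oozie job) of what hashing with a salt that rotates every n days could look like. The rotation period, the secret, and the helper names are all illustrative assumptions.

```python
# A minimal sketch, not the actual job: hashes an IP with a salt that rotates
# every n days, so hashes are only comparable within one salt window.
import hashlib
import hmac
from datetime import date

SALT_ROTATION_DAYS = 7                      # placeholder for "n" (undecided)
SECRET = b"replace-with-a-managed-secret"   # would need real secret management

def salt_for(day: date) -> bytes:
    """Return a salt that stays constant within each n-day window."""
    window = day.toordinal() // SALT_ROTATION_DAYS
    return SECRET + str(window).encode()

def hash_ip(ip: str, day: date) -> str:
    """HMAC the IP with the window's salt so the raw IP is never stored."""
    return hmac.new(salt_for(day), ip.encode(), hashlib.sha256).hexdigest()

print(hash_ip("203.0.113.7", date(2017, 2, 1)))
```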

Event Timeline

@leila can you give a short status update on this and if we're approaching any urgent deadline in terms of data retention/removal? Thanks!

@DarTar we don't have PII beyond 90 days. Those are purged. I'm working with Nathaniel to update the Oozie job to hash the IP addresses before storing them (given that Markus et al. don't need the raw IPs) and also to look into the anonymization/aggregation needed if we are to keep the data beyond 90 days. In any case, until a usable/agreed-upon aggregation strategy is found and implemented, the researchers are manually purging data as it approaches the 90-day threshold.

leila added a subscriber: AlexKrauseTUD.

@schana per our IRC conversation, I've assigned this task to you. I'll update the task description with more details now.

@Nuria, is this an appropriate example to follow for how to hash and only keep the past 90 days of data? Also, is there an established method for adding a salt to the hash?

Lastly, are there any other sensitive fields that need to be purged?

@schana: no, sorry, that example has little to do with data retention. Hashing can be done with the Hive HASH function. In this case we are not doing it for privacy (it adds none) but rather to avoid cut-and-paste errors with raw IPs. Hashing with a salt is more complicated, as you need a way to manage the salt. Could you do without IPs completely? That would make your life a lot easier.

Both UAs and IPs need to be purged after 90 days. If you need to preserve other data, you are going to need a new job altogether that moves non-PII fields into other tables, plus code to drop older data. See the code here that drops older partitions: https://github.com/wikimedia/analytics-refinery/tree/master/bin

Hopefully this makes sense. Ping us if it doesn't.
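
For reference, a rough sketch of the drop-older-partitions idea pointed to above. This is not the refinery script linked from analytics-refinery/bin; the partition layout, lookback window, and use of the hive CLI are assumptions.

```python
# Illustrative only: drop wmf.wdqs_extract partitions older than 90 days.
from datetime import datetime, timedelta
import subprocess

TABLE = "wmf.wdqs_extract"
RETENTION_DAYS = 90
LOOKBACK_DAYS = 365  # how far back to look for partitions to drop

def drop_old_partitions(now: datetime) -> None:
    threshold = now - timedelta(days=RETENTION_DAYS)
    statements = []
    for offset in range(1, LOOKBACK_DAYS + 1):
        d = threshold - timedelta(days=offset)
        # A partial partition spec drops every hourly sub-partition of that day.
        statements.append(
            f"ALTER TABLE {TABLE} DROP IF EXISTS "
            f"PARTITION (year={d.year}, month={d.month}, day={d.day})"
        )
    subprocess.run(["hive", "-e", ";\n".join(statements) + ";"], check=True)

drop_old_partitions(datetime.utcnow())
```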

> @schana: no, sorry, that example has little to do with data retention. Hashing can be done with the Hive HASH function. In this case we are not doing it for privacy (it adds none) but rather to avoid cut-and-paste errors with raw IPs. Hashing with a salt is more complicated, as you need a way to manage the salt. Could you do without IPs completely? That would make your life a lot easier.

@Nuria The research is still in its early days. We have learned that we don't need raw IPs by now, but at the moment, the researchers are using IPs and UAs as two options for studying buckets of requests. If we had a unique ID for these requests, we wouldn't need to keep (hashed) IP and UA, so if you have suggestions there, let's consider those.

@leila: Regarding data retention, there is no difference when it comes to hashing. Data would need to be deleted at the 90-day mark just the same: hashing only prevents human cut-and-paste errors; it does not increase the privacy of the data.

To compute an ID to use as an identifier of the request while the data is live (up to the 90-day mark), you can use hash(ip, user_agent, accept_language, uri_host); this can be your ID for a session. Again, whether we use tokens or not does not matter for data retention; granular data would need to be deleted.
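
A small sketch of the session identifier described above, with SHA-256 standing in for Hive's hash() and the field names assumed from a webrequest-style schema:

```python
# Sketch: derive a session/request identifier from client fingerprint fields
# so researchers can bucket requests without looking at the raw IP.
import hashlib

def session_id(ip: str, user_agent: str, accept_language: str, uri_host: str) -> str:
    key = "\t".join([ip, user_agent, accept_language, uri_host])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Requests with the same fingerprint fields get the same identifier.
print(session_id("203.0.113.7", "Mozilla/5.0", "en-US", "query.wikidata.org"))
```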

> @leila: Regarding data retention, there is no difference when it comes to hashing. Data would need to be deleted at the 90-day mark just the same: hashing only prevents human cut-and-paste errors; it does not increase the privacy of the data.

Is this the case even if the salt changes before the end of the 90-day deadline? If the salt changes, isn't the assumption that the raw IP cannot be retrieved anyway? Or is that not reliable enough?

> To compute an ID to use as an identifier of the request while the data is live (up to the 90-day mark), you can use hash(ip, user_agent, accept_language, uri_host); this can be your ID for a session. Again, whether we use tokens or not does not matter for data retention; granular data would need to be deleted.

yup. agreed.

> Is this the case even if the salt changes before the end of the 90-day deadline? If the salt changes, isn't the assumption that the raw IP cannot be retrieved anyway? Or is that not reliable enough?

We do not have such a system (a rotating salt) in place, but yes, even in that case. A 90-day salt means that you are still assigning hashes to requests, some of which might be constant for all of those 90 days, which prevents us from retaining those records long term. The IP cannot be retrieved with a proper HMAC, but the browser session (all records attached to a hash) could very well be.
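
A minimal illustration of this point: while the salt stays constant within a window, a proper HMAC hides the raw IP, but the same client still maps to the same token, so its records remain linkable. The salt value and IP below are made up.

```python
# Within one salt window, the same client always yields the same token,
# so all of its records stay grouped even though the raw IP is hidden.
import hashlib
import hmac

salt = b"salt-for-this-90-day-window"   # constant for the whole window

def token(ip: str) -> str:
    return hmac.new(salt, ip.encode(), hashlib.sha256).hexdigest()

first = token("203.0.113.7")   # a request early in the window
later = token("203.0.113.7")   # another request weeks later, same window
print(first == later)          # True: this client's records remain linkable
```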

Yeah. I see what you say, @Nuria. Thanks!

@mkroetzsch for the current state of the research, can you live with IPs and UAs being purged after 90 days? If this is a constraint, let's set up a call for me to understand the use case better and see if we can find another way to store useful information for you that is not PII. If we assess that there is no way to do the research properly without IP and UA over a period longer than 90 days, we will have to request an exception for this research from Legal.

@leila Can you address whether non-PII data needs to be retained past 90 days?

@Nuria How is the code to drop older partitions invoked?

@schana let's keep all non-PII data past 90 days.

Expected ETA: end of current week (depending on availability on Analytics' end).

Wait, is there anything for us to do here? My impression was that @schana was handling the changes; let us know if that is not the case.

> Wait, is there anything for us to do here? My impression was that @schana was handling the changes; let us know if that is not the case.

I think review will be needed by Analytics before code is merged.

Change 335211 had a related patch set uploaded (by Nschaaf):
(in progress) Store anonymized and purge sensitive data for WDQS

https://gerrit.wikimedia.org/r/335211

I've pushed an initial attempt, but need some guidance on how to invoke the purge script as well as testing the job (I didn't have access last time, but do now).

Change 335437 had a related patch set uploaded (by Nschaaf):
(in progress) Drop wdqs_extract partitions older than 90 days

https://gerrit.wikimedia.org/r/335437

Change 335211 abandoned by Nschaaf:
(in progress) Store sanitized data for WDQS

Reason:
Per the researchers, long term retention of data is unneeded.

https://gerrit.wikimedia.org/r/335211

The puppet change to add the cron for deleting partitions older than 90 days from the wdqs_extract table is ready for review. https://gerrit.wikimedia.org/r/#/c/335437/

@Nuria @Ottomata Please let me know if anything else is required before the puppet patch can be reviewed/merged.

Ah, @schana thanks, will look at this today.

Change 335437 merged by Ottomata:
Drop wdqs_extract partitions older than 90 days

https://gerrit.wikimedia.org/r/335437

Change 337638 had a related patch set uploaded (by Ottomata):
Add --partition-type option to refinery-drop-hourly-partitions script

https://gerrit.wikimedia.org/r/337638

Change 337638 merged by Ottomata:
Add --partition-type option to refinery-drop-hourly-partitions script

https://gerrit.wikimedia.org/r/337638

Change 337639 had a related patch set uploaded (by Ottomata):
Use --partition-type hive for refinery-drop-wdqs-extract-partitions job

https://gerrit.wikimedia.org/r/337639

Change 337639 merged by Ottomata:
Use --partition-type hive for refinery-drop-wdqs-extract-partitions job

https://gerrit.wikimedia.org/r/337639

Did a manual first run:

2017-02-14T20:00:07 INFO   Dropping 1845 partitions from table wmf.wdqs_extract

But! Ah, this script did not work for hive-style partitions. I submitted a couple of changes to make sure it also deleted the directories from HDFS. And:

2017-02-14T20:39:36 INFO   Removing 1845 partition directories for table wmf.wdqs_extract from /wmf/data/wmf/wdqs_extract.

Great!
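
As a footnote to the hive-style partition fix above, a hedged sketch of the extra cleanup step: for a layout like .../year=2017/month=1/day=10/hour=3, dropping the Hive partition alone can leave the data directory behind, so the job removes the matching directories as well (as the log line above shows). The hourly partition layout and the hdfs CLI invocation are assumptions; the base path matches the log line.

```python
# Sketch only: remove the HDFS directory for one hive-style partition
# of wmf.wdqs_extract.
import subprocess

BASE = "/wmf/data/wmf/wdqs_extract"

def remove_partition_dir(year: int, month: int, day: int, hour: int) -> None:
    path = f"{BASE}/year={year}/month={month}/day={day}/hour={hour}"
    # Recursively delete the partition directory, bypassing the HDFS trash.
    subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)

remove_partition_dir(2016, 11, 1, 0)
```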