
Ensure discovery.query_clicks_* data is purged per privacy policy
Closed, Resolved | Public

Description

The click data is unaggregated and includes PII, so it needs to be deleted after 90 days to match our privacy policy. The hourly table is maintained by Hive and typically holds only a day of data. The daily table is currently pruned by manually calling a script, which happens only intermittently.

The task is to make the patches necessary for the script to be called automatically on a daily basis. We might as well purge both the hourly and daily tables from the script, just in case our Oozie pipeline fails to delete some hours from Hive (this has happened before).
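The retention rule described above can be sketched as follows. This is only an illustration, not the refinery script itself: the function names, the `year`/`month`/`day` partition layout, and the table name `discovery.query_clicks_daily` are assumptions made for the example. It picks out daily partitions older than 90 days and builds a HiveQL `ALTER TABLE ... DROP IF EXISTS PARTITION` statement for each.

```python
from datetime import date, timedelta

# 90 days comes from the privacy policy mentioned in the task description.
RETENTION_DAYS = 90


def partitions_to_drop(partitions, today, retention_days=RETENTION_DAYS):
    """Return, in ascending order, partition dates older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


def drop_statement(table, partition_date):
    """Build a HiveQL statement that drops one daily partition.

    Assumes a hypothetical year/month/day partition layout.
    """
    return (
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION "
        f"(year={partition_date.year}, month={partition_date.month}, "
        f"day={partition_date.day});"
    )


if __name__ == "__main__":
    # Hypothetical partition dates, for illustration only.
    existing = [date(2018, 1, 15), date(2018, 3, 2), date(2018, 5, 30)]
    for d in partitions_to_drop(existing, today=date(2018, 6, 1)):
        print(drop_statement("discovery.query_clicks_daily", d))
```

Running the same selection against both the hourly and daily tables would cover the case where the pipeline misses some hours; `DROP IF EXISTS` keeps the statement safe to repeat when a partition is already gone.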

Event Timeline

Change 419949 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[analytics/refinery@master] Support hourly or daily partition dropping

https://gerrit.wikimedia.org/r/419949

Change 419949 abandoned by EBernhardson:
Support hourly or daily partition dropping

Reason:
After some testing, it looks like refinery-drop-hive-partitions already does exactly what this patch was trying to change the script to do.

https://gerrit.wikimedia.org/r/419949

Change 419954 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Drop query_clicks partitions after 90 days

https://gerrit.wikimedia.org/r/419954

Once the latest refinery version is deployed we can merge the cron that regularly drops old data.

debt triaged this task as Medium priority. May 1 2018, 5:27 PM

I checked and the code was deployed, but a new bug in refinery prevented the partition drop from working. https://gerrit.wikimedia.org/r/438034 has been merged, and once it is deployed we should be ready to go.

EBernhardson added a subscriber: Gehel.

@Gehel It looks like the analytics deploy has happened. I tested the script today and everything seems to work great. The linked gerrit patch for puppet (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/419954/) should now be ready for deployment.

Change 419954 merged by Gehel:
[operations/puppet@production] Drop query_clicks partitions after 90 days

https://gerrit.wikimedia.org/r/419954