Remove data from Hadoop's HDFS as part of the user offboard workflow
Closed, Resolved · Public

Description

The Analytics team periodically has to go through the /user directory in HDFS to check whether any PII data is kept beyond the retention guidelines. Sometimes we find that users who are no longer active still hold data on HDFS (and possibly databases in Hive). It would be great if, as part of VerboseOffboard, we could add a step to ensure that no HDFS data, Hive databases, etc. are retained unless strictly necessary. This new step could be applied only to users in certain groups (like analytics-privatedata-users). How does this proposal sound?
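
As a rough illustration of the group-based filter proposed above, a minimal Python sketch could look like the following. The function names and the contents of the group set are assumptions made for illustration; only analytics-privatedata-users is named in this task, and this is not the actual offboarding code.

```python
# Hypothetical sketch: decide whether an offboarded user needs an HDFS/Hive check.
# Assumption: only analytics-privatedata-users is named in this task; other
# groups that grant Hadoop access would have to be added to this set.
HADOOP_ACCESS_GROUPS = frozenset({"analytics-privatedata-users"})


def groups_granting_hadoop_access(user_groups):
    """Return the sorted subset of user_groups assumed to grant Hadoop access."""
    return sorted(set(user_groups) & HADOOP_ACCESS_GROUPS)


def needs_hdfs_cleanup_check(user_groups):
    """Return True when at least one of the user's groups grants Hadoop access."""
    return bool(groups_granting_hadoop_access(user_groups))
```

For example, needs_hdfs_cleanup_check(["ops", "analytics-privatedata-users"]) would flag the user for a cleanup check, while a user who is only in "ops" would be skipped.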

Event Timeline

elukey created this object with visibility "Custom Policy".
elukey changed the visibility from "Custom Policy" to "Public (No Login Required)". Jul 26 2018, 9:43 AM
herron triaged this task as Medium priority. Aug 3 2018, 9:09 PM

Users might leave PII data in the following places:

  • /home/$USER dir on the stat boxes
  • /user/$USER dir on HDFS
  • Hive databases on HDFS
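
The three locations above could be turned into a per-user checklist of inspection commands, sketched below in Python. The function name pii_check_commands is hypothetical, the commands are only built and never executed, and the Hive step simply lists all databases for manual review, since this task does not say how a database maps to its owner.

```python
def pii_check_commands(user):
    """Build (but do not run) commands that inspect one user's possible PII locations.

    Illustrative sketch only: which hosts to run these on is not specified
    by this task and is left to the operator.
    """
    # Reject empty or path-like names so the generated paths stay under
    # /home and /user.
    if not user or "/" in user or user in (".", ".."):
        raise ValueError(f"invalid user name: {user!r}")
    return [
        # Home directory; to be run on each stat box.
        ["ls", "-la", f"/home/{user}"],
        # Summarized, human-readable size of the user's HDFS directory.
        ["hdfs", "dfs", "-du", "-s", "-h", f"/user/{user}"],
        # Lists every Hive database; ownership has to be checked by hand.
        ["hive", "-e", "SHOW DATABASES;"],
    ]
```

Returning argument lists instead of shell strings keeps the sketch free of quoting problems if the commands are later passed to something like subprocess.run.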

Change 459558 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Print group memberships which granted Hadoop access to check for HDFS cleanups

https://gerrit.wikimedia.org/r/459558

Change 459558 merged by Muehlenhoff:
[operations/puppet@production] Print group memberships which granted Hadoop access to check for HDFS cleanups

https://gerrit.wikimedia.org/r/459558