The Analytics team periodically has to go through the /user directory in HDFS to check whether any PII data is kept beyond the retention guidelines. Sometimes we find that users who are no longer active still hold data on HDFS (and possibly databases in Hive). It would be great if, as part of VerboseOffboard, we could also add a step to ensure that no data in HDFS, Hive databases, etc. is retained unless strictly necessary. This new step could be applied only to users in certain groups (like analytics-privatedata-users). How does this proposal sound?
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Print group memberships which granted Hadoop access to check for HDFS cleanups | operations/puppet | production | +126 -33
Status | Subtype | Assigned | Task
---|---|---|---
Restricted Task | | |
Restricted Task | | |
Resolved | | None | T200312 Remove data from Hadoop's HDFS as part of the user offboard workflow
Event Timeline
Users might leave PII data in the following places:
- /home/$USER dir on the stat boxes
- /user/$USER dir on HDFS
- Hive databases on HDFS
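As a rough illustration of the three locations above, the check could be sketched as a helper that emits the commands an operator would run for a departing user. This is only a sketch under assumptions: the function name `cleanup_checks`, the `HADOOP_GROUPS` set, and the example user `jdoe` are hypothetical, and only `analytics-privatedata-users` is a group actually named in this task. Hive databases are not tied to a predictable path here, so the sketch just lists them for manual review.

```python
import shlex

# Hypothetical set of groups that grant Hadoop access; only
# analytics-privatedata-users is named in this task.
HADOOP_GROUPS = {"analytics-privatedata-users"}


def cleanup_checks(user, groups):
    """Return shell commands to look for data left behind by `user`.

    Returns an empty list when none of the user's groups grant Hadoop access.
    """
    if not HADOOP_GROUPS.intersection(groups):
        return []
    home_dir = shlex.quote(f"/home/{user}")
    hdfs_dir = shlex.quote(f"/user/{user}")
    return [
        f"du -sh {home_dir}",              # home dir, to be run on each stat box
        f"hdfs dfs -du -s -h {hdfs_dir}",  # user dir on HDFS
        "hive -e 'SHOW DATABASES;'",       # Hive databases need manual review
    ]


if __name__ == "__main__":
    for cmd in cleanup_checks("jdoe", ["analytics-privatedata-users"]):
        print(cmd)
```

The helper only prints commands rather than deleting anything, since the proposal is to retain data when strictly necessary and that decision needs a human.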
Change 459558 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Print group memberships which granted Hadoop access to check for HDFS cleanups
Change 459558 merged by Muehlenhoff:
[operations/puppet@production] Print group memberships which granted Hadoop access to check for HDFS cleanups