With its latest additions, the netflow data set contains a pair of privacy-sensitive fields.
This means we should either delete or sanitize this data set within 90 days of collection.
This task is to implement such deletion/sanitization for the netflow data in Hive/HDFS.
This task won't affect the netflow data in Druid. There are two options:
- Delete the data completely after 90 days. This would be done by setting up a timer in puppet::data_purge.pp. Quite easy to implement, but with a big downside: we'd lose the data very early.
- Sanitize the data using a process similar to EventLoggingSanitization. This would nullify the privacy-sensitive fields after 90 days and keep the rest intact indefinitely. This approach requires a bit more work, but if we implement it generically, it would be useful for other data sets in the same database that need sanitization. I will comment on this task with ideas for a solution.
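As a rough illustration of the second option, the sanitization job could periodically select the daily partitions that have just crossed the 90-day threshold and overwrite them with the privacy-sensitive columns nulled out. The sketch below shows the two pieces of logic such a job would need: building the nullifying SELECT list and computing which partition dates to process. All field names and the look-back window are hypothetical placeholders, not the actual netflow schema:

```python
from datetime import date, timedelta

# Hypothetical schema; the real netflow table's columns would go here.
ALL_FIELDS = ["ip_src", "ip_dst", "bytes", "packets", "as_src"]
# Hypothetical privacy-sensitive fields to be nullified after 90 days.
PRIVACY_FIELDS = {"ip_src", "ip_dst"}


def sanitize_select(fields, privacy_fields):
    """Build a SELECT list that nullifies the privacy-sensitive columns."""
    return ", ".join(
        f"NULL AS {f}" if f in privacy_fields else f for f in fields
    )


def partitions_to_sanitize(today, retention_days=90, lookback_days=7):
    """Partition dates that crossed the retention threshold recently.

    Re-checking a small look-back window makes the job idempotent and
    tolerant of missed runs.
    """
    cutoff = today - timedelta(days=retention_days)
    return [cutoff - timedelta(days=i) for i in range(lookback_days)]


def sanitize_statement(partition_date):
    """Assemble an INSERT OVERWRITE statement for one daily partition."""
    return (
        "INSERT OVERWRITE TABLE netflow "
        f"PARTITION (day='{partition_date.isoformat()}') "
        f"SELECT {sanitize_select(ALL_FIELDS, PRIVACY_FIELDS)} "
        f"FROM netflow WHERE day='{partition_date.isoformat()}'"
    )
```

Because INSERT OVERWRITE on a partition is atomic from the reader's point of view, the non-sensitive columns stay queryable throughout; a generic version would take the table name and field lists as parameters so other data sets in the database could reuse it.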