The goal is to update the Oozie job (created per T146064) so it automatically purges data or aggregates/anonymizes it before the 90-day period is over.
- Review the code and identify what fields need to be purged/updated beyond the 90-day period.
- The researchers have indicated that they don't need raw IPs. We should update the code to hash IPs, with a salt that we can change every n days (n to be determined).
- User agent: the research is still at a stage where access to user agent information is helpful. We will most likely have to purge this field for data older than 90 days, but we can also consider hashing it if it is not too unique from one request to the next. (We should work with Markus and Alex, cc-ed, to arrive at a solution here.)
- Identify any other fields that need to be purged/aggregated.
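
As a starting point for the IP-hashing item above, here is a minimal sketch of keyed hashing with a salt that rotates every n days. It is not the existing job's code: the names `ROTATION_DAYS`, `salt_for` and `hash_ip` are hypothetical, the value 30 is only a placeholder for the undecided n, and the in-memory salt store stands in for whatever secret storage the job would actually use.

```python
import hashlib
import hmac
import secrets
from datetime import date

# Placeholder for "n days"; the real value is still to be determined.
ROTATION_DAYS = 30

# Hypothetical in-memory salt store, keyed by (rotation length, window index).
# A real job would need to keep salts somewhere private and persistent.
_salts = {}


def salt_for(day, rotation_days=ROTATION_DAYS):
    """Return the salt for the rotation window that contains `day`."""
    window = day.toordinal() // rotation_days
    key = (rotation_days, window)
    if key not in _salts:
        # Generate a fresh random salt the first time a window is seen.
        _salts[key] = secrets.token_bytes(32)
    return _salts[key]


def hash_ip(ip, day, rotation_days=ROTATION_DAYS):
    """Replace a raw IP with an HMAC-SHA256 digest keyed by the window's salt."""
    salt = salt_for(day, rotation_days)
    return hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

With this approach the same IP maps to the same hash within one rotation window, so requests can still be grouped, while hashes from different windows are not comparable. Deleting a window's salt once it is no longer needed should also make that window's hashes impractical to recompute from candidate IPs.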