Gobblin stores various files in HDFS:
- /wmf/gobblin/metrics/[JOB_FOLDER] -- Each folder contains metrics from each Gobblin task, stored in small files
- /wmf/gobblin/task_working/[JOB_GROUP]/[JOB_FOLDER] -- Each folder contains empty subfolders used by Gobblin during execution.
- /wmf/gobblin/state_store/[JOB_GROUP] -- Each folder contains many files, each holding a job state as JSON in a sequence file.
Note: we have 4 JOB_GROUPs: event_default, eventlogging_legacy, netflow, webrequest.
We need to drop old files regularly, as many jobs occur (webrequest alone runs 6 per hour) and the data is not useful past a few days.
I suggest we keep data for 7 days, which is the usual data-retention period in Kafka.
One solution would be to move the folders to /wmf/tmp/analytics, where they would be cleaned regularly. Another would be to add hdfs-cleaner jobs for those folders (for simplicity, we could have a single job using /wmf/gobblin as the base folder, cleaning data older than 7 days?)
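As a rough illustration of the single-job cleanup idea, the sketch below computes a 7-day cutoff and shows how candidate deletions could be listed. The `hdfs dfs` flags are standard, but the path layout is taken from this note, and the actual deletion command is left commented out (it requires a live HDFS client and should be reviewed before running); this is a sketch, not the actual hdfs-cleaner implementation.

```shell
#!/usr/bin/env bash
# Sketch of a cleaner over the base folder /wmf/gobblin (dry run).
set -euo pipefail

RETENTION_DAYS=7
# GNU date: cutoff in YYYY-MM-DD form, e.g. 7 days before today.
CUTOFF=$(date -d "-${RETENTION_DAYS} days" +%F)
BASE=/wmf/gobblin

echo "Cutoff date: ${CUTOFF}"

# `hdfs dfs -ls -R` puts the modification date (YYYY-MM-DD) in field 6
# and the path in field 8; lexicographic comparison works for this date
# format. Commented out because it needs a live HDFS client:
# hdfs dfs -ls -R "${BASE}" \
#   | awk -v c="${CUTOFF}" '$6 < c { print "hdfs dfs -rm -r -skipTrash " $8 }'
```

Printing the `-rm` commands first (instead of piping them straight to a shell) makes it easy to sanity-check what a 7-day retention would actually delete across the four JOB_GROUPs.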