
Purge gobblin files
Closed, Resolved (Public)

Description

Gobblin stores several kinds of files in HDFS:

  • /wmf/gobblin/metrics/[JOB_FOLDER] -- Each folder contains small metrics files, written by each Gobblin task.
  • /wmf/gobblin/task_working/[JOB_GROUP]/[JOB_FOLDER] -- Each folder contains empty subfolders used by Gobblin during execution.
  • /wmf/gobblin/state_store/[JOB_GROUP] -- Each folder contains many files, each holding a job status as JSON in a sequence file.

Note: we have 4 JOB_GROUPs: event_default, eventlogging_legacy, netflow, webrequest.
We need to drop old files regularly, as many jobs run (webrequest has 6 per hour) and the data is not useful past a few days.
I suggest we keep data for 7 days, which is the usual data retention in Kafka.
One solution would be to move the folders to /wmf/tmp/analytics, where they would be cleaned regularly. Another is to add hdfs-cleaner jobs for those folders (to simplify, we could have a single job that uses the main folder /wmf/gobblin as its base and cleans data older than 7 days?)
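
To make the 7-day retention idea concrete, here is a minimal sketch of the selection step in Python. It walks a local directory as a stand-in for HDFS and is not the hdfs-cleaner implementation; the function name `files_older_than` and the constant `SECONDS_PER_DAY` are hypothetical, introduced only for illustration.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60  # hypothetical helper constant


def files_older_than(base_dir, max_age_days, now=None):
    """Return the files under base_dir last modified more than max_age_days ago.

    Assumption: a local directory tree stands in for an HDFS base path
    such as /wmf/gobblin. Nothing is deleted here; this only selects.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    old_files = []
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            path = Path(root) / name
            # Keep anything modified at or after the cutoff.
            if path.stat().st_mtime < cutoff:
                old_files.append(path)
    return sorted(old_files)
```

Calling `files_older_than(base, 7)` on a single base folder matches the "one job for the whole tree" option; a real job would then remove the selected paths, ideally through the trash so that a mistake can be undone.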

Related Objects

Status    Subtype  Assigned     Task
Resolved           JAllemandou
Resolved           JAllemandou

Event Timeline

Let's just hdfs-cleaner rm them with trash :)

odimitrijevic moved this task from Incoming to Operational Excellence on the Analytics board.

Change 724411 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery/source@master] Remove /wmf/gobblin from hdfs_cleaner disallowlist

https://gerrit.wikimedia.org/r/724411

Change 724412 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update hdfs-cleaner jar for disallowlist change

https://gerrit.wikimedia.org/r/724412

Change 724413 had a related patch set uploaded (by Joal; author: Joal):

[operations/puppet@production] Add analytics purge for Gobblin old files

https://gerrit.wikimedia.org/r/724413

Change 724411 merged by Ottomata:

[analytics/refinery/source@master] Remove /wmf/gobblin from HDFSCleaner disallowlist

https://gerrit.wikimedia.org/r/724411

Change 724412 merged by Razzi:

[analytics/refinery@master] Update hdfs-cleaner jar for disallowlist change

https://gerrit.wikimedia.org/r/724412

Change 724413 merged by Razzi:

[operations/puppet@production] Add analytics purge for Gobblin old files

https://gerrit.wikimedia.org/r/724413

Change 732610 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Absent the Analytics hdfs-cleaner-gobblin timer

https://gerrit.wikimedia.org/r/732610

Change 732610 merged by Elukey:

[operations/puppet@production] Absent the Analytics hdfs-cleaner-gobblin timer

https://gerrit.wikimedia.org/r/732610

Back to "In Progress" to assess whether the deletion script is stable enough and doesn't break Gobblin on a regular basis.

Change 734993 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] Fix bug in HDFSCleaner where directories with only directories would always be deleted

https://gerrit.wikimedia.org/r/734993

Change 734993 merged by jenkins-bot:

[analytics/refinery/source@master] Fix bug in HDFSCleaner where directories with only directories would always be deleted

https://gerrit.wikimedia.org/r/734993
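
For readers unfamiliar with this class of bug: a cleaner that decides "this directory holds no files, so delete it" without looking inside its subdirectories will remove a parent whose children still hold fresh data. The Python sketch below shows one way to avoid that, using a post-order walk on a local directory. It is an illustration under assumptions, not the code from change 734993; the name `purge_old` and its parameters are hypothetical.

```python
import os
from pathlib import Path


def purge_old(path, cutoff, is_root=True):
    """Delete entries older than cutoff; return True if path itself was removed.

    Assumption: a local tree stands in for HDFS. A directory is removed
    only when every child was removed first and the directory is old too.
    """
    path = Path(path)
    if path.is_file():
        if path.stat().st_mtime < cutoff:
            path.unlink()
            return True
        return False
    if not path.is_dir():
        return False
    # Read the directory mtime before touching children: removing a child
    # updates the parent's modification time.
    dir_mtime = path.stat().st_mtime
    # Build the full list so every child is visited (no short-circuit).
    results = [purge_old(child, cutoff, False) for child in sorted(path.iterdir())]
    # A directory containing only directories is kept if any of them survived.
    if not is_root and all(results) and dir_mtime < cutoff:
        path.rmdir()
        return True
    return False
```

The base folder is never removed (`is_root=True`), so a job pointed at a path like /wmf/gobblin leaves the top-level layout in place.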

Change 735001 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Update jar version of hdfs-cleaner script

https://gerrit.wikimedia.org/r/735001

Change 735001 merged by Ottomata:

[analytics/refinery@master] Update jar version of hdfs-cleaner script

https://gerrit.wikimedia.org/r/735001

Change 735429 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] re-enable hdfs-cleaner-gobblin

https://gerrit.wikimedia.org/r/735429

Change 735429 merged by Ottomata:

[operations/puppet@production] re-enable hdfs-cleaner-gobblin

https://gerrit.wikimedia.org/r/735429