
Define and implement archiving for Airflow
Closed, ResolvedPublic

Description

One regular set of actions taken by Oozie is to archive the output of a job. This means:

  • Checking the correctness of the provided "source":
      • it should exist
      • it should be a directory
      • it should possibly contain a DONE flag file
      • it should contain only a single non-empty data file (possibly matched by a pattern)
  • Creating the destination folder if it doesn't exist (with the correct umask)
  • Moving the single non-empty data file to the destination folder, applying a renaming pattern
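The steps above can be sketched in Python. This is only an illustration using the local filesystem as a stand-in for HDFS (the real job operates on HDFS); the function name, parameters, and the `_SUCCESS` flag default are assumptions, not the merged job's actual interface:

```python
from pathlib import Path


def archive_job_output(source: str, dest_dir: str, dest_name: str,
                       done_flag: str = "_SUCCESS") -> Path:
    """Illustrative sketch of the oozie-style archiving steps.

    Uses the local filesystem; the real job would go through an
    HDFS client and set permissions according to the desired umask.
    """
    src = Path(source)
    # 1. The source must exist and be a directory.
    if not src.is_dir():
        raise ValueError(f"{source} is not an existing directory")
    # 2. Optionally require a DONE flag file.
    if done_flag and not (src / done_flag).exists():
        raise ValueError(f"missing {done_flag} flag in {source}")
    # 3. There must be exactly one non-empty data file (flag files excluded).
    data_files = [p for p in src.iterdir()
                  if p.is_file() and p.stat().st_size > 0 and p.name != done_flag]
    if len(data_files) != 1:
        raise ValueError(f"expected one non-empty file, found {len(data_files)}")
    # 4. Create the destination folder if it doesn't exist.
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # 5. Move the data file to the destination under its new name.
    target = dest / dest_name
    data_files[0].rename(target)
    return target
```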

This code is currently handled by: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/util/archive_job_output

Open question: do we wish to create a Scala or Python script to make this happen, or do we wish to embed the code in an Airflow operator?
I think that in any case we want an Airflow operator, and I'd go for having a dedicated Python or Scala script for this, but I'm very open to other opinions :)
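If the dedicated-script route is taken, the operator would mostly shell out to a small CLI. A minimal sketch of what such a script's argument parsing might look like, with flag names that are purely illustrative (not the interface of the merged job):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI for an archive script; flag names are assumptions.
    parser = argparse.ArgumentParser(
        description="Archive a job's single output file to a destination folder")
    parser.add_argument("--source", required=True,
                        help="directory containing the job output")
    parser.add_argument("--dest-dir", required=True,
                        help="archive directory, created if missing")
    parser.add_argument("--dest-name", required=True,
                        help="name of the archived file")
    parser.add_argument("--done-flag", default="_SUCCESS",
                        help="flag file that must exist in the source")
    return parser
```

An Airflow operator (or a BashOperator task) could then invoke this script with templated dates in `--dest-name`, keeping the archiving logic testable outside Airflow.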

Details

Related Changes in Gerrit:

Event Timeline

@Antoine_Quhen -> This has become an implementation task yeah?

Yes, it is also an implementation now.

Change 774383 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Add archiving job

https://gerrit.wikimedia.org/r/774383

Change 774383 merged by jenkins-bot:

[analytics/refinery/source@master] Add archiving job for Airflow

https://gerrit.wikimedia.org/r/774383