One set of actions regularly performed by Oozie is archiving the output of a job. This means:
- Checking the correctness of the provided "source":
  - it should exist
  - it should be a directory
  - it should possibly contain a DONE file flag
  - it should contain only a single non-empty file (possibly with pattern matching)
- Creating the destination folder if it doesn't exist (with the correct umask)
- Moving the single non-empty data file to the destination folder, with a renaming pattern
This logic is currently handled by: https://github.com/wikimedia/analytics-refinery/tree/master/oozie/util/archive_job_output
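To make the discussion concrete, here is a minimal Python sketch of the steps listed above. It is only an illustration under some loud assumptions: it works on the local filesystem through `pathlib` (the real job output lives on HDFS, so the file calls would need to be swapped for an HDFS client), the renaming pattern is simplified to a plain destination file name, and the function name and all of its parameters (`pattern`, `done_flag`, `umask`, ...) are hypothetical, not taken from the existing Oozie workflow.

```python
import os
import shutil
from pathlib import Path


def archive_job_output(source_dir, dest_dir, dest_name,
                       pattern="*", done_flag=None, umask=0o022):
    """Move the single non-empty file found in source_dir to dest_dir/dest_name.

    Hypothetical helper: local-filesystem stand-in for the HDFS logic.
    """
    source = Path(source_dir)

    # 1. Check the correctness of the source.
    if not source.exists():
        raise FileNotFoundError(f"source does not exist: {source}")
    if not source.is_dir():
        raise NotADirectoryError(f"source is not a directory: {source}")
    if done_flag is not None and not (source / done_flag).is_file():
        raise RuntimeError(f"missing done flag {done_flag!r} in {source}")

    # There must be exactly one non-empty file matching the pattern
    # (the done flag itself is never a candidate).
    candidates = [
        p for p in source.glob(pattern)
        if p.is_file() and p.name != done_flag and p.stat().st_size > 0
    ]
    if len(candidates) != 1:
        raise RuntimeError(
            f"expected exactly 1 non-empty file, found {len(candidates)}"
        )

    # 2. Create the destination folder if needed, under the requested umask.
    dest = Path(dest_dir)
    previous_umask = os.umask(umask)
    try:
        dest.mkdir(parents=True, exist_ok=True)
    finally:
        os.umask(previous_umask)  # always restore the process umask

    # 3. Move the data file to the destination under its new name.
    target = dest / dest_name
    shutil.move(str(candidates[0]), str(target))
    return target
```

A function along these lines could live in a standalone script and also be called from Airflow, for example through a `PythonOperator` or a small custom operator wrapping it.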
Open question: Do we wish to create a Scala or Python script to make this happen, or do we wish to embed the code in an Airflow operator?
I think that in any case we want an Airflow operator, and I'd go for a dedicated Python or Scala script for this, but I'm very open to other opinions :)