
[Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style)
Closed, ResolvedPublic

Description

We should evaluate, discuss and decide how we want to implement our DAG/task dependencies.
We can keep track of the findings and discuss asynchronously here in this task.

Event Timeline

@JAllemandou @Antoine_Quhen
This is the task where we can continue discussions on success files vs. dag dependencies!

OK, I'm going to start this conversation :]

I argue in favor of success files.

  • We need a mechanism to trigger jobs on the presence of success files in any case, for data sets generated outside of Airflow, and also while the Oozie migration is happening (see the sensor sketch after this list).
  • Success files are a technology-agnostic way of keeping state.
  • Intuitively, success files seem more robust to me. Keeping state in Airflow's database might be affected by changes in DAGs, like changing the dag_id, task_id, maybe the order of tasks, or deleting a DAG, whereas keeping state in an HDFS file seems pretty robust.
  • Spark works with success files, and Spark is the tool that should be most used in Airflow.
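
To illustrate the pull side of the first bullet, here is a minimal sketch of a data trigger on a _SUCCESS file. This assumes a stock PythonSensor plus the apache-hdfs provider's WebHDFSHook; the dag/task ids and the path are made up, and the actual HDFS client/hook is whatever we standardize on.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.python import PythonSensor


def _success_file_present(path: str) -> bool:
    # Assumption: we poll HDFS via the apache-hdfs provider's WebHDFSHook;
    # swap in whatever HDFS client/hook we actually standardize on.
    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook
    return WebHDFSHook().check_for_path(path)


with DAG(
    dag_id="example_success_file_trigger",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull-style data trigger: poke until the upstream dataset's _SUCCESS
    # flag exists, then run our own computation.
    wait_for_upstream = PythonSensor(
        task_id="wait_for_upstream_success",
        python_callable=_success_file_present,
        op_args=["/wmf/data/example_dataset/snapshot=2022-01/_SUCCESS"],  # placeholder path
        poke_interval=timedelta(minutes=10).total_seconds(),
        timeout=timedelta(hours=6).total_seconds(),
        mode="reschedule",  # release the worker slot between pokes
    )

    compute = BashOperator(
        task_id="compute",
        bash_command="echo 'run the Spark job here'",
    )

    wait_for_upstream >> compute
```

The same pattern applies whether the upstream dataset is produced by another Airflow DAG, by a not-yet-migrated Oozie job, or by something outside our pipelines entirely.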

Should we use them always, by default? I'd say let's not. Let's add them as needed.

  • Success files created by default might add unnecessary complexity to our DAGs.
  • We can always add them later if necessary, without big refactors.
  • I'm not sure we can write success files automatically: not all success files should go to the same place; some go at the leaf level of the directory tree, others go to a parent directory.

Thoughts!?!?!?

Some good points about using DAG dependencies are:

  • When one job triggers another via push, there is less waiting time between them, and the mechanism is provided by Airflow (see the sketch after this list).
  • It allows cascading triggering of jobs to build or rebuild the dependent datasets.
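
For reference, the push flavour in stock Airflow would look roughly like the TriggerDagRunOperator sketch below (the pull counterpart being ExternalTaskSensor). This is only a sketch: the dag ids are made up and this is not how any existing job is wired.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Producer DAG: once its own work is done, it pushes a run of a
# (made-up) consumer DAG instead of the consumer polling for data.
with DAG(
    dag_id="example_producer",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as producer:
    generate = BashOperator(
        task_id="generate_dataset",
        bash_command="echo 'produce the dataset'",
    )

    trigger_consumer = TriggerDagRunOperator(
        task_id="trigger_consumer",
        trigger_dag_id="example_consumer",  # made-up downstream dag_id
        wait_for_completion=False,          # fire-and-forget push
    )

    generate >> trigger_consumer
```

Chaining such triggers, or (I believe) pairing ExternalTaskSensor with ExternalTaskMarker so that clearing a marked producer task recursively clears the consumer side, is roughly what would give the cascading build/rebuild mentioned in the second bullet.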

Because HDFS may not be the end of the road (object store?), we may consider:

  • Kafka to store events about job processes
  • Or another data store (Redis?)
  • Or DataHub could be used to keep track of the state of the datasets

Anyhow, the push mechanism and the cascading effect could be computed outside of Airflow.

Another point about success files: they do not carry much information about the state of the current directory, like the date of creation, the dates of updates, the process used to create them, the time taken, the trigger, ...
And they are sometimes updated with recursive touch. In this case, we lose information.

All in all, I think we could have some canary jobs implementing DAG dependencies. We are always surprised by Airflow, in good ways and in bad. ;)

Should we use them always, by default? I'd say let's not. Let's add them as needed.

+1. It might be nice to always add them at the end of a DAG's output, in places where that makes sense. Then other unforeseen jobs could potentially rely on them. Writing the _SUCCESS files does not mean you can't use inter-DAG dependencies too!

Another point about success files: they do not carry much information about the state of the current directory

Refine writes the 'refined at' binary timestamp into its _REFINED status file. I think that most Hadoop-y things ignore files starting with underscore, but I'm not actually sure if everything does or what mechanism does that. But, if that is true, we could choose to carry more state in the _SUCCESS file, perhaps even some JSON!
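
If we did go that way, the flag file's contents are free-form, so the completion step could simply serialize a small JSON payload into it. A minimal sketch follows; the helper name, fields and local-filesystem write are illustrative only, a real task would write to HDFS through a hook/client.

```python
import json
from datetime import datetime, timezone


def write_success_with_state(path: str, dag_id: str, run_id: str) -> None:
    # Illustrative only: write a small JSON payload into the _SUCCESS flag
    # instead of an empty file, so consumers can read back some provenance.
    state = {
        "written_at": datetime.now(timezone.utc).isoformat(),
        "dag_id": dag_id,
        "run_id": run_id,
    }
    with open(path, "w") as f:
        json.dump(state, f)


# Would produce something like:
# {"written_at": "...", "dag_id": "example_producer", "run_id": "manual__2022-01-01"}
write_success_with_state("/tmp/_SUCCESS", "example_producer", "manual__2022-01-01")
```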

I like the idea of not relying on Airflow for job dependencies, as this would put us in a very Airflow-coupled position.

Also, the triggers we choose don't need to be _SUCCESS files, we have for instance started using Hive partitions as triggers for many jobs.
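
(For concreteness, that pattern maps onto the stock HivePartitionSensor roughly as below; the table name and partition spec are placeholders, not an actual job.)

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

with DAG(
    dag_id="example_hive_partition_trigger",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Data trigger on a Hive partition instead of a _SUCCESS file:
    # the sensor asks the metastore whether the partition exists yet.
    wait_for_partition = HivePartitionSensor(
        task_id="wait_for_source_partition",
        table="example_db.example_table",  # placeholder table
        partition=(
            "year={{ execution_date.year }} AND month={{ execution_date.month }} "
            "AND day={{ execution_date.day }} AND hour={{ execution_date.hour }}"
        ),
        poke_interval=timedelta(minutes=5).total_seconds(),
        mode="reschedule",
    )
```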

About the benefits of using DAG dependencies, the time aspect (using push instead of pull) is, I think, negligible for us. The cascading triggering is however very interesting! If we can cascade-rerun all children DAGs of a given DAG, that would be a reason for me to sign up for using DAG dependencies in addition to data triggers :)

Ya, which is maybe a reason to do both! Especially for the final output of a DAG. Some jobs (like those outside of our own airflow instance?) can use the _SUCCESS files, while others can use DAG deps.

It allows cascading triggering of jobs to build or rebuild the dependent datasets.

Agree, that would be really valuable!

So, as a recap:

  • It seems we need the ability for Airflow jobs to generate success files in any case. Plus, we have already implemented and deployed the URLTouchOperator, which does exactly that.
  • What remains to be seen is whether we still want to use DAG/task dependencies. If cascading triggering of jobs works well for us, that would be a sufficient reason to use them.

Conclusion:

  • I think we should prioritize doing a time-boxed spike at some point to test cascading of DAG/task dependencies. If that works as we expect, we can go ahead and modify the existing DAGs that would benefit from DAG/task dependencies, and also document the use of DAG/task dependencies on Wikitech, so that new jobs can make proper use of them.

Adding something to check while looking at DAG/task dependencies: how those feed into the data catalog.

I think the decision depends on research that we have not yet done. We should do a time-boxed spike to test cascading of DAG/task dependencies, taking into consideration how those feed into the data catalog.
Alternatively, we could just continue the migration without creating DAG/task dependencies (like we've done so far), and then in the future, if we need to improve pipeline dependency management, we can do the spike and modify the jobs if feasible!
I lean a bit towards the latter, but let me know what you think, team!

lbowmaker claimed this task.