
[Airflow] Research, discuss and decide on DAG/task dependencies VS. success/failure files (Oozie style)
Closed, ResolvedPublic

Description

We should evaluate, discuss and decide how we want to implement our DAG/task dependencies.
We can keep track of the findings and discuss asynchronously here in this task.

Event Timeline

@JAllemandou @Antoine_Quhen
This is the task where we can continue discussions on success files vs. dag dependencies!

OK, I'm going to start this conversation :]

I argue in favor of success files.

  • We need a mechanism to trigger jobs on the presence of success files in any case, for data sets generated outside of Airflow, and also while the Oozie migration is happening (see the sensor sketch after this list).
  • Success files are a technology-agnostic way of keeping state.
  • Intuitively, success files seem more robust to me. Keeping state in Airflow's database might be affected by changes in DAGs, like changing the dag_id, task_id, maybe the order of tasks, or deleting a DAG, whereas keeping state in an HDFS file seems pretty robust.
  • Spark works with success files, and Spark is the tool that should be most used in Airflow.
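
To illustrate the pull side of the first bullet, here is a minimal sketch of a data trigger on a _SUCCESS file. This assumes a stock PythonSensor plus the apache-hdfs provider's WebHDFSHook; the dag/task ids and the path are made up, and the actual HDFS client/hook is whatever we standardize on.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.python import PythonSensor


def _success_file_present(path: str) -> bool:
    # Assumption: we poll HDFS via the apache-hdfs provider's WebHDFSHook;
    # swap in whatever HDFS client/hook we actually standardize on.
    from airflow.providers.apache.hdfs.hooks.webhdfs import WebHDFSHook
    return WebHDFSHook().check_for_path(path)


with DAG(
    dag_id="example_success_file_trigger",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Pull-style data trigger: poke until the upstream dataset's _SUCCESS
    # flag exists, then run our own computation.
    wait_for_upstream = PythonSensor(
        task_id="wait_for_upstream_success",
        python_callable=_success_file_present,
        op_args=["/wmf/data/example_dataset/snapshot=2022-01/_SUCCESS"],  # placeholder path
        poke_interval=timedelta(minutes=10).total_seconds(),
        timeout=timedelta(hours=6).total_seconds(),
        mode="reschedule",  # release the worker slot between pokes
    )

    compute = BashOperator(
        task_id="compute",
        bash_command="echo 'run the Spark job here'",
    )

    wait_for_upstream >> compute
```

The same pattern applies whether the upstream dataset is produced by another Airflow DAG, by a not-yet-migrated Oozie job, or by something outside our pipelines entirely.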

Should we use them always, by default? I'd say let's not. Let's add them as needed.

  • Success files created by default might add unnecessary complexity to our DAGs.
  • We can always add them later if necessary, without big refactors.
  • I'm not sure we can write success files automatically: not all success files should go to the same place; some go at the leaf level of the directory tree, others go to a parent directory.

Thoughts!?!?!?

Some good points about using DAG dependencies are:

  • When one job triggers another via push, there is less waiting time between them, and the mechanism is provided by Airflow (see the sketch after this list).
  • It allows cascading triggering of jobs to build or rebuild the dependent datasets.
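
For reference, the push flavour in stock Airflow would look roughly like the TriggerDagRunOperator sketch below (the pull counterpart being ExternalTaskSensor). This is only a sketch: the dag ids are made up and this is not how any existing job is wired.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# Producer DAG: once its own work is done, it pushes a run of a
# (made-up) consumer DAG instead of the consumer polling for data.
with DAG(
    dag_id="example_producer",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as producer:
    generate = BashOperator(
        task_id="generate_dataset",
        bash_command="echo 'produce the dataset'",
    )

    trigger_consumer = TriggerDagRunOperator(
        task_id="trigger_consumer",
        trigger_dag_id="example_consumer",  # made-up downstream dag_id
        wait_for_completion=False,          # fire-and-forget push
    )

    generate >> trigger_consumer
```

Chaining such triggers, or (I believe) pairing ExternalTaskSensor with ExternalTaskMarker so that clearing a marked producer task recursively clears the consumer side, is roughly what would give the cascading build/rebuild mentioned in the second bullet.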

Because HDFS may not be the end of the road (object store?), we may consider:

  • Kafka to store events about job processes
  • Or another data store (Redis?)
  • Or DataHub could be used to keep track of the state of the datasets

Anyhow, the push mechanism and the cascading effect could be computed outside of Airflow.

Another point about success files: they do not carry much information about the state of the current directory, like the date of creation, the dates of updates, the process used to create them, the time taken, the trigger, ...
And they are sometimes updated with recursive touch. In this case, we lose information.

All in all, I think we could have some canary jobs implementing DAG dependencies. We are always surprised by Airflow, in good ways and in bad. ;)

Should we use them always, by default? I'd say let's not. Let's add them as needed.

+1. It might be nice to always add them at the end of a DAG's output, in places where that makes sense. Then other unforeseen jobs could potentially rely on them. Writing the _SUCCESS files does not mean you can't use inter-DAG dependencies too!

Another point about success files: they do not carry much information about the state of the current directory

Refine writes the 'refined at' binary timestamp into its _REFINED status file. I think that most Hadoop-y things ignore files starting with underscore, but I'm not actually sure if everything does or what mechanism does that. But, if that is true, we could choose to carry more state in the _SUCCESS file, perhaps even some JSON!
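
If we did go that way, the flag file's contents are free-form, so the completion step could simply serialize a small JSON payload into it. A minimal sketch follows; the helper name, fields and local-filesystem write are illustrative only, a real task would write to HDFS through a hook/client.

```python
import json
from datetime import datetime, timezone


def write_success_with_state(path: str, dag_id: str, run_id: str) -> None:
    # Illustrative only: write a small JSON payload into the _SUCCESS flag
    # instead of an empty file, so consumers can read back some provenance.
    state = {
        "written_at": datetime.now(timezone.utc).isoformat(),
        "dag_id": dag_id,
        "run_id": run_id,
    }
    with open(path, "w") as f:
        json.dump(state, f)


# Would produce something like:
# {"written_at": "...", "dag_id": "example_producer", "run_id": "manual__2022-01-01"}
write_success_with_state("/tmp/_SUCCESS", "example_producer", "manual__2022-01-01")
```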

I like the idea of not relying on Airflow for job dependencies, as this would put us in a very Airflow-coupled position.

Also, the triggers we choose don't need to be _SUCCESS files, we have for instance started using Hive partitions as triggers for many jobs.
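
(For concreteness, that pattern maps onto the stock HivePartitionSensor roughly as below; the table name and partition spec are placeholders, not an actual job.)

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor

with DAG(
    dag_id="example_hive_partition_trigger",  # made-up dag_id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Data trigger on a Hive partition instead of a _SUCCESS file:
    # the sensor asks the metastore whether the partition exists yet.
    wait_for_partition = HivePartitionSensor(
        task_id="wait_for_source_partition",
        table="example_db.example_table",  # placeholder table
        partition=(
            "year={{ execution_date.year }} AND month={{ execution_date.month }} "
            "AND day={{ execution_date.day }} AND hour={{ execution_date.hour }}"
        ),
        poke_interval=timedelta(minutes=5).total_seconds(),
        mode="reschedule",
    )
```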

About the benefits of using DAG dependencies, the time aspect (using push instead of pull) is, I think, negligible for us. The cascading triggering is however very interesting! If we can cascade-rerun all children DAGs of a given DAG, that would be a reason for me to sign up for using DAG dependencies in addition to data triggers :)

Ya, which is maybe a reason to do both! Especially for the final output of a DAG. Some jobs (like those outside of our own airflow instance?) can use the _SUCCESS files, while others can use DAG deps.

It allows cascading triggering of jobs to build or rebuild the dependent datasets.

Agree, that would be really valuable!

So, as a recap:

  • It seems we need the ability for Airflow jobs to generate success files in any case. Plus, we have already implemented and deployed the URLTouchOperator, which does exactly that.
  • What remains to be seen is whether we still want to use DAG/task dependencies. If cascading triggering of jobs works well for us, that would be a sufficient reason to use them.

Conclusion:

  • I think we should prioritize doing a time-boxed spike at some point to test cascading of DAG/task dependencies. If that works as we expect, we can go ahead and modify the existing DAGs that would benefit from DAG/task dependencies, and also document the use of DAG/task dependencies on Wikitech, so that new jobs can make proper use of them.

Adding something to check while looking at DAG/task dependencies: how those feed into the data catalog.

I think the decision depends on research that we have not yet done. We should do a time-boxed spike to test cascading of DAG/task dependencies, taking into consideration how those feed into the data catalog.
Alternatively, we could just continue the migration without creating DAG/task dependencies (like we've done so far), and then in the future, if we need to improve pipeline dependency management, we can do the spike and modify the jobs if feasible!
I lean a bit towards the latter, but let me know what you think, team!

lbowmaker claimed this task.