Spike [2019-2020 work] Oozie Replacement. Airflow Study / Argo Study
Closed, Resolved. Public

Description

Can Airflow substitute all of our various scheduling tools:

  • reportupdater
  • oozie
  • spark refine
  • some systemd timers
  • and ONE more!

Event Timeline

Milimetric triaged this task as Medium priority. Feb 28 2019, 5:45 PM
Milimetric updated the task description.
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

I just read a bunch of Airflow docs, and I'm not really sure how easy it will be to replace oozie. For all the others, it might be great! I haven't yet seen an ability to trigger runs based on dataset existence, but perhaps I'm just missing it.

I also just noted that it has (experimental?) 'lineage' support, which helps with keeping track of data lineage and governance, and integrates with Apache Atlas. This might be relevant to some Better Use of Data use cases.

Ok, Joseph clued me into Airflow Sensors, which do indeed seem to do what we need.

https://github.com/apache/airflow/tree/master/airflow/sensors
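
To make the "trigger on dataset existence" idea concrete, here is a minimal plain-Python sketch of the polling pattern a sensor follows. It is not Airflow's API: the function names, the `_SUCCESS` flag file convention, and the default intervals are all assumptions made for illustration.

```python
import os
import time


def dataset_ready(partition_dir, flag_name="_SUCCESS"):
    """Return True once the done-flag file exists in the partition directory.

    The "_SUCCESS" flag name is an assumption for this sketch.
    """
    return os.path.isfile(os.path.join(partition_dir, flag_name))


def wait_for_dataset(partition_dir, poke_interval=1.0, timeout=10.0, sleep=time.sleep):
    """Check the dataset every poke_interval seconds until it is ready.

    Returns True when the flag appears, or False once roughly `timeout`
    seconds of waiting have passed without it appearing.
    """
    waited = 0.0
    while True:
        if dataset_ready(partition_dir):
            return True
        if waited >= timeout:
            return False
        sleep(poke_interval)
        waited += poke_interval
```

A downstream job would only start once `wait_for_dataset(...)` returns True, which is roughly the behaviour we get today from oozie's dataset dependencies.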

FYI, RelEng is considering using Argo for CI in Kubernetes. Argo looks like it has some similarities with Airflow:

https://github.com/argoproj/argo/issues/849

Ottomata renamed this task from Spike [2019-2020 work] Airflow Study to Spike [2019-2020 work] Oozie Replacement study (Airflow, Argo, Pachyderm, Kubernetes, etc.). Oct 30 2019, 4:32 PM
Nuria renamed this task from Spike [2019-2020 work] Oozie Replacement study (Airflow, Argo, Pachyderm, Kubernetes, etc.) to Spike [2019-2020 work] Ozie Replacement. Airflow Study / Argo Study. Oct 30 2019, 4:33 PM
Ottomata renamed this task from Spike [2019-2020 work] Ozie Replacement. Airflow Study / Argo Study to Spike [2019-2020 work] Oozie Replacement. Airflow Study / Argo Study. Oct 30 2019, 5:38 PM

Since the search team is managing a trial airflow setup, perhaps we should use their setup for this spike? We could try to replicate some existing use cases in Airflow. Perhaps:

  • webrequest load + druid 128 load
  • Refine

These are a bit different, but cover a lot of what we do with oozie and systemd timers. It'd be a good sign if we can make Airflow do both well.

The Search's setup is very custom and not really re-usable IIUC; it would be really great to spend a bit of time improving what is currently in puppet and how Airflow is deployed (currently directly via scap in a Search gerrit repo, together with their code).

I like the idea of testing the above use cases, especially if we find a unified way to alarm. For example, the way that oozie notifies us about a problem is still an email, that is not great as we know, meanwhile timers leverage icinga.

The Search's setup is very custom and not really re-usable IIUC

I wouldn't want to actually replace our oozie&timer stuff, just try to do so and see if we can run things writing into scratch directories.

For example, the way that oozie notifies us about a problem is still an email, that is not great as we know, meanwhile timers leverage icinga.

I somehow doubt icinga will be the answer for us. Icinga doesn't allow for dynamic lasting alert statuses. In our current system, Hue is almost acting like Icinga for dataset generating jobs. We get an email from Oozie about a job failure, and then Hue shows us what has or hasn't failed. The systemd timer alerts in icinga only alert us on the most recent status of a job run.

But yeah, perhaps the Airflow UI will replace Oozie+Hue in a better way.

I somehow doubt icinga will be the answer for us. Icinga doesn't allow for dynamic lasting alert statuses.

Can you expand on this? It is not that clear to me what the end goal is. :)

What would Icinga look like if the webrequest load job had failures for 6 hourly datasets spread over the last month? We'd want an 'alert' on each of these failures.

Every job instance is an individual thing we'd want alerting on.

Or we could use an aggregator, that would say "at least one job failed etc.." and then use the Airflow UI to detect the failures like we do with Hue (but not sure if possible or ideal).

Oh yeah that would be good too! I just mean we wouldn't want to rely only on the aggregate alerts; Icinga won't work for us as the main alerting solution.
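
For reference, the three alerting views discussed above can be compared in a small sketch. Everything here is hypothetical: the `JobRun` record, the function names, and the sample data are made up for illustration and do not reflect how Icinga, Hue, or Airflow actually model job runs.

```python
from collections import namedtuple

# Hypothetical record for one hourly job instance.
JobRun = namedtuple("JobRun", ["job", "hour", "ok"])


def latest_status_alert(runs):
    """Alert only if the most recent run failed (the systemd timer style check)."""
    latest = max(runs, key=lambda r: r.hour)
    return not latest.ok


def per_instance_alerts(runs):
    """Return every failed run, so each failure stays visible until it is fixed."""
    return [r for r in runs if not r.ok]


def aggregate_alert(runs):
    """Alert if at least one run failed, without saying which one."""
    return any(not r.ok for r in runs)


# Made-up sample: two old failures, but the most recent run succeeded.
runs = [
    JobRun("webrequest-load", "2019-10-03T04", False),
    JobRun("webrequest-load", "2019-10-11T17", False),
    JobRun("webrequest-load", "2019-10-20T06", True),
    JobRun("webrequest-load", "2019-10-20T07", True),
]

failed_hours = [r.hour for r in per_instance_alerts(runs)]
```

With this sample, `latest_status_alert(runs)` is False even though two datasets are still missing, `aggregate_alert(runs)` is True but does not say which hours failed, and only `per_instance_alerts(runs)` lists the hours that need a rerun.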

Ottomata changed the task status from Duplicate to Resolved. May 19 2021, 8:17 PM
Ottomata closed this task as a duplicate of T241246: Spike: POC of refine with airflow.
Ottomata added subscribers: razzi, mforns.