Page MenuHomePhabricator

Create a `DbtSkeinOperator` in the Airflow `wmf_airflow_common` library
Open, Needs TriagePublic

Description

Create a new kind of Airflow operator class that inherits from the SimpleSkeinOperator and is specialized for running dbt models from the dbt-jobs repository via Airflow.

This class should:

  • Provide a sane, working default configuration of the SimpleSkeinOperator that facilitates configuring and running the dbt jobs in our cluster.
  • Provide a way of configuring the specific dbt model(s) for the operator to run.
  • Provide a way of configuring the GitLab repository with dbt models, dbt-jobs being the default.
  • Provide a way of configuring the Spark session-specific settings, e.g. spark.executor.cores and such.

Event Timeline

Suggestion: Instead a new DbtSkeinOperator inheriting from SimpleSkeinOperator, consider making a new DbtOperator that uses a new DbtSkeinHook (itself probably using SkeinHookBuilder?) to run the skein+dbt commands.

This is how our SparkSubmitOperator works. This allows the 'launcher' used to be varied by a parameter, making it easier to use in different environments, e.g. development, yarn, k8s, etc.

@Ottomata I'm looking into that, thanks for the suggestion!

@amastilovic apologies for late/drive by review, but I left some comments, especially around the use of hardcoded /tmp vs tempfile.

I see you ended up going with DbtSkeinOperator inheriting from SimpleSkeinOperator. I'm curious to learn why you chose that over a DbtOperator and using the SkeinHook and SkeinHookBuilder to implement this.

One of the comments I added to the MR is about the fact that users can use your new DbtSkeinOperator as if it was a SimpleSkeinOperator. I'm not sure if that is intentional.

I see you ended up going with DbtSkeinOperator inheriting from SimpleSkeinOperator. I'm curious to learn why you chose that over a DbtOperator and using the SkeinHook and SkeinHookBuilder to implement this.

Mostly for the reasons of expediency - inheriting from SimpleSkeinOperator was a much quicker and well-tested route. Also, SimpleSkeinOperator itself is using SkeinHook and its builder. I might be missing something but from what I could see I would basically end up duplicating that same code in a DbtOperator.

Another reason is that this DbtSkeinOperator is actually a simple Skein/YARN application - it executes DBT commands in YARN using Skein.

Understood.

Its just unfortunate that when we move to launching dbt via k8s, we will have to build a new operator and edit dags that use DbtSkeinOperator. If we had done a DbtOperator with a launcher like param, this could be changed via params/config.

Also, SimpleSkeinOperator itself is using SkeinHook and its builder. I might be missing something but from what I could see I would basically end up duplicating that same code in a DbtOperator.

Indeed, you'd probably duplicate much of it, but only the relevant parts.