Explore a local dbt environment setup (independent from Conda)
Closed, Resolved · Public

Description

Our current dbt development workflow relies on conda-analytics, which provides a prebuilt virtual environment with all the dependencies needed to run dbt. While this setup works and is widely used, it creates some complications:

  • We need to modify conda-analytics if we want to introduce new dependencies.
  • The CI/CD pipeline, and likely the production deployment, will run on Docker with a different setup, creating inconsistencies between environments.

To improve this, we propose creating a local environment setup directly within the dbt repository. This environment will be designed to run consistently across local development, CI/CD pipelines, and production.

By defining the environment in the repo itself, developers will benefit from:

  • A self-contained setup: everything needed to run dbt is included in the repository, without relying on external packaging.
  • More consistency across environments.
  • Simpler dependency management: dependencies can be added or updated directly in the dbt repo, without waiting for conda-analytics updates.

Some considerations:

  • conda-analytics provides not only dependencies but also configuration that makes it possible to run a Spark session connected to the Hadoop cluster when running on stat machines, probably paths to the Hive and Hadoop configuration files.
  • A fully local setup may not be worthwhile for the Spark session, as it might require a local Spark installation; we can still rely on the one available on the stat machines.
  • A simpler solution might be to keep dbt-specific dependencies (like dbt and sqlfluff) in the dbt repository and treat conda-analytics as a dependency. This would avoid modifying conda-analytics, but we would still rely on it to connect to the cluster.
  • A local setup might require a dependency-management tool for Python (e.g. Poetry).
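As a sketch of that last point, a Poetry-managed pyproject.toml committed to the dbt repo could look like the following. The project name, Python version, and version constraints are assumptions for illustration, not the repo's actual configuration:

```toml
[tool.poetry]
name = "dbt-project"        # hypothetical project name
version = "0.1.0"
description = "Local dbt environment, independent from conda-analytics"

[tool.poetry.dependencies]
python = "^3.10"            # assumed Python version
dbt-core = "^1.7"           # assumed version constraints
dbt-spark = "^1.7"
sqlfluff = "^3.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

With a file like this in place, `poetry install` resolves and installs everything into a project-local virtual environment, and adding a dependency becomes a one-line change in the dbt repo.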

Event Timeline

Added some documentation in a README in an MR. conda-analytics provides not only dependencies but also some environment variables.

Running dbt with Poetry on the stat machines works, but the Hive, Hadoop, and Spark configuration directories need to be set.

poetry install

export HIVE_CONF_DIR=/etc/hive/conf
export SPARK_CONF_DIR=/etc/spark3/conf
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/usr/lib/spark3

poetry run dbt run

This sequence works on the stat machines: poetry install creates the virtual environment, the exports point it at the cluster configuration that conda-analytics normally provides, and poetry run dbt run executes dbt inside that environment.
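The exports could be centralized in a small script committed to the repo, so every developer and CI job sets the same variables. A minimal sketch (the file name is hypothetical; the default paths are the stat-machine values above):

```shell
#!/bin/sh
# env.sh (hypothetical): cluster configuration normally provided by conda-analytics.
# Each variable keeps any value already set in the environment, falling back
# to the stat-machine defaults.
export HIVE_CONF_DIR="${HIVE_CONF_DIR:-/etc/hive/conf}"
export SPARK_CONF_DIR="${SPARK_CONF_DIR:-/etc/spark3/conf}"
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
export SPARK_HOME="${SPARK_HOME:-/usr/lib/spark3}"
```

Usage on a stat machine would then be `. ./env.sh && poetry run dbt run`, and a Docker-based CI job could override the defaults by setting the variables before sourcing the script.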