Our current dbt development workflow relies on conda-analytics, which provides a prebuilt virtual environment with all the dependencies needed to run dbt. While this setup works and is widely used, it creates some complications:
- We need to modify conda-analytics if we want to introduce new dependencies.
- The CI/CD setup, and likely the production setup, will use Docker and a different configuration, creating inconsistencies across environments.
To improve this, we propose creating a local environment setup directly within the dbt repository. This environment will be designed to run consistently across local development, CI/CD pipelines, and production.
By defining the environment in the repo itself, developers will benefit from:
- A self-contained setup: everything needed to run dbt is included in the repository, without relying on external packaging.
- More consistency across environments: the same definition can back local development, CI/CD, and production.
- Faster dependency changes: adding or updating dependencies can be done directly in the dbt repo, without waiting for updates in conda-analytics.
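A minimal sketch of what the repo-local environment definition could look like, assuming a `pyproject.toml` in the dbt repository. All package names and version pins below are illustrative; the actual set would need to match what conda-analytics ships today:

```toml
[project]
name = "dbt-project"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    # Illustrative pins only; verify against the current conda-analytics environment
    "dbt-core~=1.8",
    "dbt-spark~=1.8",
    "sqlfluff~=3.0",
]
```

The same file would be read by local tooling, the CI/CD Docker image, and production, which is what gives the consistency described above.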
Some considerations:
- conda-analytics provides not only dependencies but also configuration that makes it possible to run a Spark session connected to the Hadoop cluster when running on Stat machines, probably paths to the Hive and Hadoop configuration files.
- A completely local setup may not be worthwhile for the Spark session, as it might require a full Spark installation; we can still rely on the one available on the Stat machines.
- A simpler solution may be to keep dbt-specific dependencies in the dbt repository (e.g. dbt and sqlfluff) and declare conda-analytics itself as a dependency. This would avoid modifying conda-analytics, but we would still rely on it to connect to the cluster.
- A local setup might require a Python dependency management tool (e.g. pip-tools, Poetry, or uv).
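If we keep relying on the Stat machines for Spark, the repo setup would mainly need to export the cluster configuration paths that conda-analytics presumably sets today. A hedged sketch, e.g. as an `.env` file or activation script; the paths below are assumptions and would need to be verified against the actual conda-analytics configuration:

```shell
# Assumed locations of cluster configs on Stat machines;
# verify against what conda-analytics actually configures.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf
# Reuse the Spark installation already present on the Stat machines
# instead of shipping our own.
export SPARK_HOME=/usr/lib/spark
```

Keeping these as explicit, versioned settings in the dbt repo would make the cluster dependency visible, even if the files they point to still come from the Stat machines.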