Page MenuHomePhabricator

Sqlufluff on Stat hosts
Closed, ResolvedPublic

Description

dbt users might be using Stat hosts to work on their code. dbt command is available on the Stat hosts through the Conda Analytics environment, but the sqlfluff command is not available.

We need to enable sqlfluff to be run in the Stat hosts.

Sqlfluff is already configured in the dbt-jobs repository through Poetry, but Poetry is not installed in the Stat hosts.

Some approaches could be:

  • Install sqlfluff in the Conda environment
  • Install Poetry in the Stat hosts
  • Describe in the Readme.md how to install Poetry in Stat hosts, so each developer can install it on their environment.

Sqlfluff will run with poetry run sqlfluff lint in case Poetry is available.

Details

Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Sqlfluff on Stat Hostsrepos/data-engineering/dbt-jobs!21javiermontonfeature/add-sqlfluff-to-stat-hostsmain
Customize query in GitLab

Related Objects

StatusSubtypeAssignedTask
OpenNone
ResolvedJMonton-WMF

Event Timeline

I'd be in favor of the second option, installing Poetry (or pipx for that matter) on the Stat hosts. This would enable Stat machine users to safely install many different Python apps/packages, not just sqlfluff.

Install Poetry in the Stat hosts

+1 to install poetry in conda-analytics (if that is what you mean!)

I can see the appeal of adding both pipx and poetry to stat hosts, but I would be worried about a sort of inception like layering of virtual environments.
To my mind, this might get messy really quickly, especially if we have too many options.
The docs for poetry say this:

image.png (797×244 px, 43 KB)

So, going by this, we wouldn't want to add it to the base operating system, even though the package is available in trixie.

If we were to add it to conda-analytics, that might work, but we're already getting into enabling virtual environments with either the base conda-analytics environment, or a custom environment created by conda-analytics-clone mycoolenv

So although I'm not exactly a big fan of conda, I'd rather that we simply go with option 1 and add sqlfluff to conda-analytics in conda-environment.yml.

We could add the version from conda-forge sqlfluff or the version from PyPi: sqlfluff.

This would enable Stat machine users to safely install many different Python apps/packages, not just sqlfluff.

@amastilovic - I think that you can already install lots of different Python apps and packages, can't you? It's just that conda-analytics is currently the framework that we install at the operating system level to give people this functionality of creating virtual environments and easily switching between them.

If you want, you can even experiment with installing poetry and/or pipx within one or more of your conda envionments.
I just feel that:

  • if we try to do this automatically
  • ...for every user of every stat box
  • ...especially when we share home directories between stat boxes
  • ...and maybe also with JupyterLab on k8s
  • ...or kubeflow

...then the convenience of having pipx or poetry just work, might vanish and be replaced with dependency hell.

Maybe I'm over-thinking it, so if you feel that this is a way for us collectively to get out of conda-analytics, then please feel free to tell me that I'm talking rubbish.

@amastilovic - I think that you can already install lots of different Python apps and packages, can't you? It's just that conda-analytics is currently the framework that we install at the operating system level to give people this functionality of creating virtual environments and easily switching between them.

Oh we absolutely can already do it with Conda, using either packages from Conda-forge or within our own virtual environments. My comments were mostly pointing at pipx which makes it very easy to escape the hell of manually created virtual environments by installing every Python app within its own isolated environment. That's the recommended way of installing Poetry, too, hence that warning message about dedicated virtual environments you referenced.

The pipx workflow is a lot simpler IMHO, although for sure not necessary considering our current environment.

I'm sure I'm way off here, but I sort of understood that poetry was going to be used to install packages into a conda env, similar to how pip can be used to install into conda envs as well. Conda is used to manage the env, and pip/poetry can be used to install (non-conda) packages.

I'd also be reluctant to officially adopt another environment manager without a good evaluation.

However, a more straightforward solution might be: don't use poetry.

I'm sure poetry is very nice! So is gradle! And so is sbt! For JVM world, we explicitly decided [citation needed] to go with the boring & more standard packaging tech: maven.

For similar reasons, we decided to just use setuptools (modern-ish with setup.cfg and pyproject.toml) for mediawiki-event-enrichment and eventutilities-python.

These docs were never finished, but the (years old now) intention for the 'paved path' for job repos + conda envs was setuptools: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Developer_guide/Python_Job_Repos#GitLab_Job_Repository_Setup

If installing these tools is a concern, and we don't need to provide sqlfluff to many people, maybe we can just add an explanation of how to install it in a new virtual environment, or as part of a Makefile, so anyone that needs it, can install it only for them.

I'm trying it, and checking if it works with our dbt-jobs repository, and I believe this would be the right approach.

conda-analytics-clone mydbt
source conda-analytics-activate mydbt
set_proxy
pip install sqlfluff
pip install sqlfluff-templater-dbt
pip install dbt-core
pip install dbt-duckdb
unset_proxy

Next times they only need to activate mydbt rather than base, and they can run sqlfluff lint

I think we could add these commands in a Makefile, so they can simply clone the repository and run "make install" or similar.

I can open an MR if that's ok.

Btw, just for reference:

I expounded a little more on why we chose conda + setuptools instead of poetry for managing and deploying python job repos:

https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/merge_requests/113#note_190561