
Conda-Analytics pinned file does not constrain Pip installations
Open, Needs Triage, Public

Description

Although Conda is our main tool for package and environment management on the stat hosts, it's sometimes necessary to use Pip as well (usually when particular packages are not available from Conda-Forge). Moreover, I suspect that many analysts default to using Pip for all package management because it's more familiar and it generally works even within a Conda-Analytics environment.

However, Pip is unaware of the Conda version pins we've set up to prevent users from accidentally breaking their environments.

I encountered this recently when trying to install Pandas-GBQ. The Conda-Forge version is outdated, so I fell back to Pip. However, in the process of installing the package's dependencies, Pip upgraded PyArrow and Numpy to their latest versions. This breaks several important workflows, like querying Spark (which is why we have the Conda pins: to prevent Conda from doing the same thing).

This seems like a bug in Pip, as the pre-installed versions of PyArrow and Numpy already satisfied Pandas-GBQ's requirements and Pip's default strategy is not to upgrade unless necessary.

I was able to solve this by manually downgrading, but when I tried to help @Mayakp.wiki do the same thing, she ran into a bunch of problems. After a couple hours of troubleshooting, we were eventually able to fix it by creating a brand new environment and then manually providing the PyArrow and Numpy constraints when installing Pandas-GBQ.
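As a rough sketch of that manual workaround, the pins can be restated on the Pip command line next to the package being installed. The helper name `install_with_pins` is hypothetical, and the two pins are simply copied from the pinned file shown later in this task, so they would need to match whatever the environment actually pins.

```shell
# Hypothetical helper: install the requested packages while restating the
# PyArrow and Numpy pins, so Pip cannot upgrade them past the pinned versions.
install_with_pins() {
    pip install "$@" 'pyarrow==9.0.0' 'numpy<1.24.0'
}

# Usage, inside a fresh environment:
#   install_with_pins pandas-gbq
```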

However, we can fix the problem at the root by providing our version pins to Pip in a constraints file. Interestingly enough, there's actually a syntax that works for both a Conda pinned file and a Pip constraints file:

# Conda doesn't allow `==` on the command line, but does in a file
pyspark==3.1.2
numpy<1.24.0
pandas<2.0.0
pyarrow==9.0.0
# The patch version must be specified (`5.5.*`, not `5.5`), at least with the equals operator
jupyter_core==5.5.*
jupyterhub==1.5.0
jupyterhub-systemdspawner==0.15.0
jupyterhub-ldapauthenticator==1.3.2
jupyterlab_server==2.25.*
sqlalchemy<2.0
jupyterlab==3.4.8

So I suggest we:

  1. Update our conda pinned file to use this syntax by:
    1. Updating conda_environment.yml to always specify the patch version (which can be .*)
    2. Updating generate_conda_pinned.sh to replace = with ==
  2. Configure Pip to treat the modified pinned file as a constraints file. It seems like the best way would be to have conda-analytics-activate set the PIP_CONSTRAINT environment variable with the file path, so that it can use the environment-specific pinned file.
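The two scripted parts of this proposal (steps 1.2 and 2) could look roughly like the sketch below. It is not the actual contents of generate_conda_pinned.sh or conda-analytics-activate. The function names are hypothetical, the `sed` expression assumes the generated file currently uses `name=version` lines, and the path assumes the environment-specific pinned file sits at `conda-meta/pinned` under `$CONDA_PREFIX` (Conda's standard location).

```shell
# Step 1.2 (sketch): rewrite a lone `=` after the package name as `==`.
# Lines already using `==`, `<`, `<=`, or `>=`, and comment lines, pass
# through unchanged. Reads the pinned file on stdin, writes to stdout.
to_shared_pin_syntax() {
    sed -E 's/^([A-Za-z0-9._-]+)=([^=])/\1==\2/'
}

# Step 2 (sketch): point Pip at the active environment's pinned file.
# Assumption: the file lives at conda-meta/pinned under $CONDA_PREFIX.
set_pip_constraint() {
    export PIP_CONSTRAINT="${CONDA_PREFIX}/conda-meta/pinned"
}
```

With something like `set_pip_constraint` called at the end of activation, every `pip install` in that shell would apply the pins without the user having to pass `--constraint` themselves.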