
Package versions in Conda-Analytics are not pinned
Open, Medium, Public

Description

It might seem that versions of crucial packages in Conda-Analytics are pinned in conda-environment.yml (which in turn adds them to conda-environment.lock.yml).

However, those files only specify which versions are installed when the environment is first created; Conda happily ignores them in all later transactions. This doesn't just mean that package A will be updated if the user runs conda update A. If the user runs conda install B and B lists A as a dependency, Conda will automatically upgrade A to the latest version (even if B's requirement is already satisfied by the existing version).

As you can imagine, this is a huge source of environment problems!

It should be easy to fix this by actually pinning versions where necessary: add the specifications to a pinned file in the environment's conda-meta directory (see the Conda docs).

Here's an example pinned file:

jupyter_core ==5.5.0
jupyter_server ==1.24.0
jupyter_telemetry ==0.1.0
jupyterhub ==1.5.0
jupyterhub-ldapauthenticator ==1.3.2
jupyterhub-singleuser ==1.5.0
jupyterhub-systemdspawner ==0.15.0
jupyterlab ==3.4.8
jupyterlab_pygments ==0.2.2
jupyterlab_server ==2.25.0
# https://phabricator.wikimedia.org/T356230
numpy <1.24
# https://phabricator.wikimedia.org/T356230
pandas <2.2
pyspark ==3.1.2
python ==3.10.*
sqlalchemy <2
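
As a minimal sketch of how such a file could be put in place, the Python snippet below writes a list of specifications to conda-meta/pinned under an environment prefix. The function name write_pinned_file and its arguments are hypothetical and are not part of conda or conda-analytics; the only thing taken from the description above is the file location and the one-spec-per-line format.

```python
from pathlib import Path


def write_pinned_file(env_prefix, specs):
    """Write one spec per line to <env_prefix>/conda-meta/pinned and return the path.

    Hypothetical helper for illustration; not part of conda or conda-analytics.
    """
    conda_meta = Path(env_prefix) / "conda-meta"
    # A real environment already has conda-meta; it is created here only so
    # that the sketch also runs against an empty directory.
    conda_meta.mkdir(parents=True, exist_ok=True)
    pinned_path = conda_meta / "pinned"
    pinned_path.write_text("\n".join(specs) + "\n", encoding="utf-8")
    return pinned_path
```

Calling write_pinned_file(prefix, ["numpy <1.24", "pandas <2.2"]) would overwrite any existing pinned file at that prefix, so a production version would probably want to merge with or refuse to replace existing pins.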

Details

Reference | Source Branch | Dest Branch | Author | Title
repos/data-engineering/conda-analytics!43 | pin_essential_conda_analytics_packages | main | stevemunene | Pin essential conda-analytics packages
repos/data-engineering/conda-analytics!42 | pin_essential_conda_analytics_packages | main | stevemunene | Draft: Pin essential conda-analytics packages

Event Timeline

Gehel triaged this task as Medium priority. Feb 9 2024, 1:29 PM
Gehel moved this task from Incoming to 2024.02.12 - 2024.03.03 on the Data-Platform-SRE board.

We have introduced a conda-analytics pinned file with pandas and numpy versions for starters, and built the dev deb package, which we are going to test on an-test-client1002.

Mentioned in SAL (#wikimedia-analytics) [2024-04-03T11:46:02Z] <stevemunene> disable puppet on an-test-client1002 to test new conda-analytics version T356231

New package introduces a pinned file for the base environment

stevemunene@an-test-client1002:~$ cat /opt/conda-analytics/conda-meta/pinned 
# https://phabricator.wikimedia.org/T356230
numpy <1.24
# https://phabricator.wikimedia.org/T356230
pandas <2.2

The current pinned file is a base model and we might need a more standardised production pinned file cc @nshahquinn-wmf

@Stevemunene I just created a new cloned Conda environment on an-test-client1002 using the Jupyter GUI. However, it doesn't have a pinned file:

nshahquinn-wmf@an-test-client1002:~/.conda/envs/2024-04-03T21.34.11_nshahquinn-wmf/conda-meta$ cat pinned
cat: pinned: No such file or directory

> The current pinned file is a base model and we might need a more standardised production pinned file cc @nshahquinn-wmf

The "base model" pinned file does work pretty well: I manually added it to my environment, tried installing/updating a fairly long list of packages an analyst would likely use, and none of the non-pinned crucial packages (e.g. Python, Pyspark, Jupyter packages) got touched since they weren't in the dependency tree.

However, I still think it would be better to have a larger pinned file. In the past (before I introduced my own pinned files), I have broken things quite severely by trying to update Python or JupyterLab or by running conda update --all, and it's better to just make those operations safe even if most people aren't likely to try them 😁
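
The manual step mentioned in this comment (adding the base pinned file to a cloned environment) could be scripted roughly as follows. This is only a sketch: copy_pinned_file is a hypothetical name, and the choice to leave an existing pinned file in the clone untouched is an assumption, not something the task specifies.

```python
import shutil
from pathlib import Path


def copy_pinned_file(base_prefix, clone_prefix):
    """Copy conda-meta/pinned from the base environment into a clone.

    Hypothetical helper for illustration. Returns True if the file was
    copied, False if the clone already had a pinned file.
    """
    source = Path(base_prefix) / "conda-meta" / "pinned"
    target = Path(clone_prefix) / "conda-meta" / "pinned"
    if not source.is_file():
        raise FileNotFoundError(f"no pinned file at {source}")
    if target.exists():
        # Assumption: never overwrite pins the user already maintains.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return True
```

With the paths that appear in this task, the call would look like copy_pinned_file("/opt/conda-analytics", "/home/<user>/.conda/envs/<env-name>").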

Thanks @nshahquinn-wmf. At the moment, the pinned file is only included in a clone if the user asks for it. There is not yet a way to include it by default, which is not optimal UX.
To include the pinned file, pass the --pinned flag when cloning, as shown below:

stevemunene@an-test-client1002:~$ conda-analytics-clone test-pinned --pinned
Creating new cloned conda env test-pinned...
Source:      /opt/conda-analytics
Destination: /home/stevemunene/.conda/envs/test-pinned
.
.
.
Alternatively, you can use the conda-analytic helper script:
  source conda-analytics-activate test-pinned

Checking for the pinned file

stevemunene@an-test-client1002:~$ cat /opt/conda-analytics/conda-meta/pinned 
# https://phabricator.wikimedia.org/T356230
numpy <1.24
# https://phabricator.wikimedia.org/T356230
pandas <2.2

We are looking for a way around this so that the pinned file is included by default in all cloned environments.
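
Until a default exists, one way to see which existing clones lack a pinned file is a small audit like the sketch below. It assumes that clones live as subdirectories of a single envs directory (such as ~/.conda/envs, as seen in the transcripts above) and that a directory counts as an environment only if it contains conda-meta. The function name envs_missing_pinned is hypothetical.

```python
from pathlib import Path


def envs_missing_pinned(envs_dir):
    """Return the sorted names of environments under envs_dir without conda-meta/pinned.

    Hypothetical helper for illustration. Directories that have no
    conda-meta subdirectory are not treated as environments and are skipped.
    """
    envs_dir = Path(envs_dir)
    if not envs_dir.is_dir():
        return []
    return sorted(
        env.name
        for env in envs_dir.iterdir()
        if (env / "conda-meta").is_dir()
        and not (env / "conda-meta" / "pinned").is_file()
    )
```

For example, envs_missing_pinned(Path.home() / ".conda" / "envs") would list the current user's clones that still need pins.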