
Python virtual environment does not seem to get properly activated by a job using the new Jobs framework
Closed, Resolved · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):
I have a bunch of jobs that I am currently running using the old Grid Engine and I want to move them to the new Jobs framework. All of my jobs start by activating a Python virtual environment and then using the python3 command from that venv to invoke a Python script. They work with no issues using the Grid Engine.

For the sake of this bug report, we can use [[https://github.com/PersianWikipedia/fawikibot/blob/master/HujiBot/grid/jobs/daily.sh|this script]], also copied below:

daily.sh
. ~/venv/bin/activate
python3 ~/core/pwb.py touch -page:"صفحهٔ_اصلی" -purge
python3 ~/core/pwb.py hujibot_20
python3 ~/core/pwb.py weekly-slow 14

What happens?:
When I send the job to the Grid Engine using jsub -N "daily" -once -o ~/err/daily.out -e ~/err/daily.err ~/grid/jobs/daily.sh it works without issues.

When I send a similar job to the new Jobs framework using toolforge-jobs run "daily" --command ~/grid/jobs/daily.sh --image tf-bullseye-std the job fails quickly. Here is what I see in the daily.err file which the new Jobs framework creates in the home directory of my tool account:

/data/project/huji/grid/jobs/daily.sh: 2: python3: not found
/data/project/huji/grid/jobs/daily.sh: 3: python3: not found
/data/project/huji/grid/jobs/daily.sh: 4: python3: not found

This indicates that the . ~/venv/bin/activate line executed with no issue (note that the first error line is for line 2), but despite that, the python3 command is not available.
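For anyone debugging this, a minimal wrapper along these lines (the script name and the extra echo/ls lines are just illustrative, not part of my actual jobs) shows what the job actually sees after activation:

debug.sh
. ~/venv/bin/activate
echo "$PATH" # confirm the venv's bin directory is first on PATH
ls -l ~/venv/bin/python3 # a venv's python3 is typically a symlink to the interpreter that created it
command -v python3 || echo "python3 not found on PATH"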

What should have happened instead?:
When I run . ~/venv/bin/activate interactively or using the Grid Engine, python3 is available. It should be the same with the Jobs framework too.

Event Timeline

You should use the tf-python39 image instead of tf-bullseye-std.

@JJMC89, no, that is not true. The first step in the shell script above is to load a venv, so even with tf-bullseye-std it should still use my venv's python (not the OS python).

If I use tf-python39 as you described, it still ends up using the OS's python. Case in point: my second command (the one that invokes pywikibot using pwb.py) now fails with "No module named 'requests'", which is because the pywikibot script references the python module requests, which does not come with the OS python. If it were correctly running my venv's python, that error would not be shown, because in my venv I have already installed requests using pip.

So, the issue is not with which image to use. It is with venv not really being activated when it should be.

Did you create the venv from the same image that you are trying to run it in?

Ummm. I'm not sure I understand you here.

The venv was not created by a job; I created it in the home directory of my tool. In the legacy Grid Engine case, that was okay and I could just activate it as the first step of each job. Are you saying that the new Jobs framework (which I understand is k8s-based) requires the venv to be created by a job itself?

Or are you saying that my old way would still work, but I have to re-create the venv now that we've moved to Stretch (because I certainly created the venv during the Trusty era)?

And if the former is correct, i.e. I should add python -m venv ... followed by pip install ... to the beginning of my job scripts, then is there a way to at least share a venv across jobs? Most of my jobs use the same set of 10-20 pip packages, so if we could avoid creating a venv and installing the same packages on every single run, we would waste fewer resources.

You don't need to create the venv in each job, but it seems like you may have created your venv from the bastion instead of inside k8s. You definitely should create a new venv if the current one is from Trusty (even if you were not using k8s) since the python version has changed.

Help:Toolforge/Python#Kubernetes python jobs and Help:Toolforge/Pywikibot#Using the virtual environment on Kubernetes have examples of how to set up a venv in k8s.

I am still having issues.

I logged into Toolforge and used become to log in as my tool account. Then I did the following:

which python3 # returns /usr/bin/python3
python3 --version # returns Python 3.7.3
python3 -m venv venv_for_tfj
. venv_for_tfj/bin/activate # activates this venv
which python3 # returns /mnt/nfs/labstore-secondary-tools-project/huji/venv_for_tfj/bin/python3
pip3 install requests # successfully installs the pip package and its dependencies

I also created a python job saved at ~/tfj/job.py:

import requests

print("Hello world!")

And a wrapper script to run this job after activating the venv (saved at ~/tfj/job.sh):

which python3
. venv_for_tfj/bin/activate
which python3
python3 tfj/job.py

Then I ran this wrapper script by calling . tfj/job.sh manually on the command line, and the output I got made sense:

/usr/bin/python3
/mnt/nfs/labstore-secondary-tools-project/huji/venv_for_tfj/bin/python3
Hello world!

But when I ran the same wrapper script as a job on the Jobs framework using tfj run "test" --command ~/tfj/job.sh --image tf-python39 the output saved in the logs did not make sense:

test.err
Traceback (most recent call last):
  File "/data/project/huji/tfj/job.py", line 1, in <module>
    import requests
ModuleNotFoundError: No module named 'requests'
test.out
/usr/bin/python3
/usr/bin/python3

It seems like the venv is not being activated when I run the script as a job.

Can you help please?

Running from the bastion works fine for me, showing the expected venv python. However, to run in k8s, you need to create the venv from k8s instead of the bastion.

cat tfj/setup-venv.sh
python3 -m venv venv_for_tfj
. venv_for_tfj/bin/activate
pip3 install requests
toolforge-jobs run setup-venv --command $HOME/tfj/setup-venv.sh --image tf-python39 --wait
cat setup-venv.out
Collecting requests
  Using cached requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Collecting charset-normalizer~=2.0.0
  Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2022.5.18.1-py3-none-any.whl (155 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.3-py3-none-any.whl (61 kB)
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests
Successfully installed certifi-2022.5.18.1 charset-normalizer-2.0.12 idna-3.3 requests-2.27.1 urllib3-1.26.9
toolforge-jobs run test --command ~/tfj/job.sh --image tf-python39 --wait
cat test.out
/usr/bin/python3
/data/project/jjmc89-bot-dev/venv_for_tfj/bin/python3
Hello world!

I will investigate this a bit more, but at least on a cursory look, it seems like a venv that is created in k8s will not work properly on the bastion. That is not a deal breaker, but it is a bit annoying that I can only use k8s venvs with k8s and bastion venvs with the bastion. It makes configurations harder to build and test.

My guess is that this is because bastions have Python 3.7 and k8s has Python 3.9, but that doesn't fully make sense to me either: ultimately, when you activate a venv (whether on the bastion or in k8s), you should end up using the python from the venv and essentially be on Python 3.9 in both cases. And I have certainly used venvs with different versions of python on the same VM in the past; just not using k8s.

PS: by "work properly" I mean that I can, for instance, run pip3 install ... via a job deployed on k8s, but if I just activate the k8s-created venv on the bastion using my tool account and try to run pip3 install, I get an error.
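To illustrate, this is roughly what I mean (a sketch, not an exact transcript of the failing session):

# on the bastion, as the tool account
. ~/venv_for_tfj/bin/activate # this venv was created by a job inside the tf-python39 image
which python3 # points into the venv, as expected
pip3 install requests # this is where I get an error, presumably a 3.7 vs 3.9 interpreter mismatch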

I have found two strategies that would work for me. @JJMC89 and @taavi I would appreciate it if you could advise me as to which you think is best.

Option one: create a venv for each job

For instance, I can run a script like this:

cd
timestamp="$(date +"%Y%m%d%I%M%S")"
venvdir="v_${timestamp}"
eval "python3 -m venv ${venvdir}"
eval ". ${venvdir}/bin/activate"
pip3 install -r ~/grid/requirements.txt
python3 ~/core/pwb.py centitanha
deactivate
eval "rm -rf ${venvdir}"

This would replace an existing script that is like this:

source ~/venv/bin/activate
python3 ~/core/pwb.py centitanha

I could go even crazier and specify a separate requirements.txt file per job, or type out the list of pip packages in the job instead of referencing the requirements.txt file, but since most of my jobs use the same pip packages (PyMySQL, ipwhois, requests and a few more), it is easier to have one requirements.txt file for all of them.

Pros:

  • Easy to maintain template
  • Since venv is created and deleted by the job, my tool's home directory will remain relatively clean
  • I may be able to templatize this, such that I only have to pass the actual bot command (the third-to-last line above) to the template and it would wrap it, in which case each new job script would also be just 1 or 2 lines, which would be neat (see the sketch after this list).
  • I can run the jobs interactively too; see the comments above about how a venv created interactively won't work with the Jobs framework and vice versa; this approach lets me run the exact same job both interactively and via the Jobs framework, and it will work both ways.

Cons:

  • Jobs now take much longer to start, because they have to create a venv and install pip packages each time.
  • Jobs will install some pip packages each time that may not be used.
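A sketch of what such a template could look like (the wrapper name run-in-venv.sh and the idea of passing the actual command as arguments are just my assumptions at this point):

run-in-venv.sh
#!/bin/bash
# build a throwaway venv, install the shared requirements, run the passed command, clean up
cd "$HOME"
venvdir="v_$(date +"%Y%m%d%I%M%S")"
python3 -m venv "${venvdir}"
. "${venvdir}/bin/activate"
pip3 install -r ~/grid/requirements.txt
"$@"
deactivate
rm -rf "${venvdir}"

Each job script would then shrink to a single line such as run-in-venv.sh python3 ~/core/pwb.py centitanha.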
Option two: create a venv for all jobs using a job

I can have a job which specifically creates a venv with all my requirements in it, something like this:

cd
venvdir="venv_for_tfj"
eval "python3 -m venv ${venvdir}"
eval ". ${venvdir}/bin/activate"
pip3 install -r ~/grid/requirements.txt
deactivate

Then in my specific jobs, instead of sourcing ~/venv/bin/activate (which is a venv created interactively by me), I will source ~/venv_for_tfj/bin/activate.

Pros:

  • Overhead is smaller, as the venv-creating job is only run by me from time to time (say once every few months) to update my venv
  • Actual job scripts will only be minimally modified

Cons:

  • Jobs will only work if invoked by the Jobs framework. Running them by hand will result in issues caused by the different version of Python in the Jobs framework (3.9) versus the interactive shell of bastions (3.7 I think).

I use option 2. If I need to do something interactively (rare), then I use webservice python3.9 shell with the same venv.
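For the interactive case, the workflow is roughly this (a sketch, reusing the test job from my earlier comment):

webservice python3.9 shell # opens a shell inside the same python3.9 k8s image
. ~/venv_for_tfj/bin/activate # same venv the jobs use
python3 ~/tfj/job.py
exit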

Coming back to this just to memorialize how I finally got things to work.

Step 0: set up the desired directory structure

I ended up setting up a dedicated directory under my tool's home, like this:

~/tfj
├── jobs
│   ├── job.py
│   └── job.sh
└── logs
    └── job.log
Step 1: set up the python virtual environment

I ran a tfj job to create the venv in the first place; importantly, I ran it using the tf-python37 image so it uses the same python as bastions:

job.sh
python3 -m venv ~/env

Run with command: tfj run "job" --command ~/tfj/jobs/job.sh -e ~/tfj/logs/job.log -o ~/tfj/logs/job.log --image tf-python37

Step 2: install the requirements in that venv

Next, on the bastion, I activated that venv (which works, I suppose, because the python versions are compatible) and ran python3 -m pip install -r requirements.txt, referencing the requirements file for my projects.
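In commands, that step was roughly the following (the location of the requirements file is a placeholder here):

# on the bastion, as the tool account
. ~/env/bin/activate
python3 -m pip install -r ~/grid/requirements.txt
deactivate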

At this point, a simple job that loaded that venv would work both from the interactive command line on the bastion and from tfj.

job.sh
which python3
. ~/env/bin/activate
which python3
python3 ~/tfj/jobs/job.py
deactivate

Run with command: tfj run "job" --command ~/tfj/jobs/job.sh -e ~/tfj/logs/job.log -o ~/tfj/logs/job.log --image tf-python37

Step 3: migrate

Now that everything works similarly to the old grid engine way, I can migrate my scripts to use tfj instead! Time to move to T319800
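For example, the daily job from the description moves from the old jsub invocation to something like this (a sketch following the flag pattern above; the job script itself also needs to point at the new ~/env venv instead of the old ~/venv one):

# old Grid Engine invocation:
#   jsub -N "daily" -once -o ~/err/daily.out -e ~/err/daily.err ~/grid/jobs/daily.sh
# new Jobs framework invocation:
tfj run "daily" --command ~/grid/jobs/daily.sh -o ~/tfj/logs/daily.log -e ~/tfj/logs/daily.log --image tf-python37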