
Newpyter python spark kernels
Closed, Resolved (Public)

Description

Hello - question about Newpyter with stacked conda envs. The snippet below fetches an image from swift a few times, using http and https with and without verification. After starting a new server with a new conda env, the snippet prints three <Response [200]> when executed with a PySpark - Local kernel. With the default Python 3 kernel, however, the last https call fails with an SSL certificate verification error. What added to my confusion is that in a notebook (i.e. on a stat machine) a plain http call to swift succeeds, while on a yarn worker the http call times out and you need to use https (with verify=False). Does anybody have an intuition why that is?

import requests

http_swift = "http://ms-fe.svc.eqiad.wmnet"
https_swift = "https://ms-fe.svc.eqiad.wmnet"
image_path = "/wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/100px-Tour_Eiffel_Wikimedia_Commons.jpg"

print(requests.get(http_swift + image_path))                 # plain http
print(requests.get(https_swift + image_path, verify=False))  # https, certificate verification disabled
print(requests.get(https_swift + image_path, verify=True))   # https, certificate verification enabled

With the default Python 3 kernel, the last call raises:

SSLError: HTTPSConnectionPool(host='ms-fe.svc.eqiad.wmnet', port=443): Max retries exceeded with url: /wikipedia/commons/thumb/a/a8/Tour_Eiffel_Wikimedia_Commons.jpg/100px-Tour_Eiffel_Wikimedia_Commons.jpg (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)')))

It might have to do with how the pyspark kernels are started, i.e. changing PYTHONPATH and creating the spark session at startup. For example, packages that are pip-installed into a stacked conda env are not available in a pyspark kernel. As Newpyter is still a work in progress, is the intention to have people create spark sessions manually in a notebook, or will the pyspark kernels need to be updated to work with Newpyter?

Event Timeline

the intention to have people create spark sessions manually in a notebook

Quick response: this ^ is the intention, using wmfdata-python to aid in SparkSession creation if desired.
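
For anyone wondering what the manual route looks like, here's a minimal sketch (the builder options are illustrative example values; wmfdata-python's helper does something similar if you'd rather not write this yourself):

from pyspark.sql import SparkSession

# Minimal sketch: create the SparkSession yourself inside the notebook instead
# of having a PySpark kernel do it at startup. App name and executor count are
# just example values.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("newpyter-notebook")
    .config("spark.executor.instances", 2)
    .getOrCreate()
)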

fdans triaged this task as Medium priority.
fdans moved this task from Incoming to Data Exploration Tools on the Analytics board.
Ottomata renamed this task from Newpyter python kernels to Newpyter python spark kernels. Feb 24 2021, 10:48 PM

I'd really like to make Fabian's script to auto-pack and ship conda envs into yarn something we can use easily. It should work from the CLI as well as in a Python notebook, whether the SparkSession is instantiated directly or via wmfdata-python.

I did a test today to see what the various PYTHON_* pyspark2 incantations would do on remote executors:

def python_interpreter():
    # Report which host and which Python executable this function runs under.
    import sys, platform
    return "{}: {}".format(platform.node(), sys.executable)

# Two partitions so the check runs on (up to) two different executors.
rdd = spark.sparkContext.parallelize(range(2), 2).mapPartitions(lambda p: [python_interpreter()])
rdd.collect()

Results with various settings via pyspark2:

# activate my conda env so that $(which python) is my conda env's python.
source conda-activate-stacked

# default
pyspark2 --master yarn --num-executors 2
Out[1]: ['an-worker1111: /usr/bin/python3.7', 'an-worker1089: /usr/bin/python3.7']

# Attempting to use my conda python remotely, should fail.
PYSPARK_PYTHON=$(which python) pyspark2 --master yarn --num-executors 2
Cannot run program "/home/otto/.conda/envs/2021-02-24T20.02.33_otto/bin/python": error=2, No such file or directory

# Using anaconda-wmf python, should succeed since it exists remotely
PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python pyspark2 --master yarn --num-executors 2
['analytics1076: /usr/lib/anaconda-wmf/bin/python', 'an-worker1116: /usr/lib/anaconda-wmf/bin/python']

# Setting via spark.executorEnv or spark.yarn.appMasterEnv.  I think appMasterEnv only works in yarn cluster mode, which we are not concerned with atm.
pyspark2 --master yarn --num-executors 2 --conf spark.executorEnv.PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python
['an-worker1090: /usr/bin/python3.7', 'analytics1063: /usr/bin/python3.7']

pyspark2 --master yarn --num-executors 2 --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/lib/anaconda-wmf/bin/python
['an-worker1084: /usr/bin/python3.7', 'an-worker1080: /usr/bin/python3.7']

# Setting via spark.pyspark.python instead of PYSPARK_PYTHON
pyspark2 --master yarn --num-executors 2 --conf spark.pyspark.python=/usr/lib/anaconda-wmf/bin/python
['analytics1067: /usr/lib/anaconda-wmf/bin/python', 'an-worker1101: /usr/lib/anaconda-wmf/bin/python']

It looks like setting PYSPARK_PYTHON works well. Now, what I want is some smart sourcing of env vars so that the local conda env is used together with the remote base anaconda-wmf env, OR, if the user wants, to automatically pack and ship the user's conda env to the yarn executors so that extra dependencies can be used there.
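
As a rough sketch of what the pack-and-ship variant could look like from a plain Python process (everything here is illustrative: it assumes the env has already been packed into conda_env.tar.gz with something like conda-pack, and the archive name, the #env alias, and the executor python path are placeholders):

from pyspark.sql import SparkSession

# Sketch only: ship a pre-packed archive of the user conda env to the yarn
# executors and point the executor python at the unpacked copy. Yarn extracts
# the archive into each container's working directory under the "env" alias,
# so the relative path below resolves there.
spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.yarn.dist.archives", "conda_env.tar.gz#env")
    .config("spark.pyspark.python", "./env/bin/python")
    .getOrCreate()
)

spark.pyspark.python is used here rather than the executorEnv route because, in the test above, only the former actually changed the executor interpreter.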

Work in progress here:
https://gist.github.com/ottomata/71c563efe91748fdd10be813cc6e0a8f

It almost works! The only issue is that PYSPARK_SUBMIT_ARGS does not seem to be respected if I go through the pyspark2 CLI. Passing the --archives option to pyspark2 works, though.

I'd rather not write a 'conda-pyspark2' wrapper script, but I'm not sure how else to automate this yet. We could put all of this automation in wmfdata-python, but I'd really prefer to not depend on wmfdata-python to ship conda envs.

Aside: what is especially cool about this shipped conda env is that the anaconda.pth file, which accomplishes the PYTHONPATH stacking of the user conda env on top of the base anaconda-wmf env, works in the yarn executor! Very cool! If you look at one in a user conda env, you'll see:

cat /srv/home/otto/.conda/envs/2021-02-24T20.02.33_otto/lib/python3.7/site-packages/anaconda.pth
/usr/lib/anaconda-wmf/lib/python37.zip
/usr/lib/anaconda-wmf/lib/python3.7
/usr/lib/anaconda-wmf/lib/python3.7/lib-dynload
/usr/lib/anaconda-wmf/lib/python3.7/site-packages

anaconda.pth is packed and shipped to the yarn executors along with the rest of the conda env, and those /usr/lib/anaconda-wmf paths exist on all the yarn worker filesystems, so they are automatically loadable by the remote python process too!
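
To make the mechanism concrete (this is generic Python behavior, not Newpyter-specific): the site module processes every *.pth file it finds in site-packages at interpreter startup and appends each line to sys.path. A quick way to see the stacking on the executors is a variant of the interpreter check above, assuming a session whose executors run the shipped env:

def stacked_paths():
    # By the time user code runs, site has already processed anaconda.pth,
    # so the base anaconda-wmf paths are ordinary sys.path entries.
    import sys
    return [p for p in sys.path if p.startswith("/usr/lib/anaconda-wmf")]

spark.sparkContext.parallelize(range(2), 2).mapPartitions(lambda p: [stacked_paths()]).collect()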

Change 667689 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Configure spark to work better with conda environments

https://gerrit.wikimedia.org/r/667689

Just wanted to chime in and say thanks for making this PR @Ottomata -- I probably won't get around to testing it in the next week or two. I don't want to hold you up, though, because I know you're also working on a new base conda env for Newpyter that would presumably benefit greatly from this patch. I'll take a look in a few weeks or once it's deployed!

I second Isaac's comment. I reviewed the PR and tested it successfully.

Change 667689 merged by Ottomata:
[operations/puppet@production] Configure spark to work better with conda environments

https://gerrit.wikimedia.org/r/667689

Change 668466 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Jupyter - never use webproxy for *.wmnet URLs and use system cacerts

https://gerrit.wikimedia.org/r/668466

@fkaelin https://gerrit.wikimedia.org/r/c/operations/puppet/+/668466 should fix your original bug around requests and CA certificates. Nice find, thank you!

Change 668566 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/debs/anaconda-wmf@debian] Add activate.d and deactivate.d env_vars.sh

https://gerrit.wikimedia.org/r/668566

Actually, I like this fix better:
https://gerrit.wikimedia.org/r/c/operations/debs/anaconda-wmf/+/668566

That will have to wait until I get to make a new anaconda-wmf release (SOON!)

Change 668566 merged by Ottomata:
[operations/debs/anaconda-wmf@debian] Add activate.d and deactivate.d env_vars.sh

https://gerrit.wikimedia.org/r/668566

Change 668466 merged by Ottomata:
[operations/puppet@production] Jupyter - never use webproxy for *.wmnet URLs and make Java use system cacerts

https://gerrit.wikimedia.org/r/668466

Change 673083 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Always set REQUESTS_CA_BUNDLE for spark so that Python executors will use the proper CA certs

https://gerrit.wikimedia.org/r/673083

Change 673083 merged by Ottomata:
[operations/puppet@production] Always set REQUESTS_CA_BUNDLE for spark

https://gerrit.wikimedia.org/r/673083
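
For anyone who hit the original SSLError: requests reads the REQUESTS_CA_BUNDLE environment variable and uses it as the CA bundle when verification is enabled, which is why exporting it for the spark processes fixes the executor-side failures. A minimal check (the bundle path here is the standard Debian location and is only an assumption about what gets exported):

import os
import requests

# Assumed path; the puppet change exports REQUESTS_CA_BUNDLE for spark processes,
# this just demonstrates the mechanism requests relies on.
os.environ.setdefault("REQUESTS_CA_BUNDLE", "/etc/ssl/certs/ca-certificates.crt")

url = ("https://ms-fe.svc.eqiad.wmnet/wikipedia/commons/thumb/a/a8/"
       "Tour_Eiffel_Wikimedia_Commons.jpg/100px-Tour_Eiffel_Wikimedia_Commons.jpg")
print(requests.get(url))  # verification is on by default, no verify=False needed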

This task ended up covering a bit more than Fabian's original bug report, but I think things are looking OK here. Feel free to reopen.

Thanks @Ottomata, I can also confirm that the certificates work now too, i.e. a request with verify=True now succeeds on the workers as well.