
Airflow: pin dependency versions to prevent long installs
Closed, Resolved, Public

Description

I just noticed that running pip install on the airflow-dags repository sometimes takes forever. One of the problems is that pip's dependency resolver apparently has to guess which version of a dependency is compatible, so it backtracks and downloads nearly every release. See below for the pyspark example, where it downloads a few dozen files, totaling over 7 GB of useless downloads.

Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
     |████████████████████████████████| 281.3 MB 189 kB/s 
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 15.9 MB/s 
Collecting pyspark
  Downloading pyspark-3.1.3.tar.gz (214.0 MB)
     |████████████████████████████████| 214.0 MB 16.4 MB/s 
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 12.5 MB/s 
Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
     |████████████████████████████████| 212.4 MB 28.7 MB/s 
  Downloading pyspark-3.1.1.tar.gz (212.3 MB)
     |████████████████████████████████| 212.3 MB 76 kB/s 
  Downloading pyspark-3.0.3.tar.gz (209.1 MB)
     |████████████████████████████████| 209.1 MB 24.0 MB/s 
  Downloading pyspark-3.0.2.tar.gz (204.8 MB)
     |████████████████████████████████| 204.8 MB 36.6 MB/s 
  Downloading pyspark-3.0.1.tar.gz (204.2 MB)
     |████████████████████████████████| 204.2 MB 134.0 MB/s 
INFO: pip is looking at multiple versions of py4j to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of pyspark to determine which version is compatible with other requirements. This could take a while.
  Downloading pyspark-3.0.0.tar.gz (204.7 MB)
     |████████████████████████████████| 204.7 MB 18.4 MB/s 
  Downloading pyspark-2.4.8.tar.gz (220.5 MB)
     |████████████████████████████████| 220.5 MB 6.7 MB/s 
Collecting py4j==0.10.7
  Downloading py4j-0.10.7-py2.py3-none-any.whl (197 kB)
     |████████████████████████████████| 197 kB 6.1 MB/s 
Collecting pyspark
  Downloading pyspark-2.4.7.tar.gz (217.9 MB)
     |████████████████████████████████| 217.9 MB 20.4 MB/s 
  Downloading pyspark-2.4.6.tar.gz (218.4 MB)
     |████████████████████████████████| 218.4 MB 13.0 MB/s 
  Downloading pyspark-2.4.5.tar.gz (217.8 MB)
     |████████████████████████████████| 217.8 MB 10.8 MB/s 
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. If you want to abort this run, you can press Ctrl + C to do so. To improve how pip performs, tell us what happened here: https://pip.pypa.io/surveys/backtracking
  Downloading pyspark-2.4.4.tar.gz (215.7 MB)
     |████████████████████████████████| 215.7 MB 9.8 MB/s 
  Downloading pyspark-2.4.3.tar.gz (215.6 MB)
     |████████████████████████████████| 215.6 MB 27.8 MB/s 
  Downloading pyspark-2.4.2.tar.gz (193.9 MB)
     |████████████████████████████████| 193.9 MB 21.3 MB/s 
  Downloading pyspark-2.4.1.tar.gz (215.7 MB)
     |████████████████████████████████| 215.7 MB 189 kB/s 
  Downloading pyspark-2.4.0.tar.gz (213.4 MB)
     |████████████████████████████████| 213.4 MB 815 kB/s 
  Downloading pyspark-2.3.4.tar.gz (212.3 MB)
     |████████████████████████████████| 212.3 MB 10.9 MB/s 
  Downloading pyspark-2.3.3.tar.gz (211.9 MB)
     |████████████████████████████████| 211.9 MB 40.2 MB/s 
  Downloading pyspark-2.3.2.tar.gz (211.9 MB)
     |████████████████████████████████| 211.9 MB 79 kB/s 
  Downloading pyspark-2.3.1.tar.gz (211.9 MB)
     |████████████████████████████████| 211.9 MB 255 kB/s 
  Downloading pyspark-2.3.0.tar.gz (211.9 MB)
     |████████████████████████████████| 211.9 MB 12.6 MB/s 
Collecting py4j==0.10.6
  Downloading py4j-0.10.6-py2.py3-none-any.whl (189 kB)
     |████████████████████████████████| 189 kB 17.7 MB/s 
Collecting pyspark
  Downloading pyspark-2.2.3.tar.gz (188.5 MB)
     |████████████████████████████████| 188.5 MB 120 kB/s 
  Downloading pyspark-2.2.2.tar.gz (188.0 MB)
     |████████████████████████████████| 188.0 MB 6.1 MB/s 
  Downloading pyspark-2.2.1.tar.gz (188.2 MB)
     |████████████████████████████████| 188.2 MB 56 kB/s 
Collecting py4j==0.10.4
  Downloading py4j-0.10.4-py2.py3-none-any.whl (186 kB)
     |████████████████████████████████| 186 kB 14.5 MB/s 
Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3 MB)
     |████████████████████████████████| 188.3 MB 14.8 MB/s 
WARNING: Discarding https://files.pythonhosted.org/packages/f6/fe/4a1420f1c8c4df40cc8ac1dab6c833a3fe1986abf859135712d762100fde/pyspark-2.2.0.post0.tar.gz#sha256=9dc994118608ce12939d86dec27ce8a545cc6e6a4d76bca785a37322daa33a3c (from https://pypi.org/simple/pyspark/). Requested pyspark from https://files.pythonhosted.org/packages/f6/fe/4a1420f1c8c4df40cc8ac1dab6c833a3fe1986abf859135712d762100fde/pyspark-2.2.0.post0.tar.gz#sha256=9dc994118608ce12939d86dec27ce8a545cc6e6a4d76bca785a37322daa33a3c (from apache-airflow-providers-apache-spark->wmf-airflow-dags==0.1.0) has inconsistent version: filename has '2.2.0.post0', but metadata has '2.2.0'
  Downloading pyspark-2.1.3.tar.gz (181.3 MB)
     |████████████████████████████████| 181.3 MB 22.8 MB/s 
  Downloading pyspark-2.1.2.tar.gz (181.3 MB)
     |████████████████████████████████| 181.3 MB 115 kB/s
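
To put a number on how much a log like the one above downloads, the "Downloading" lines can be summed per package. This is only a rough sketch: the function name and the sample text are made up for illustration, and it assumes pip reports sizes in decimal units (kB, MB, GB).

```python
import re
from collections import defaultdict

# Matches pip lines such as "  Downloading pyspark-3.2.0.tar.gz (281.3 MB)".
# The package name is taken to be everything before the first "-<digit>".
DOWNLOAD_RE = re.compile(
    r"Downloading\s+(?P<name>[A-Za-z0-9_.]+?)-\d\S*\s+"
    r"\((?P<size>[\d.]+)\s*(?P<unit>kB|MB|GB)\)"
)

# Assumption: pip's size units are decimal (1 MB = 1000 kB).
UNIT_TO_MB = {"kB": 0.001, "MB": 1.0, "GB": 1000.0}


def downloaded_mb_per_package(log_text):
    """Return {package name: total MB downloaded} for a pip log."""
    totals = defaultdict(float)
    for match in DOWNLOAD_RE.finditer(log_text):
        size_mb = float(match["size"]) * UNIT_TO_MB[match["unit"]]
        totals[match["name"]] += size_mb
    return dict(totals)


# Hypothetical usage with the first few lines of the log above.
SAMPLE_LOG = """\
Collecting pyspark
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Collecting pyspark
  Downloading pyspark-3.1.3.tar.gz (214.0 MB)
"""

for name, total in sorted(downloaded_mb_per_package(SAMPLE_LOG).items()):
    print(f"{name}: {total:.1f} MB")
```

Running it over a full saved log would show which package dominates the download volume, which is the first candidate for a pin.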

Event Timeline

Also, once we provide Spark 3, we should make airflow-dags avoid depending on pyspark, if we can.

In the meantime, we should probably use the Airflow constraints files to pin the list of deps.
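
As a rough sketch of what that could look like: Airflow publishes a constraints file per Airflow release and Python version, and pip accepts it through --constraint. The helper names below are hypothetical, and the Airflow version "2.3.2" is only a placeholder, not necessarily the version airflow-dags targets.

```python
import sys

# URL layout of the constraints files published in the apache/airflow repository.
CONSTRAINTS_URL_TEMPLATE = (
    "https://raw.githubusercontent.com/apache/airflow/"
    "constraints-{airflow_version}/constraints-{python_version}.txt"
)


def constraints_url(airflow_version, python_version=None):
    """Build the constraints URL for an Airflow release and a Python minor version."""
    if python_version is None:
        # Default to the interpreter running this script, e.g. "3.7".
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return CONSTRAINTS_URL_TEMPLATE.format(
        airflow_version=airflow_version, python_version=python_version
    )


def pip_install_command(airflow_version, target=".", python_version=None):
    """Return the argv for installing `target` under the Airflow constraints."""
    return [
        sys.executable, "-m", "pip", "install", target,
        "--constraint", constraints_url(airflow_version, python_version),
    ]


# Placeholder version: replace with the Airflow release actually in use.
print(" ".join(pip_install_command("2.3.2", python_version="3.7")))
```

The resulting list can be passed to subprocess.run as-is. With every transitive dependency held to a single version by the constraints file, the resolver should have far less room to backtrack.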

I faced the same issue and the problem was due to a failed install of a previous package due to a missing dependency on the host (see https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow/Developer_guide#Setting_up_the_environment). I'd be interested to know if this issue was the same.

EChetty lowered the priority of this task from High to Medium. Jun 30 2022, 6:37 PM

I followed that guide, and running pip install . on my machine still goes through tons of versions of pyspark, gssapi, krb5, etc.
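
One way to see which packages leave the resolver that much room is to list the requirements that have no exact pin. This is a minimal sketch: the function name and the sample requirement strings are made up, not taken from the airflow-dags setup files, and it treats any "==" specifier as a pin.

```python
import re

# Leading project name of a requirement line, e.g. "pyspark" in "pyspark>=2.4".
NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def unpinned_requirements(requirements):
    """Return the names of requirements that lack an '==' version pin."""
    loose = []
    for line in requirements:
        # Drop trailing comments and environment markers before checking.
        spec = line.split("#", 1)[0].split(";", 1)[0].strip()
        if not spec:
            continue
        match = NAME_RE.match(spec)
        if match is None:
            continue  # skip pip options such as "-r other.txt"
        if "==" not in spec:
            loose.append(match.group(1))
    return loose


# Hypothetical requirement list for illustration only.
example = ["pyspark>=2.4", "py4j==0.10.9", "gssapi", "krb5 ; python_version < '3.8'"]
print(unpinned_requirements(example))
```

Anything it reports is a package where pip is free to try many releases in turn.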

I've basically never been able to set up the testing environment properly, but I just figured that it was still WIP. We should fix this sooner rather than later. @EChetty, I don't think this is medium priority: if anyone else is going to develop against Airflow, this needs to be foolproof.

(Though, to be fair, I don't see the same kind of problem in the CI environment, so maybe it's just my machine. Either way, it's still a developer-friendliness issue.)