While testing a very simple DAG with a single task that uses pyarrow and fsspec to download a file from HDFS to the local filesystem, we encountered the following error:
File "pyarrow/_hdfs.pyx", line 96, in pyarrow._hdfs.HadoopFileSystem.__init__
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: HDFS connection failed
After some debugging, we discovered that the airflow-devenv-username-task-shell-xxxxx pods do not have the CLASSPATH environment variable set:
runuser@airflow-dev-astein-task-shell-b4dbb8586-sd942:/opt/airflow$ export | grep -i classpath
runuser@airflow-dev-astein-task-shell-b4dbb8586-sd942:/opt/airflow$
After manually setting the following, the pyarrow and fsspec code worked as expected:
runuser@airflow-dev-astein-task-shell-b4dbb8586-sd942:/opt/airflow$ export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`
Please include this environment variable in the Airflow Helm chart deployments so that it is available to airflow-devenv DAGs and their tasks.
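Until the chart is updated, a task-level workaround along these lines might help. This is only a sketch, not something we have deployed: it assumes HADOOP_HOME is set in the pod and that the hadoop binary lives under $HADOOP_HOME/bin, as in the manual fix above. The helper names hadoop_classpath and ensure_classpath, and the runner parameter (there only so the command can be stubbed out), are made up for illustration.

```python
import os
import subprocess


def hadoop_classpath(hadoop_home, runner=subprocess.check_output):
    """Return the output of `hadoop classpath --glob`, stripped of whitespace."""
    hadoop_bin = os.path.join(hadoop_home, "bin", "hadoop")
    return runner([hadoop_bin, "classpath", "--glob"], text=True).strip()


def ensure_classpath(env=os.environ, runner=subprocess.check_output):
    """Set CLASSPATH from HADOOP_HOME if it is missing or empty.

    Returns the CLASSPATH value in effect. An existing non-empty value
    is left untouched. Raises KeyError if HADOOP_HOME is not set.
    """
    if not env.get("CLASSPATH"):
        env["CLASSPATH"] = hadoop_classpath(env["HADOOP_HOME"], runner)
    return env["CLASSPATH"]
```

The task would call ensure_classpath() before the first HDFS connection is opened through pyarrow or fsspec, since the variable has to be present in the process environment when the connection is attempted.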