
Airflow Dev Env Task Shell Pod Missing CLASSPATH Environment Variable
Closed, Resolved (Public)

Description

While testing a very simple DAG with a single task that uses pyarrow and fsspec to download a file from HDFS to the local filesystem, we encountered the following error:

  File "pyarrow/_hdfs.pyx", line 96, in pyarrow._hdfs.HadoopFileSystem.__init__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: HDFS connection failed

After some debugging, we discovered that the airflow-devenv-username-task-shell-xxxxx pods do not have the CLASSPATH environment variable set:

runuser@airflow-dev-astein-task-shell-b4dbb8586-sd942:/opt/airflow$ export | grep -i classpath
runuser@airflow-dev-astein-task-shell-b4dbb8586-sd942:/opt/airflow$

After manually setting the following, the pyarrow and fsspec code worked as expected:

runuser@airflow-dev-astein-task-shell-b4dbb8586-sd942:/opt/airflow$ export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob`

Please include this env var in Airflow's Helm chart deployments so that it is available to airflow-devenv DAGs and their tasks.

Event Timeline

Some more info: I was able to connect to HDFS and get file info in the airflow devenv without CLASSPATH by using wmf_airflow_common.util.hdfs_client. I don't know if that's the preferred way to interact with HDFS. The code now looks like:

@task(executor_config=executor_config_with_proxy)
def download_from_hdfs(hdfs_url: str) -> str:
    hdfs = util.hdfs_client(hadoop_name_node)
    file_info = hdfs.get_file_info(hdfs_url)

Note that wmf_airflow_common.util.hdfs_client works because it sets the CLASSPATH itself - it would be preferable for this variable to be set in the deployment chart instead:

def hdfs_client(name_node: str) -> HadoopFileSystem:
    """Returns an HDFS client for the given Hadoop name node.
    Args:
        name_node   URI for Hadoop name node.
                    (i.e.: hdfs://analytics-hadoop)
    """
    stream = os.popen("hdfs classpath --glob")
    os.environ["CLASSPATH"] = stream.read().strip()
    return HadoopFileSystem(name_node)
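
For illustration only, here is a minimal sketch of a more defensive variant of the same idea: it leaves CLASSPATH untouched when the deployment already provides it and fails loudly when the `hdfs classpath --glob` command errors out or prints nothing. The names `ensure_classpath`, `_run_hdfs_classpath`, `env` and `resolver` are hypothetical and are not part of wmf_airflow_common. The sketch assumes the `hdfs` binary is on the PATH, as in the function above.

```python
import os
import subprocess
from typing import Callable, MutableMapping, Optional


def _run_hdfs_classpath() -> str:
    """Run `hdfs classpath --glob` and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        ["hdfs", "classpath", "--glob"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def ensure_classpath(
    env: Optional[MutableMapping[str, str]] = None,
    resolver: Callable[[], str] = _run_hdfs_classpath,
) -> str:
    """Set CLASSPATH only if it is missing or empty, and return its value.

    `env` and `resolver` are hypothetical injection points so that the
    logic can be exercised without a Hadoop installation.
    """
    if env is None:
        env = os.environ

    # Respect a value already provided by the image or the chart.
    current = env.get("CLASSPATH", "").strip()
    if current:
        return current

    classpath = resolver().strip()
    if not classpath:
        raise RuntimeError("`hdfs classpath --glob` returned an empty classpath")

    env["CLASSPATH"] = classpath
    return classpath
```

Calling `ensure_classpath()` before constructing `HadoopFileSystem(name_node)` would behave like the function above when the variable is absent and would be a no-op when the variable is already set in the container.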

Change #1299527 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: export the CLASSPATH environment variable into the task-pod shell

https://gerrit.wikimedia.org/r/1299527

Change #1299525 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: export the CLASSPATH environment variable into the task-pod shell

https://gerrit.wikimedia.org/r/1299525

Change #1299525 abandoned by Brouberol:

[operations/deployment-charts@master] airflow: export the CLASSPATH environment variable into the task-pod shell

https://gerrit.wikimedia.org/r/1299525

brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2283

docker: export the CLASSPATH environment variables to all containers using the airflow image

brouberol changed the task status from Open to In Progress. Wed, Jun 10, 7:00 AM
brouberol claimed this task.

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2283

docker: export the CLASSPATH environment variables to all containers using the airflow image

https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/jobs/855777 published

docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.11.2-py3.11-2026-06-10-112304-329cb255db7013430d0d66d170db6803f44d855d@sha256:a993deaf913d7443a7d48e6740c265556bb62056a87a5be9165bf70042d8459a

which should contain all required environment variables. @amastilovic can I let you test it in a devenv?

Change #1300131 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: add ARROW_LIBHDFS_DIR/LD_LIBRARY_PATH to the of hadoop env vars

https://gerrit.wikimedia.org/r/1300131

Change #1300131 merged by Brouberol:

[operations/deployment-charts@master] airflow: add ARROW_LIBHDFS_DIR/LD_LIBRARY_PATH to the of hadoop env vars

https://gerrit.wikimedia.org/r/1300131

Change #1300147 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: upgrade the image

https://gerrit.wikimedia.org/r/1300147

Change #1300147 merged by Brouberol:

[operations/deployment-charts@master] airflow: upgrade the image

https://gerrit.wikimedia.org/r/1300147

We managed to make it work by hardcoding CLASSPATH in the image and setting the other env vars in the chart.

All airflow instances have been redeployed with the CLASSPATH / LD_LIBRARY_PATH / ARROW_LIBHDFS_DIR env vars. We should be good to close.
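
To verify this from inside a pod, a small preflight check along the following lines could be run in a task or in the task shell. This is a sketch rather than anything that ships with the image: the function name `missing_hadoop_env_vars` and the constant `REQUIRED_HADOOP_ENV_VARS` are hypothetical, and the check only confirms that the three variables named above are non-empty, not that their values are correct.

```python
import os
from typing import List, Mapping, Optional

# The three variables mentioned in this task; the constant name is hypothetical.
REQUIRED_HADOOP_ENV_VARS = ("CLASSPATH", "LD_LIBRARY_PATH", "ARROW_LIBHDFS_DIR")


def missing_hadoop_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or blank in `env`.

    Defaults to the current process environment when `env` is None.
    """
    if env is None:
        env = os.environ
    return [
        name
        for name in REQUIRED_HADOOP_ENV_VARS
        if not env.get(name, "").strip()
    ]


if __name__ == "__main__":
    missing = missing_hadoop_env_vars()
    if missing:
        print("Missing Hadoop env vars: " + ", ".join(missing))
    else:
        print("All Hadoop env vars are set")
```

An empty list means all three variables are present. Any names returned point to a pod that predates the redeployment or to a container that does not use the updated image.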

Change #1299527 abandoned by Brouberol:

[operations/deployment-charts@master] airflow: export the CLASSPATH environment variable into the task-pod shell

Reason:

Sorry, I should have closed this one. This was made redundant by https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/2288 and https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1300131

https://gerrit.wikimedia.org/r/1299527