
Broken DAG error when trying to import a GitLab .tgz file into Airflow
Closed, Resolved | Public

Description

I'm trying to develop a new DAG on the analytics airflow instance to automate the differential privacy pageview release.

To import a custom package into my dev Airflow instance, I added a new entry to artifacts.yaml that fetches the package from an external GitLab URL. However, when I run ./run_dev_instances.sh, I get the following error:

DAG Import Errors (1)

Broken DAG: [/srv/home/htriedman/airflow-dags/analytics/dags/differential_privacy/country_project_page_daily.py] Traceback (most recent call last):
  File "/srv/home/htriedman/airflow-dags/wmf_airflow_common/operators/spark.py", line 270, in for_virtualenv
    **kwargs
  File "/srv/home/htriedman/airflow-dags/wmf_airflow_common/hooks/spark.py", line 500, in kwargs_for_virtualenv
    f'virtualenv_archive must be a .zip, .tar.gz or .tgz file. Was {virtualenv_archive}'
ValueError: virtualenv_archive must be a .zip, .tar.gz or .tgz file. Was hdfs:///wmf/cache/artifacts/airflow/analytics/https___gitlab.wikimedia.org_repos_security_differential-privacy_-_package_files_716_download

What background do I need to understand why this doesn't work? What could I change here to fix this bug? Thanks so much.

Event Timeline

A workaround is available for this issue by not using the artifacts() helper function, and declaring the tarball HDFS URI directly in the DAG.
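
For background, the traceback suggests that the check is purely a filename-suffix test. The cached artifact name in the error message ends in "download" (apparently carried over from the GitLab download URL), so it has no recognizable archive extension. The sketch below is a hypothetical reconstruction of that check, not the actual code in wmf_airflow_common/hooks/spark.py; the function name `check_virtualenv_archive` and the `direct` URI are assumptions made up for illustration.

```python
# Hypothetical reconstruction of the suffix check implied by the traceback.
# The real implementation in wmf_airflow_common/hooks/spark.py may differ.
ALLOWED_SUFFIXES = (".zip", ".tar.gz", ".tgz")


def check_virtualenv_archive(virtualenv_archive: str) -> str:
    """Return the URI unchanged if it ends in a supported archive suffix."""
    # str.endswith accepts a tuple and matches any of its entries.
    if not virtualenv_archive.endswith(ALLOWED_SUFFIXES):
        raise ValueError(
            "virtualenv_archive must be a .zip, .tar.gz or .tgz file. "
            f"Was {virtualenv_archive}"
        )
    return virtualenv_archive


# Cached artifact name copied from the error message: it ends in "download".
cached = (
    "hdfs:///wmf/cache/artifacts/airflow/analytics/"
    "https___gitlab.wikimedia.org_repos_security_differential-privacy"
    "_-_package_files_716_download"
)

# Made-up direct HDFS URI, for illustration only: it ends in ".tgz".
direct = "hdfs:///path/to/differential-privacy.tgz"

for uri in (cached, direct):
    try:
        check_virtualenv_archive(uri)
        print("accepted:", uri)
    except ValueError as err:
        print("rejected:", err)
```

Under these assumptions, a URI that names the tarball directly passes the check, which is consistent with the workaround of declaring the HDFS URI in the DAG.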

This is likely no longer an issue, as I have not hit it recently. Do you agree, @Htriedman?

xcollazo claimed this task.