
Crash of artifact-cache in scap deploy context
Closed, Resolved · Public

Description

I wanted to deploy some new code for airflow-dags/analytics. The config/artifacts.yaml contained one additional artifact, and the deployment crashed with an AttributeError.

aqu@deploy1002:/srv/deployment/airflow-dags/analytics$ scap deploy "T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@$(git rev-parse --short HEAD)]"
12:47:49 Started deploy [airflow-dags/analytics@cae0024]
12:47:49 Deploying Rev: HEAD = cae0024bdf0f517c0c2e4384705a76ccfc787293
12:47:49 Started deploy [airflow-dags/analytics@cae0024]: T302876_migrate_mediarequest_to_airflow [airflow-dags/analytics@cae0024]
12:47:49
== DEFAULT ==
:* an-launcher1002.eqiad.wmnet
airflow-dags/analytics: fetch stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
airflow-dags/analytics: config_deploy stage(s): 100% (in-flight: 0; ok: 1; fail: 0; left: 0)
12:47:57 ['/usr/bin/scap', 'deploy-local', '-v', '--repo', 'airflow-dags/analytics', '-g', 'default', 'promote', '--refresh-config'] (ran as analytics@an-launcher1002.eqiad.wmnet
) returned [1]: Could not chdir to home directory /nonexistent: No such file or directory
Executing check 'artifacts_sync'
Check 'artifacts_sync' failed: Traceback (most recent call last):
  File "/usr/lib/airflow/bin/artifact-cache", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/airflow/lib/python3.7/site-packages/workflow_utils/artifact/cli.py", line 30, in main
    artifact.cache_put(force=args['--force'])
  File "/usr/lib/airflow/lib/python3.7/site-packages/workflow_utils/artifact/artifact.py", line 65, in cache_put
    cache.put(self.id, self.source.open(self.id), force=force)
  File "/usr/lib/airflow/lib/python3.7/site-packages/workflow_utils/artifact/cache.py", line 113, in put
    with self.open(artifact_id) as output:
  File "/usr/lib/airflow/lib/python3.7/site-packages/workflow_utils/artifact/cache.py", line 108, in open
    return fsspec.open(url, mode='wb').open()
  File "/usr/lib/airflow/lib/python3.7/site-packages/fsspec/core.py", line 150, in open
    out.close = close
AttributeError: can't set attribute

... Then the deploy rolled back.

Later, running the command directly worked:

aqu@an-launcher1002:/srv/deployment/airflow-dags/analytics$ sudo -u analytics \
    /usr/local/bin/kerberos-run-command analytics \
    /usr/lib/airflow/bin/artifact-cache warm \
    /srv/deployment/airflow-dags/analytics/wmf_airflow_common/config/artifact_config.yaml \
    /srv/deployment/airflow-dags/analytics/analytics/config/artifacts.yaml
Artifact(refinery-job-0.1.23-shaded):
        hdfs:///wmf/cache/artifacts/airflow/org.wikimedia.analytics.refinery.job_refinery-job_jar_shaded_0.1.23 (exists=True)
        https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/job/refinery-job/0.1.23/refinery-job-0.1.23-shaded.jar       (exists=True)
Artifact(refinery-job-0.1.24-shaded):
        hdfs:///wmf/cache/artifacts/airflow/org.wikimedia.analytics.refinery.job_refinery-job_jar_shaded_0.1.24 (exists=True)
        https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/job/refinery-job/0.1.24/refinery-job-0.1.24-shaded.jar       (exists=True)
Artifact(refinery-hive-0.1.25-shaded):
        hdfs:///wmf/cache/artifacts/airflow/org.wikimedia.analytics.refinery.hive_refinery-hive_jar_shaded_0.1.25       (exists=True)
        https://archiva.wikimedia.org/repository/releases/org/wikimedia/analytics/refinery/hive/refinery-hive/0.1.25/refinery-hive-0.1.25-shaded.jar    (exists=True)

The next scap deploy worked.
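
For context on the AttributeError itself: the traceback ends in fsspec's OpenFile.open() trying to reassign close on the file object it got back for the hdfs:// URL, and that object does not allow the assignment. Below is a minimal, self-contained illustration of that failure mode; it is not the actual fsspec or pyarrow code, just a property without a setter, which produces the same message on Python 3.7.

class ReadOnlyCloseFile:
    # Stand-in for a file-like object whose `close` cannot be reassigned.
    @property
    def close(self):
        return lambda: None

f = ReadOnlyCloseFile()
f.close = lambda: None  # AttributeError: can't set attribute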

Event Timeline

Same error today, but I may have found a pattern:

  1. scap deploy some code with a newly declared artifact https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/e177d87de0af1b0a2f025702b1a2cde74ae551d8#55aae31b9a4ff2843ef16aacd5af588dab5b3e70_21_22
  2. crash with the same error as described on 4/11
  3. scap rollback
  4. scap deploy again
  5. now it works

Btw, I noticed that workflow_utils was not up to date on an-launcher1002.eqiad.wmnet, in /usr/lib/airflow/lib/python3.7/site-packages/workflow_utils.

How to reproduce it manually, right now, on an-launcher1002:

hdfs dfs -rm /wmf/cache/artifacts/airflow/org.wikimedia.analytics.refinery.hive_refinery-hive_jar_shaded_0.1.27

# Run the following twice:
/usr/lib/airflow/bin/artifact-cache warm \
  /srv/deployment/airflow-dags/analytics/wmf_airflow_common/config/artifact_config.yaml \
  /srv/deployment/airflow-dags/analytics/analytics/config/artifacts.yaml
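
A hedged guess at why the second run succeeds: the warm step appears to skip artifacts whose cache URL already exists (see the exists=True lines above), so if the first, crashing run has already created the target path in HDFS, the second run treats the artifact as cached and never reaches the failing put. A sketch of that loop, with purely illustrative names rather than the real workflow_utils API:

from dataclasses import dataclass

@dataclass
class CachedArtifact:
    artifact_id: str
    exists_in_cache: bool  # stand-in for the hdfs:// existence check

    def cache_put(self):
        print(f"caching {self.artifact_id}")
        self.exists_in_cache = True

def warm(artifacts):
    # Only put artifacts that are not already present in the cache.
    for artifact in artifacts:
        if not artifact.exists_in_cache:
            artifact.cache_put()

warm([CachedArtifact("refinery-hive-0.1.27-shaded", exists_in_cache=False)])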

This seems to be a bug with fsspec plus the new pyarrow API. I think we have to go back to not using the new pyarrow API for now; we can just avoid calling fsspec_use_new_pyarrow_api in the artifact-cache script. I will make a patch.
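
For reference, and separate from the change proposed above (dropping the new pyarrow API opt-in): the crash only happens on the OpenFile.open() code path, where fsspec reassigns close on the returned file object. Below is a hedged sketch of a put-style helper that writes through the OpenFile as a context manager instead, on the assumption that this path avoids the reassignment; the function name and copy logic are illustrative, not the actual workflow_utils code.

import shutil
import fsspec

def cache_put_sketch(url, source):
    """Copy a readable file-like `source` to `url` (e.g. an hdfs:// cache path)."""
    # Entering the OpenFile opens the underlying file; cleanup happens on
    # exit, so nothing needs to reassign `close` on the file object.
    with fsspec.open(url, mode="wb") as output:
        shutil.copyfileobj(source, output)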

Okay, fixed and deployed. All artifacts should be synced now.

The fixes and improvements are in this MR: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils/-/merge_requests/23