
Improvements of artifacts cache
Open, Needs Triage, Public

Description

Currently located here: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils
Bundled in: https://gerrit.wikimedia.org/r/admin/repos/operations/debs/airflow
Triggered by scap: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics

Improvements:

  • Create an independent cache store per Airflow instance
  • Warn about unused artifacts when running workflow_utils/artifact/cli/warm, by listing the cache and diffing it against the YAML file
  • Add an evict script (to be used in airflow-dags) that removes from the cache any artifacts that are no longer specified (the ones removed from artifacts.yaml)
  • Maybe move this artifact caching library into its own repo
  • Cached artifacts from GitLab package 'download' links that are archives (e.g. .tgz files) don't work with the SparkSubmitOperator archives param. This param expects archive file names to end in an extension like .tgz so that the archive is unpacked automatically on the workers. This should be fixed. See this MR: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/47#note_6838. Done in MR25.
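
A rough sketch of how the "warn of unused artifacts" and "evict" items could fit together. This is not the workflow_utils API: find_unused and evict are hypothetical names, the cache is assumed to be a flat local directory, and the expected cached file names are assumed to have already been derived from artifacts.yaml (the real cache may live on another filesystem and map artifacts to paths differently).

```python
from pathlib import Path
from typing import Iterable, List, Set


def find_unused(cache_dir: Path, expected_names: Iterable[str]) -> List[Path]:
    """List cached files whose names are not among the expected names.

    Hypothetical helper: assumes a flat local cache directory and that
    expected_names was derived from artifacts.yaml by the caller.
    """
    expected: Set[str] = set(expected_names)
    return sorted(
        path
        for path in cache_dir.iterdir()
        if path.is_file() and path.name not in expected
    )


def evict(
    cache_dir: Path, expected_names: Iterable[str], dry_run: bool = True
) -> List[Path]:
    """Warn about unused cached files, and delete them unless dry_run is set."""
    unused = find_unused(cache_dir, expected_names)
    for path in unused:
        if dry_run:
            # Warn-only mode, as the warm command might do.
            print(f"WARNING: unused cached artifact: {path}")
        else:
            # Evict mode: remove the file from the cache.
            path.unlink()
    return unused
```

With dry_run=True this only reports the difference between the cache listing and the expected names, which matches the warning item; an evict script in airflow-dags could call the same function with dry_run=False.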

Event Timeline

Ottomata renamed this task from Improvements over artifacts cache to Improvements of artifacts cache. May 11 2022, 6:39 PM
Ottomata updated the task description.
Ottomata subscribed.

Thanks Antoine! I just added another needed fix too.

Change 793504 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/debs/airflow@debian] Release 2.1.4-py3.7-5

https://gerrit.wikimedia.org/r/793504

Mentioned in SAL (#wikimedia-analytics) [2022-05-19T16:59:53Z] <ottomata> deploying airflow-dags analytics with new artifact names, first clearing artifacts cache dir - T307115

Change 793504 abandoned by Ottomata:

[operations/debs/airflow@debian] Release 2.1.4-py3.7-5

Reason:

https://gerrit.wikimedia.org/r/793504