Implement a mechanism that, whenever an MR is merged into the airflow-dags GitLab repository, automatically "cache warms" the artifacts listed in the artifacts.yaml files of all the Airflow deployments supported in that repository.
"Cache warming" means pre-emptively transferring an artifact file from its source into its configured cache location(s). Most of the artifacts configured for use in airflow-dags are sourced from Maven repositories and are configured to be cached in HDFS. However, our artifact libraries support many other types of artifact sources and caches: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils
For each artifacts.yaml file in every Airflow deployment directory, and for each artifact configured in that file, the mechanism should:
- Determine the cache location(s) of the artifact
- Check if the artifact is present in the cache location(s)
- If not present, copy the artifact from its source into each cache location that lacks it (typically HDFS)
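
The three steps above can be sketched roughly as follows. This is only an illustration: the `Source` and `Cache` interfaces, the in-memory stand-ins, and `warm_artifact` are hypothetical names invented here and are not the workflow_utils API. A real job would presumably delegate to the source and cache implementations that workflow_utils already provides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class Source(Protocol):
    """Hypothetical artifact source (e.g. a Maven repository)."""

    def fetch(self, artifact_id: str) -> bytes: ...


class Cache(Protocol):
    """Hypothetical artifact cache (e.g. an HDFS directory)."""

    def exists(self, artifact_id: str) -> bool: ...

    def put(self, artifact_id: str, data: bytes) -> None: ...


@dataclass
class InMemorySource:
    """Stand-in source used only to make the sketch runnable."""

    files: Dict[str, bytes]

    def fetch(self, artifact_id: str) -> bytes:
        return self.files[artifact_id]


@dataclass
class InMemoryCache:
    """Stand-in cache used only to make the sketch runnable."""

    store: Dict[str, bytes] = field(default_factory=dict)

    def exists(self, artifact_id: str) -> bool:
        return artifact_id in self.store

    def put(self, artifact_id: str, data: bytes) -> None:
        self.store[artifact_id] = data


def warm_artifact(artifact_id: str, source: Source, caches: List[Cache]) -> List[Cache]:
    """Copy the artifact into every cache that lacks it.

    Returns the caches that were written to (empty if already warm).
    """
    # Steps 1 and 2: check each configured cache location for the artifact.
    missing = [cache for cache in caches if not cache.exists(artifact_id)]
    if not missing:
        return []
    # Step 3: fetch from the source once, then write to each missing cache.
    data = source.fetch(artifact_id)
    for cache in missing:
        cache.put(artifact_id, data)
    return missing
```

Because the artifact is only fetched when at least one cache lacks it, re-running the job on an already warm cache would not touch the source at all, which keeps repeated pipeline runs cheap.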
Done is:
- There is a GitLab CI/CD pipeline job in airflow-dags that implements the cache warming mechanism described above
- This pipeline job executes automatically whenever a merge request is successfully merged into the main branch
- This pipeline job processes all of the artifacts.yaml configurations in the airflow-dags repository
- This pipeline job supports sourcing artifacts from all configured sources, and putting them in all configured caches
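
To process every artifacts.yaml configuration, the job first has to find them. A minimal sketch of that discovery step is shown below; it assumes the files can be located by name anywhere under the repository checkout, and `find_artifact_configs` is a hypothetical helper rather than an existing function.

```python
from pathlib import Path
from typing import List


def find_artifact_configs(repo_root: Path, filename: str = "artifacts.yaml") -> List[Path]:
    """Return every file named `filename` under repo_root, in sorted order.

    Assumption: each Airflow deployment directory keeps its artifact
    configuration in a file with this exact name.
    """
    return sorted(path for path in repo_root.rglob(filename) if path.is_file())
```

Each returned path would then be handed to the cache warming step, so adding a new deployment directory needs no change to the pipeline job itself.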