Facilitate automatic artifact cache warming for airflow-dags artifacts
Closed, Resolved · Public

Description

Implement a mechanism that will, on MR merge into the airflow-dags GitLab repository, facilitate an automatic "cache warming" of artifacts found in artifacts.yaml files of all the different Airflow deployments supported in that repository.

"Cache warming" means to pre-emptively transfer an artifact file from its source into its configured cache location(s). Most of the artifacts configured for use in airflow-dags are sourced from Maven repositories, and are configured to be cached in HDFS. Our artifact libraries support many other types of artifact sources and caches though: https://gitlab.wikimedia.org/repos/data-engineering/workflow_utils

For each artifacts.yaml file across the Airflow deployment directories, and for each artifact configured in it, this automatic mechanism should (see the sketch after this list):

  1. Determine the cache location(s) of the artifact
  2. Check if the artifact is present in the cache location(s)
  3. If not present, copy the artifact from its source into the cache location(s)
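
As a concrete illustration, here is a minimal Python sketch of that loop. The artifacts.yaml layout assumed here (a per-artifact source plus a list of cache locations) and the two helper functions are placeholders for illustration only; the real logic belongs in the workflow_utils library linked above.

from pathlib import Path

import yaml


def artifact_is_cached(artifact_name: str, cache_url: str) -> bool:
    """Stub: a real check would ask the cache backend (e.g. HDFS) directly."""
    raise NotImplementedError


def copy_to_cache(source_url: str, cache_url: str) -> None:
    """Stub: a real copy would stream the artifact from its source into the cache."""
    raise NotImplementedError


def warm_all_caches(repo_root: Path) -> None:
    # One artifacts.yaml per Airflow deployment, e.g. analytics/config/artifacts.yaml.
    for config_path in repo_root.glob("*/config/artifacts.yaml"):
        config = yaml.safe_load(config_path.read_text())
        for name, spec in (config.get("artifacts") or {}).items():
            for cache_url in spec.get("cache_locations", []):  # 1. cache location(s)
                if not artifact_is_cached(name, cache_url):    # 2. presence check
                    copy_to_cache(spec["source"], cache_url)   # 3. warm the cache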

Done is:

  • There is a GitLab CI/CD pipeline job in airflow-dags that facilitates the above described mechanism for cache warming
  • This pipeline job executes automatically when a merge request is merged into the main branch
  • This pipeline job processes all of the artifacts.yaml configurations in the airflow-dags repository
  • This pipeline job supports sourcing artifacts from all configured sources, and putting them in all configured caches

Event Timeline

Copy-pasting here a discussion from this GitLab code review: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1309.

@xcollazo:
we do need this analytics/config/artifacts.yaml file still, unfortunately.
And we need to keep it in sync with main/config/artifacts.yaml, to be able to deploy artifacts via scap deploy as per docs.
I would of course love to get rid of this madness.

@JAllemandou:
I'd like to advocate for a change of procedure for the deployment of JARs onto the cache.
Instead of keeping the analytics and main artifacts.yaml files in sync, I suggest we document manually syncing artifacts when needed, using the command:

/usr/lib/airflow/bin/artifact-cache warm /path/to/artifact.yaml

We would need to run this command on the Airflow pod for the config folder to be present; I'm sure this would be feasible.
Is this an acceptable idea?

@xcollazo:
A refinement of this idea: why can't we have the sync process that runs every 5 minutes also call that script? I'd rather it be automated.

We have been discussing this in this Slack thread and I think we have a workable solution.

The outcome would be as follows:

  • The git-sync pod in each instance would run artifact-cache warm /path/to/artifact.yaml after every change to the main branch, or its configured feature branch.
  • Therefore, both DAGs and artifacts would be deployed automatically every 5 minutes.

In order to make this happen, we would use the --exechook-command option of git-sync. (Docs: https://github.com/kubernetes/git-sync?tab=readme-ov-file#manual)
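
As a sketch of how this could be wired up, the exechook command could point at a small Python script like the one below. The script, the ARTIFACTS_CONFIG variable, and the assumption that git-sync runs the hook from the root of the freshly synced worktree are all illustrative, not taken from the chart:

#!/usr/bin/env python3
"""Hypothetical git-sync exechook: warm this instance's artifact cache.

Assumes git-sync invokes the hook with the synced airflow-dags worktree as its
working directory after each new hash, and that the prerequisites listed below
(Kerberos credential cache, Hadoop config, artifact-cache binary, HDFS network
access) are in place. ARTIFACTS_CONFIG is a made-up variable naming this
instance's artifacts.yaml, e.g. analytics/config/artifacts.yaml.
"""
import os
import subprocess

config = os.environ.get("ARTIFACTS_CONFIG", "config/artifacts.yaml")
subprocess.run(
    ["/usr/lib/airflow/bin/artifact-cache", "warm", config],
    check=True,  # a non-zero exit from warming surfaces as a hook failure
)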

We would need the following additional elements in place:

  • Access to the Kerberos credential cache from the git-sync pod.
  • Access to the Hadoop config files from the git-sync pod.
  • Access to the artifact-cache binary from the git-sync pod.
  • Network policies permitting git-sync to access HDFS.
  • A boolean to enable/disable artifact sync, in case a particular instance owner wishes to disable it.

I think that would be enough to allow us to stop using scap for artifact sync and would mean that we can decommission all of the Airflow VMs.
@amastilovic @xcollazo @JAllemandou - What do you think?

I like this a lot, but I am afraid of hitting T391123 all the time. (TL;DR: If we run sync every 5 minutes, all Gitlab artifacts will *also* be redownloaded every 5 minutes, regardless of whether they changed or not, because of a Gitlab bug.) So it feels like we need a solution to that bug first before moving forward with this?

+1 - I also think we need a solution before moving forward. Would a simple (not perfect) check of existence be enough for now?

Agreed. I think that was also @amastilovic's suggestion in today's sync-up.
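
For illustration, such a check could be as small as the sketch below. The idea is only to skip re-downloading anything that already sits at the cache path; wiring it in front of the warming step is an assumption on my part.

import subprocess


def cached_artifact_exists(hdfs_path: str) -> bool:
    """Imperfect existence check: `hdfs dfs -test -e` exits 0 if the path exists."""
    return subprocess.run(["hdfs", "dfs", "-test", "-e", hdfs_path]).returncode == 0

This would stop the redundant five-minute re-downloads described in T391123, at the cost of never noticing an artifact whose content changed behind an unchanged name.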

The manual for git-sync says this:

--exechook-command <string>, $GITSYNC_EXECHOOK_COMMAND
            An optional command to be executed after syncing **a new hash** of the
            remote repository.

(With emphasis on a new hash)

So it won't be re-downloading everything every 5 minutes. It will only run artifact-cache warm once, each time there is a new commit to main.
This makes me think that you wouldn't run into T391123 any more than happens now, but I may be missing something.

I’d like to hold off on decommissioning an-launcher1002 for now. It still hosts the analytics and hdfs sudoable users that I (and possible others) rely on for development and debugging. Can we discuss an alternative plan or ensure equivalent access elsewhere before removing it?

Good call.
an-launcher1002 also handles the various systemd-timers we have left, so no decom' before all that has been resolved :)

Change #1159563 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add our legacy archiva instance to kubernetes external_services

https://gerrit.wikimedia.org/r/1159563

Change #1159579 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow blunderbuss to contact archiva

https://gerrit.wikimedia.org/r/1159579

Change #1159563 merged by Btullis:

[operations/puppet@production] Add our legacy archiva instance to kubernetes external_services

https://gerrit.wikimedia.org/r/1159563

Change #1159579 merged by jenkins-bot:

[operations/deployment-charts@master] Allow blunderbuss to contact archiva

https://gerrit.wikimedia.org/r/1159579

OK, so now that we have the necessary updates to the workflow_utils library that enable cache warming at the base level, I think we should revisit the decision on which mechanism to use to facilitate it.

git-sync with workflow_utils CLI

Summary: git-sync is configured to periodically pull the latest changes from the airflow-dags repository. A git-sync hook runs the workflow_utils CLI cache warming command whenever repository files have changed since the last sync.

Pros:

  1. Simple concept, very few moving parts.

Cons:

  1. Limited visibility: it is not easy to check whether all artifacts have been successfully warmed in the cache; users must switch context away from GitLab and its CI/CD interface and log into a K8s pod to read its logs.
  2. Will require a considerable update to the git-sync Helm deployment: a new Docker image containing git-sync plus the JRE and Hadoop libraries, additional environment variables, and a proper workflow_utils installation.

Blunderbuss cache warming feature

Summary: On a merge-to-main event, the Blunderbuss service receives a call from GitLab CI/CD to its cache warming endpoint. It performs cache warming using the workflow_utils library and returns the status back to GitLab CI/CD (a sketch of this post/poll pattern follows the cons list below).

Pros:

  1. Greater visibility directly in GitLab's user interface: cache warming is just another job in the CI/CD pipeline with its red/green status, and its log messages are easily accessible.
  2. Already configured to work in a Helm/K8s environment.

Cons:

  1. A more complicated system with more moving parts.
  2. Blunderbuss CI/CD component needs an update to support the new post/poll usage pattern.
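
To make the post/poll pattern concrete, a CI-side client could look roughly like the sketch below. The endpoint paths, payload, and response fields are all hypothetical, since the Blunderbuss API update is precisely the work item named in con 2:

import time

import requests

BLUNDERBUSS = "https://blunderbuss.example.org"  # placeholder base URL


def warm_and_wait(artifacts_yaml: str, timeout_s: int = 600) -> None:
    # POST kicks off warming; the hypothetical endpoint returns a task id.
    task = requests.post(f"{BLUNDERBUSS}/api/warm-cache",
                         json={"config": artifacts_yaml}).json()
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = requests.get(f"{BLUNDERBUSS}/api/tasks/{task['id']}").json()
        if status["state"] == "SUCCESS":
            return  # green CI job
        if status["state"] == "FAILURE":
            raise RuntimeError(status)  # red CI job, details in the job log
        time.sleep(10)
    raise TimeoutError("cache warming did not finish in time")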

The only issue I have with this is that an-launcher1002 is now well beyond its EoL and DC-Ops have been asking us when we can decommission it.
It's on this EoL servers spreadsheet and was purchased in 2018.

How about if we were to create a new virtual machine, say an-launcher1003, and put it in the same puppet role: analytics_cluster::launcher?
This would give you a place to use the hdfs and analytics users and it would have the same systemd timers as an-launcher1002.
We would have to be careful not to duplicate the timers during a switch-over, but it should be relatively easy now.

Would that be OK for you, @JAllemandou @Antoine_Quhen?
See also: T353786: Decommission an-launcher1002

We discussed this in a recent sync meeting and the consensus was to proceed with the blunderbuss approach.
The only reservation was about the use of an SQLite database on an ephemeral pod file system as the backing store for the huey task queue.

Personally, I'd prefer a backing store that supports persistence, whether that be Redis or a CephFS-backed file system.
I know that we are unlikely to need to scale this service up very much, but it still feels to me like we should try to make the task queue backend for blunderbuss as robust as possible.
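
For concreteness, the two backing-store options compare roughly as follows in huey itself; the queue name, file path, and Redis URL are placeholders:

from huey import RedisHuey, SqliteHuey

# Current approach: an SQLite file on the pod's ephemeral filesystem.
# Queued and in-flight tasks vanish whenever the pod is rescheduled.
ephemeral_queue = SqliteHuey("blunderbuss", filename="/tmp/blunderbuss-huey.db")

# Persistent alternative: a Redis backend survives pod restarts; pointing
# the SQLite file at a CephFS-backed volume would achieve the same.
persistent_queue = RedisHuey("blunderbuss", url="redis://redis:6379/0")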

amastilovic updated the task description.

Change #1171732 had a related patch set uploaded (by Aleksandar Mastilovic; author: Aleksandar Mastilovic):

[operations/deployment-charts@master] Blunderbuss helm chart that works with the new Blunderbuss versions

https://gerrit.wikimedia.org/r/1171732

Change #1171732 merged by Brouberol:

[operations/deployment-charts@master] Blunderbuss helm chart that works with the new Blunderbuss versions

https://gerrit.wikimedia.org/r/1171732