
Evaluate Airflow's suitability for CI
Closed, Resolved · Public

Description

Airflow is currently in use by Search Platform for orchestrating ML workloads. It has been suggested as a tool that may be suitable for use as part of the CI infrastructure (cf: Seakeeper proposal).

We should initially evaluate this project using the same set of criteria used in the initial CI evaluations (as in T217325); i.e., using the CI Architecture document as a guide.

Further investigation may be warranted depending on initial evaluation outcome.

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani created this task.

Assigning to @LarsWirzenius per discussion.

'currently in use' as of last week in trial mode :)

Argo, Airflow, and CI

@thcipriani, @dduvall, and I met and discussed using Apache Airflow for the new
CI system. We decided against it. Airflow seems like an interesting
system, but it's not a CI system. It seems like it could be a building
block if we wanted to build our own CI system, but we'd rather not.
We'd rather use something like Argo, which is itself a CI system
already, not just a building block. We may need to build some
integration between Argo and Gerrit, and possibly some other things,
but for Airflow we'd need to build most of a CI system and that
doesn't seem like a sensible thing for us to do.

We also discussed division of responsibilities for running a CI based
on Argo. Dan had made a suggestion in his Seakeeper proposal document
where some responsibilities for Argo were marked as being for SRE. It
seems to us now that this would require SRE to understand Argo perhaps
more deeply than they would like to, and that RelEng should have the
full responsibility instead. RelEng would need SRE to provide a K8s
cluster on which to run CI, and allow CI jobs in that cluster to have
deployment capability to production. We're amending the Seakeeper
proposal accordingly.

We note the concern that if Argo moves forward fast and starts
depending on newer versions of K8s than what we have at the
foundation, this may prevent us from keeping up with newer versions of
Argo. This can, of course, happen to any software aimed at K8s,
whether it's CI or something else. We don't know how big a risk this
is.

We'd rather use something like Argo, which is itself a CI system already, not just a building block

For accuracy, IIUC Argo is a building block, from which Argo-CI is built. But point taken, yeah: we'd have to build an 'Airflow-CI', which does not sound fun :)

Aye, Argo is what we call it, though it's actually many components.

We also have no opinion on whether Airflow is suitable for others. It doesn't really seem like Argo is necessarily good for the things others in this discussion need.

For accuracy, IIUC Argo is a building block, from which Argo-CI is built.

Not exactly! (Though it's a very reasonable assumption to make.)

Argo CI was an integration of Argo [Workflow], Argo UI, and a NodeJS service that integrated with various Git sources using webhooks. It's no longer maintained and was superseded in functionality by two different projects, each with their own specialized concerns:

  1. Argo Events which is a much more flexible and generalized system for consuming external events (including webhooks but also Kafka and many other sources), event gating logic, and spawning k8s resources using data from event payloads.
  2. Argo CD which is an opinionated integration of Workflow + UI + its own API and controllers for orchestrating GitOps based deployments.

The setup we evaluated as a CI proof of concept, and which is now being proposed for production use, used Argo Workflow to execute CI tasks and Argo Events to integrate with Gerrit. We didn't evaluate Argo CI or Argo CD: the former because it's largely defunct, and the latter because we're not confident its opinionated CD model would fit our future deployment pipeline needs.

See T218827: Evaluate Argo and T229246: Gerrit/Argo CI proof of concept.

Ah interesting, good to know.

Can Argo Events also produce external events, perhaps via Kafka? Something like that might make integration with other systems easier.

We may have to write a small adaptor, but other than that, yes.

Ah interesting, good to know.

Can Argo Events also produce external events, perhaps via Kafka? Something like that might make integration with other systems easier.

It can consume Kafka events but I'm not sure if it can produce them or directly propagate to other external systems in general—at least I haven't seen anything like that yet.

However, it can listen and act on internal k8s resource state changes and spawn additional resources such as jobs/pods/workflows to handle those changes. I used this kind of setup in the proof of concept to report workflow status back to Gerrit, but it resulted in a ton of overhead: each project workflow completion spun up another pod just to send a single request to Gerrit's API. As @LarsWirzenius mentioned, we're talking about implementing a small persistently running controller to perform this reporting instead, which could conceivably also propagate workflow completion events to something like Kafka.
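To make the controller idea concrete, here is a minimal sketch of its core translation step: mapping a completed Argo Workflow object to a Gerrit review payload. Everything here is illustrative, not the actual implementation; the function name and the assumption that workflows are labeled with the change they were triggered for are made up for this example. Only the shape of Gerrit's set-review body (a message plus label votes such as Verified) follows Gerrit's REST API.

```python
# Hypothetical sketch of the persistent reporting controller discussed
# above: instead of spawning a pod per completed workflow, one process
# would watch Workflow objects and translate terminal states into Gerrit
# review payloads (and, conceivably, Kafka messages).

def review_payload(workflow: dict) -> dict:
    """Map a finished Argo Workflow object to a Gerrit set-review body.

    Assumes the workflow's terminal state is in status.phase
    (Succeeded / Failed / Error), as Argo Workflows reports it.
    """
    phase = workflow["status"]["phase"]
    name = workflow["metadata"]["name"]
    verified = 1 if phase == "Succeeded" else -1
    return {
        "message": f"CI workflow {name} finished: {phase}",
        "labels": {"Verified": verified},
    }

# A real controller would run a watch loop against the k8s API and POST
# each payload to Gerrit's
# /a/changes/{change-id}/revisions/{revision-id}/review endpoint;
# that wiring is omitted here.

completed = {
    "metadata": {"name": "ci-run-42"},
    "status": {"phase": "Succeeded"},
}
print(review_payload(completed))
```

Because this runs as one long-lived process, the per-completion cost is a single HTTP request rather than a whole pod, which is the overhead the proof of concept ran into.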