
Create a debian package for Apache Airflow
Closed, Resolved · Public

Description

We should package the latest version of https://pypi.org/project/apache-airflow (see the parent task for more info about what we are trying to use it for)

Event Timeline

I don't have permissions to run:

ssh elukey@gerrit.wikimedia.org -p 29418 'gerrit create-project -d "Package Apache Airflow" operations/debs/airflow -o ldap/ops -p operations/debs'

After some chats with Riccardo and Moritz about how to package a Python app with dependencies not in Debian upstream, I ended up discovering that there is no clear way to do it :D

Most of the Python packages are libraries with all their dependencies in Debian; the use cases with dependencies not in Debian are mostly handled via Scap (like we do for Superset, and how Airflow is deployed by Search).

Long term it might be good to avoid using scap to deploy a frozen venv, so an experiment we could do is to create the venv at build time and then add a minimal Debian packaging setup to deploy it to, say, /srv/airflow. I wouldn't add much more, like users or systemd units, since we'd have to override them anyway, so the package should in theory be very straightforward (excluding the build part, of course).
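A minimal sketch of what that build step could look like (illustrative only: the paths, the pinned version, and the final mapping of the venv into /srv/airflow via dh_install are assumptions, and venv relocation quirks are glossed over):

  # Sketch: build a self-contained venv at package build time
  python3 -m venv build/airflow-venv
  build/airflow-venv/bin/pip install --upgrade pip wheel
  build/airflow-venv/bin/pip install 'apache-airflow==2.0.2'   # pin to the release being packaged
  # dh_install (via debian/airflow.install) would then ship build/airflow-venv
  # into the package as /srv/airflow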

Had a chat with Andrew over Meet; I'll try to come up with something similar to anaconda-wmf, since it has a lot of good things that might be useful here. The idea would be to create a conda environment, deploy it as-is to, say, /srv/airflow, and then add the systemd units via puppet. Let's see how it goes!
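A minimal sketch of that flow assuming conda-pack is used (the env name, Python version, and target path are illustrative, not the actual anaconda-wmf tooling):

  # Build an isolated conda env with Airflow and its dependencies
  conda create -y -p ./airflow-env python=3.7
  ./airflow-env/bin/pip install apache-airflow
  # Pack it into a relocatable tarball, then unpack on the target host
  conda pack -p ./airflow-env -o airflow-env.tar.gz
  mkdir -p /srv/airflow && tar -xzf airflow-env.tar.gz -C /srv/airflow
  /srv/airflow/bin/conda-unpack   # fix up prefixes after extraction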

elukey removed elukey as the assignee of this task. Apr 7 2021, 6:35 AM

@Joe @akosiaris q:

I know using docker images in prod outside of k8s is not really done, but...could we? I also know we don't allow users to run docker images for security reasons, but would using puppet's service::docker define to run systemd services be ok or possible?

I ask because it'd be slick to use Deployment Pipeline and Blubber's python variant config to build an image from which we could run airflow services, rather than do a convoluted conda-pack/virtualenv+wheels + scap deployment.
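For illustration, running such an image under systemd would roughly boil down to the host executing something like the following (the image name, tag, and port are hypothetical, and this is not the actual service::docker interface):

  # Hypothetical image and port, purely to show the shape of the setup
  docker pull docker-registry.wikimedia.org/airflow:latest
  docker run --rm --name airflow-webserver -p 8080:8080 \
      docker-registry.wikimedia.org/airflow:latest webserver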

@Joe @akosiaris q:

I know using docker images in prod outside of k8s is not really done, but...could we?

I would advise against it if you want to build a production-level service on top of it. Docker networking (and volumes, albeit a bit less) is quite cumbersome to work with in a production environment. We've had this discussion in the past with other teams (in SRE as well). You will need to build big parts of what k8s already does (e.g. networking, service exposing, health checking, abstracting nodes, etc.) to make it feasible to have a production-level service (that is to say, it's not as easy as just running docker run myimage myargs).

I also know we don't allow users to run docker images for security reasons, but would using puppet's service::docker define to run systemd services be ok or possible?

That puppet construct was created to satisfy the bare minimum needed to have something resembling what gets deployed by the pipeline in deployment-prep (and that was just to keep some things still working), because actually getting the pipeline's components properly into deployment-prep is nigh on infeasible. It's tailored to that (very limited) use case, it's already showing its limits, and I would argue it's not particularly well supported.

Note also that currently, the moment you have docker lying around, obtaining access to the docker socket equals root on the machine.

obtaining access to the docker socket equals root on the machine

This is a big one, ok, sounds good.

it's not as easy as just running docker run myimage myargs

Huh, ok. It was pretty easy in deployment-prep? But there's tons I don't know.

Thanks, no docker then. Will see about conda + debian package

obtaining access to the docker socket equals root on the machine

This is a big one, ok, sounds good.

it's not as easy as just running docker run myimage myargs

Huh, ok. It was pretty easy in deployment-prep?

Sure, but it's a single VM and probably a simple use case. So no trying to recover from failures automatically, no load balancing, no intricate dedicated networking needs, no health checking or node abstractions. Also, there probably wasn't much configuration wanted, or multiple configuration files.

Thanks, no docker then. Will see about conda + debian package

Sorry if I shot down the approach; I can see the allure of using docker/containers instead of other solutions (and in fact I use that approach in small personal projects myself constantly). But the moment requirements increase, it stops being a solution unless work is put in to somehow satisfy those requirements, at which point the question of "why? there's something else that does all this already" pops up.

probably a simple use case

Ah, this is a simple use case too. We'd configure the service with puppet. Surely all the k8s infra we have is better, and we'd use that here if we could.

"why? there's something else that does all this already"

Airflow needs to work with Kerberos, so we can't use k8s. :(

Security concerns aside, using (Deployment Pipeline docker images) + (the usual puppet we do) vs. (conda or virtualenv/pip + debian package or scap) + (the usual puppet we do) still has all the downsides you mention either way: there's all the manual puppet we do for monitoring, load balancing, etc. I was hoping to benefit from Deployment Pipeline docker images to at least make the deployment part easier. The production configuration will be the same with or without it :/

Airflow would actually be perfect for k8s, since it is stateless (all state is in a DB), if only it could work with Kerberos!

Change 693222 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/debs/airflow@debian] Initial debianization and 2.0.2-1~py3.7 release

https://gerrit.wikimedia.org/r/693222

@Ottomata FYI, APT is currently broken on an-test-coord1001; for any operation it gives:

E: The package airflow needs to be reinstalled, but I can't find an archive for it.

As a result, debmonitor is also broken on the host and once a day sends an email to root@ due to the systemd timer failure.
CC @MoritzMuehlenhoff for awareness.
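(For reference, a package stuck in the "needs to be reinstalled" state can usually be cleared either by reinstalling the .deb or by force-removing it; a sketch, with a hypothetical .deb filename, since the exact recovery commands aren't recorded here:)

  # Reinstall from a local copy of the .deb (filename is illustrative) ...
  sudo dpkg -i airflow_2.0.2-1~py3.7_amd64.deb
  # ... or drop the half-installed package entirely and let apt settle
  sudo dpkg --remove --force-remove-reinstreq airflow
  sudo apt-get -f install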

Removed the airflow package for now! Sorry about that.

@Volans42 I've manually reinstalled our dev .deb on an-test-coord1001. What made it need to be reinstalled? I can remove it again if needed, but I'd also have to make a puppet patch.

I'm waiting for a review of https://gerrit.wikimedia.org/r/c/operations/debs/airflow/+/693222/ before I merge and build and upload to apt.

@Ottomata nothing is needed AFAICT, APT is happy again, thanks.

Change 693222 merged by Ottomata:

[operations/debs/airflow@debian] Initial debianization and 2.1.0-py3.7-1 release

https://gerrit.wikimedia.org/r/693222