Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance
Closed, Resolved (Public)

Assigned To

Authored By

	elukey
	Jan 26 2021, 12:01 PM

Description

The Analytics team would like to have a dedicated Airflow instance to start experimenting with it, as the Discovery team has been doing so far. The goal would be to start playing with Airflow with different workloads to see requirements, missing pieces, etc., not to produce a final RFC about how multiple teams should use Airflow. That will come in a later step :)

It would be nice to re-use all the work done, generalizing it a little bit to avoid the Discovery/Search specific bits. This assumes that Airflow doesn't really handle multi-tenancy, especially in the context of various kerberos credentials (for example, running jobs as analytics-search vs analytics).

Overall steps:

Create a new VM called an-airflow1002 in Ganeti (specs to be decided, but something close to an-airflow1001 is probably fine as a starting point).
Generalize the gerrit search/airflow repository. It should contain only airflow-related things, but we may want to have something under the analytics/airflow namespace as well. The main thing to figure out, in my opinion, is how to handle different versions of Airflow in the same repo (master branch vs version-specific branches, etc.). We can also think about keeping the two repositories split for the moment. The Discovery team runs Airflow 1.10.6, but the Analytics team might want to jump directly to 2.0.0 (it is already available on PyPI).
Generalize the puppet code to avoid Discovery/Search specific bits; this shouldn't be too complicated.
Think about common plugins to share. IIUC the Discovery team has already started to create some for swift upload, etc., and it would be nice to share as much as possible :) This step can be done later on, but it would be good to start thinking about it as early as possible.

The final goal for this task is to have an-airflow1002 running (with the analytics user).

Details

Subject	Repo	Branch	Lines +/-
dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics	operations/puppet	production	+6 -1
airflow - Add support for configuring connections using LocalFilesystemBackend	operations/puppet	production	+95 -8
airflow::instance - allow access to API by default	operations/puppet	production	+5 -0
mariadb::instance - allow passing extra configs from hiera	operations/puppet	production	+28 -8
airflow - Expose admin details by default	operations/puppet	production	+5 -0
Set up airflow-analytics on an-launcher1002	operations/puppet	production	+26 -3
airflow-analytics-test - set dags folder to /srv/airflow-analytics-test-dags	operations/puppet	production	+4 -1
airflow - add clean_logs.sh script	operations/puppet	production	+29 -0
airflow - add clean_logs wrapper script	operations/puppet	production	+12 -1
Subscribe airflow-webserver to webserver_config.py	operations/puppet	production	+3 -4
airflow - webserver host default to localhost, Admin for public role	operations/puppet	production	+49 -11
airflow-analytics-test - set db_user	operations/puppet	production	+2 -18
airflow test - ensure analytics instance is absent, add analytics-test	operations/puppet	production	+36 -0
Add airflow/dags/hello_world.py	analytics/refinery	master	+18 -0
airflow - use airflow.cfg for webserver port	operations/puppet	production	+8 -8
Airflow puppetization + airflow@analytics on an-test-coord1001	operations/puppet	production	+540 -0
Refactor Discovery's analytics airflow to be more generic	operations/puppet	production	+40 -42

Related Objects

Status	Assigned	Task
Resolved	odimitrijevic	T282033 Airflow collaborations
Resolved	odimitrijevic	T271429 Replace Oozie with better workflow scheduler
Resolved	Ottomata	T272973 Generalize the current Airflow puppet/scap code to deploy a dedicated Analytics instance
Resolved	Ottomata	T277012 Create a debian package for Apache Airflow

Event Timeline

@Ottomata the main problem that I can see now is that multi-tenancy is not really something that Airflow does well (and the people from Polidea confirmed this), especially in the context of having multiple kerberos service principals. For example, we'll run our jobs using the analytics user and its kerberos keytab, but discovery uses analytics-search, and other teams will probably do the same (like product analytics, etc.). This is why I was thinking about the multi-vm/stack setup, but if there is a simpler option we can definitely go for it.

Concur with regard to multi-tenancy. I tried to set up our airflow initially in a way that used the builtin multi-tenancy, but as soon as I started integrating kerberos that had to be thrown out. I'm having a hard time describing why, but my general impression is that things will be simpler if teams with more than a handful of tasks have their own instance. Some things in airflow happen in the global namespace; for example, template variables must either be provided in the DAG() object instantiation or be globally defined in the airflow database (we load them from .json on scap deploy). Shared state that crosses repository boundaries seems like an undesirable property.
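To make the global-namespace concern concrete, here is a minimal sketch of a pre-deploy check that flags variable names defined by more than one repository. It assumes each repository ships one JSON object of variables; the function name, owner labels, and values are hypothetical and not part of any existing deploy tooling.

```python
import json
from collections import defaultdict


def find_variable_collisions(variable_files):
    """Return {variable_name: [owners, ...]} for names defined by more than one owner.

    `variable_files` maps a hypothetical owner label (a team or repository name)
    to the JSON text of that owner's variables file.
    """
    owners_by_name = defaultdict(list)
    for owner, json_text in variable_files.items():
        for name in json.loads(json_text):
            owners_by_name[name].append(owner)
    return {name: owners for name, owners in owners_by_name.items() if len(owners) > 1}


# Hypothetical variable files from two repositories sharing one Airflow database.
example = {
    "search": '{"base_path": "/example/search", "search_only": "x"}',
    "analytics": '{"base_path": "/example/analytics"}',
}
collisions = find_variable_collisions(example)
```

With a shared instance, the second file loaded would silently overwrite `base_path` for the first team; with one instance per team, the two definitions never meet.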

@EBernhardson thanks a lot for the feedback, we really need some experience in operating Airflow!

Let's assume for a moment that we solve the shared state issue (even if it sounds a little complicated). https://airflow.apache.org/docs/apache-airflow/stable/security/kerberos.html highlights that we could use an airflow kerberos service principal and set hadoop to allow it to act as a proxy (so that it is able to impersonate users). This is how hive/oozie/etc. work at the moment, but there is a clear way for the user to authenticate (namely via CLI tools). How does it work for Airflow? For example, how is a new DAG/job scheduled to run? Because if we don't have good support for authenticating users, the proxy thing is not helpful.

• fdans triaged this task as High priority.Jan 28 2021, 5:57 PM

• fdans moved this task from Incoming to Operational Excellence on the Analytics board.

• fdans added a project: Analytics-Kanban.

How does it work for Airflow? For example, how is a new DAG/job scheduled to run? Because if we don't have a good support for authenticating users the proxy thing is not helpful.

A new dag is scheduled by placing a python file in a configured directory (typically $AIRFLOW_HOME/dags) that creates DAG() objects. There isn't any kind of submission process or anything like that. I do see references online to other groups doing multi-tenancy (there is an Airflow Summit 2020 talk from EA about it), but in a quick look through the slides I didn't see anything about how they are restricting credentials. That specific presentation shows they implemented multi-tenancy by having each team commit to their own repository, and then the data team has some process that unifies and ships those repositories to airflow. I suppose somewhere in that process you could enforce some sort of conventions or requirements around auth.
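The "no submission process" point can be illustrated with a simplified directory scan. This is only a sketch loosely modeled on the scheduler's file discovery (skipping files that mention neither "airflow" nor "dag"), not Airflow's actual code; the function name and the example files are hypothetical.

```python
import tempfile
from pathlib import Path


def candidate_dag_files(dags_folder):
    """List names of .py files under dags_folder that mention both "airflow" and "dag".

    Files are only read as text here; nothing is imported or executed.
    """
    candidates = []
    for path in sorted(Path(dags_folder).rglob("*.py")):
        content = path.read_text(encoding="utf-8", errors="ignore").lower()
        if "airflow" in content and "dag" in content:
            candidates.append(path.name)
    return candidates


with tempfile.TemporaryDirectory() as dags_folder:
    # A file that would define a DAG once the scheduler imports it.
    Path(dags_folder, "hello_world.py").write_text(
        "from airflow import DAG\ndag = DAG('hello_world')\n"
    )
    # A helper module without any DAG definition is skipped.
    Path(dags_folder, "utils.py").write_text("def helper():\n    return 1\n")
    found = candidate_dag_files(dags_folder)
```

Anyone who can write a file into that folder can therefore schedule work as whatever user the scheduler runs as, which is why write access to the folder is the real authorization boundary.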

@EBernhardson thanks a ton again for your insights, I watched quickly https://www.youtube.com/watch?v=u00wmcHe8ow (Airflow Summit 2020 EA) and I have a lot of thoughts now :)

As far as I get, deploying a new job/dag and managing it boils down to two macro things:

deploy the DAG to the Airflow instance
start/stop/etc.. DAGs via UI (or simply inspect them)

I'll try to break down the different deployment approaches and my understanding of them.

Single instance, multiple teams

In this case, the Airflow instance would run with the kerberos service user airflow and hadoop would be set to allow it to act as proxy on behalf of users. This means that any change to the repository containing DAGs should be vetted by the Analytics team, since anybody could potentially run a DAG with users that they shouldn't be able to manage (say running by mistake a DAG action as analytics or hdfs or any other system user that the user has no control over). There will also be the challenge of maintaining a global namespace, as Erik mentioned (plus a common set of plugins etc.. but that is something that we'll want to do anyway). As Erik mentioned EA allows people to deploy in separate repos, but I am not sure how they enforce what kind of privileges a single DAG can have (I don't think that they use kerberos).
The other thing to discuss is access to the Airflow UI. There seems to be SAML/SSO support (that should fit nicely with our CAS setup) so in theory we could expose something like airflow.wikimedia.org and allow people to login via CAS as we do for other UIs, and then use the RBAC settings to limit what users can and cannot do.

Downsides:

Vetting changes to the airflow repository/repositories for DAG changes limits the ability of individual teams to manage their jobs independently, and it adds work for the Analytics team's members.
Maintaining the RBAC rules could be a little cumbersome, but maybe we could end up with a decent solution that doesn't require a ton of work to manage.
IIUC the Airflow scheduler requires more and more resources as the number and complexity of DAGs increase.
Maintenance on the Airflow host would require stopping jobs for multiple teams (so to avoid pinging people every time, we'd need some convention stating that jobs can be stopped at any time, etc.).
Resources needed by the single Airflow node could grow over time due to the scheduler's needs, requiring a single giant host with a ton of RAM and cores (maybe more the latter).

Good:

Single Airflow instance to maintain, no custom stacks for teams.
UI integration with CAS/SAML could be very nice (even if it would expose Airflow to the Internet, so 2FA would be required for it in my opinion).

Multiple instances, one for each team

In this case, the Analytics team would provide a single Airflow stack (namely a VM running Airflow + a MariaDB instance) for each team that needs one. The Airflow instances would run as the Kerberos system user assigned to the team (analytics, analytics-product, analytics-search, etc.) and every instance would grant ssh access to members of a single team (plus the Analytics team members of course). Access to the UI would be unrestricted, since we'd authenticate people via ssh.
Deploying a DAG change would be safe without any review by the Analytics team, since Airflow would run as a specific system user that is not able to proxy as anyone else. This is basically the Discovery team's setup :)

Downsides:

Work on the Analytics team side to automate the creation of Airflow stacks (and to manage them). For example, any team specific database would need to be replicated to db1108 and backed up. With a little effort all this should be reasonably easy, but it would require some work (on the order of a couple of hours per stack, more or less). I don't foresee 100 Airflow stacks but possibly 3 or 4, so it should be manageable long term (famous last words).

Good:

Team independence in deploying DAGs.
Simpler security management (UI + Scheduler)
Less impact when a single stack needs maintenance (reboots etc..) and also the possibility to upgrade Airflow incrementally on separate stacks.
More controlled use of resources, reviewing what is needed for each team's stack. Purchase of new hardware per team if needed, rather than having a single giant Airflow host to run on.
Less impact when a change to a stack impairs Airflow, since it will not affect other stacks.

Preliminary thoughts

The multi-instance setup seems, so far, a more isolated and "secure" approach in my opinion, but it would mean a little more work on the Analytics team's shoulders (that can be automated of course). On the performance side, I am still not sure how Airflow scales horizontally (for example via Celery), so this needs to be followed up as well (we could have a shared Celery workers pool for all the stacks in the future if needed, but I need to check).

Please let me know your thoughts, our use case is different from others that I read on the Internet, but I am surely missing something. The more we discuss pros/cons the better :)

Thanks @elukey for the very interesting write-up.
I agree with the idea of having airflow instances per team. One nice thing about that approach is that it facilitates reusing the same scheduler for things other than analytics (computation power to be kept in mind).
The downside of having to use SSH could, in the long term, be solved by having multiple airflow-behind-CAS URLs, particularly if CAS-2FA becomes mandatory and the number of airflow instances is stable.
The question of the computation power (CPU + RAM) needed per team, and how to scale airflow outside of the Analytics cluster, will be of interest I'm sure :)

fkaelin added a subscriber: gmodena.Feb 1 2021, 1:41 PM

fkaelin subscribed.

Awesome summary @elukey Thanks!

I'd add 1 thing to the 'Downsides' of single-instance approach:

If a dynamic DAG breaks Airflow services, it breaks all other teams' jobs.

And 1 thing to the 'Goods' of multi-instance approach:

This approach is more in line with the Data-Engineering-as-a-platform idea that we were talking about recently in our team's meeting.

• EYener subscribed.Feb 2 2021, 8:30 PM

Add 1 thing to the 'Downsides' of multi-instance approach:

When doing maintenance, the Data Engineering team will have to wrangle multiple airflow instances to stop jobs.

Overall, I think the multi-instance approach could make sense here. It's too bad though, something feels wrong about this. Yes we want data-eng-as-a-service, but it's not like everybody has their own Hadoop cluster. It'd make more sense if our job scheduler itself was better about multi-tenancy and isolation. Oozie doesn't have this problem (but maybe that is just because it is less powerful/flexible?).

It just makes me wonder...is airflow really the right thing? Is good multi-tenancy a requirement we want in our job scheduler? It seems pretty important. If Airflow can't do it, is there something else that can? I can't say I understand all the issues here, but is multi-tenancy something that Kubernetes + Airflow could help with?

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.Feb 12 2021, 7:19 AM

Reporting on some chats that we have been having:

I followed up with upstream to see if there is a multi-tenancy solution different from the ones listed above, but they said no :(
As a team we decided to start with one Analytics-team-only Airflow instance, to start testing and working on it.
Kubernetes is definitely the long term step.
We'll set up a meeting for people interested in Airflow in a few weeks.

ttaylor subscribed.Feb 24 2021, 8:42 PM

Some updates: @razzi is going to take over the work during the next quarter :)

• razzi moved this task from In Progress to Next Up on the Analytics-Kanban board.Apr 6 2021, 7:31 PM

elukey mentioned this in T280905: Analytics coordinator failover improvements.Apr 22 2021, 7:12 AM

Milimetric added a parent task: T282033: Airflow collaborations.May 5 2021, 6:10 PM

Airflow 2 supports HA scheduler: https://airflow.apache.org/docs/apache-airflow/stable/scheduler.html#running-more-than-one-scheduler

but it won't work with MariaDB. Would it be worth considering using another database? PostgreSQL?

Oh, perhaps MariaDB will work for Airflow HA now that https://jira.mariadb.org/browse/MDEV-13115 is resolved. Would need a pretty recent MariaDB version though.

Ottomata added a parent task: T271429: Replace Oozie with better workflow scheduler.May 19 2021, 8:18 PM

Airflow 2 .deb proceeding in T277012. This .deb will only install a python environment with all the Airflow dependencies. Everything else will need to be configured via puppet.

Things to figure out:

Airflow database

Need to configure mysql usage. And also investigate if we can do Airflow HA with our version of MySQL.

DAG dir and distribution

We'll need to set a directory in which airflow scheduler will look for DAG files. Perhaps we can just add an airflow/dags directory in refinery and configure airflow scheduler to look there?

Integrations

hdfs, hive, presto, spark, sqoop, cassandra, druid. The dependencies all come as airflow extras and will be included in the .deb. I think that they won't need much system configuration; hopefully the configuration will just be DAG specific.

Kerberos

Looks like we need to run a separate ticket renewing service: airflow kerberos
https://airflow.apache.org/docs/apache-airflow/2.1.0/security/kerberos.html

Although, after reading https://airflow.apache.org/docs/apache-airflow/2.1.0/production-deployment.html#kerberos-authenticated-workers, perhaps the renewer service is only needed for authenticated workers? If all workers are local (which I think we are going to do for now), this might not be needed.

Also, it seems that there is some ability for Airflow's user to submit tasks to Hadoop proxied as other users (as Presto does). Perhaps this could help with multitenancy if we are also able to authenticate those users via LDAP or ssh somehow?

Logging

I guess logs should be pretty easily viewable via the Airflow web UI. If we do HA though, I'm not sure how that works. We might need to do some custom logging setup, maybe through logstash? https://airflow.apache.org/docs/apache-airflow/2.1.0/logging-monitoring/logging-tasks.html

We can probably figure this out and improve later.

Metrics

Airflow has built in support for statsd but we don't really use statsd anymore. We could if we realllly want to run a prometheus-statsd-exporter as a bridge from Airflow's statsd -> prometheus. Better would be to use a built in prometheus exporter. I found two. robinhood/airflow-prometheus-exporter looks like maybe it has a few more stats, but is also over a year old and might not work well with Airflow 2. epoch8/airflow-exporter is newer and should work, but maybe doesn't have scheduler metrics?

Alerting

Aside from the usual process monitoring, we should set up this health check too.
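Assuming the health check referred to here is the Airflow webserver's /health endpoint, which returns a JSON document with a status for the metadatabase and the scheduler, a monitoring probe could evaluate the response roughly as follows. This is a sketch only: the function name and the OK/CRITICAL labels are hypothetical, and fetching the payload (for example with urllib.request against the instance's webserver URL) is left out.

```python
import json


def evaluate_health(payload_text):
    """Map an Airflow /health JSON payload to a (state, message) tuple.

    state is "OK" only when both components report "healthy"; a missing
    component is treated as unhealthy.
    """
    payload = json.loads(payload_text)
    unhealthy = [
        component
        for component in ("metadatabase", "scheduler")
        if payload.get(component, {}).get("status") != "healthy"
    ]
    if unhealthy:
        return "CRITICAL", "unhealthy components: " + ", ".join(unhealthy)
    return "OK", "metadatabase and scheduler are healthy"
```

The scheduler status is derived from its latest heartbeat, so this catches a stalled scheduler even when the process itself is still running, which plain process monitoring would miss.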

Dask executors?

We might be able to run Dask on Yarn and use it for remote Airflow executors. Not sure though. Let's figure this out later on after we get the initial stuff up and running.

Data governance?

TBD, but Airflow has integration with Atlas, allowing it to post information about jobs and data lineage.

Ottomata updated the task description. (Show Details)May 21 2021, 3:20 PM

There's a bunch of puppet code already that does the above, but I think we should start from scratch, since we'll be using a new package and new version of Airflow. Probably lots of stuff could be copy-pasted in from the existing airflow and profile::airflow classes. Perhaps for us we'll make airflow2* classes for now.

@Ottomata thanks for the summary & overview of the .deb status.

We have a dependency on the Papermill operator, which requires apache-airflow-providers-papermill to be installed atop airflow.

Would it be possible to include this dependency in the list of integrations for the .deb? Or would this require a dedicated package?

Would it be possible to include this dependency to the list of integrations for the .deb?

Sure will, just added it: https://gerrit.wikimedia.org/r/c/operations/debs/airflow/+/693222/9/build/versions.sh#20

Change 694514 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] [WIP] airflow 2

https://gerrit.wikimedia.org/r/694514

gerritbot added a project: Patch-For-Review.May 25 2021, 2:56 PM

@elukey @razzi

Airflow deb and puppetization patches are ready for review!

I'm sure this will take more iteration, but I'd like to apply this on an-test-coord1001 soon and continue iteration there.

@Marostegui mysql Q for ya:

https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#setting-up-a-mysql-database

Recommends that we set explicit_defaults_for_timestamp=1. Would that be OK for our 'analytics-meta' MariaDB instance on an-coord1001? And, if we set it there, do we need to set it on its replicas too? For sure on the failover replica on an-coord1002, but what about the backup replica?

That variable is deprecated in MySQL, so you probably don't want to use it.

From reading https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_explicit_defaults_for_timestamp and https://mariadb.com/docs/reference/mdb/system-variables/explicit_defaults_for_timestamp/, I think Airflow is trying to enable the behavior that will be the default once explicit_defaults_for_timestamp variable is officially removed. That is, by turning it on, they enable the explicit behavior that a future MySQL version will make the default. The default value of that variable is currently OFF (right?) so without turning it on it will use the older more implicit timestamp defaults, which is what Airflow is trying to avoid.

The docs aren't super clear about this so I could be way off.

Yeah, my point is that if you use it now, if you get to upgrade mysql (depending on how hard they remove it) your server might not start again and you'll need to change the config to remove it.
Do you currently use a dedicated backup replica on your side or do you use the ones we (as DBA team) provide?

Luca is the best for docs: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Mysql_Meta#Backup

So DBA provided: db1108

you'll need to change the config to remove it.

Yeah makes sense. I could add a big ol comment around it about this.

db1108 is owned by Analytics yeah. So if you change it on the master, I would recommend changing it everywhere where this database is replicated to.

Change 694514 merged by Ottomata:

[operations/puppet@production] Airflow puppetization + airflow@analytics on an-test-coord1001

https://gerrit.wikimedia.org/r/694514

Change 696607 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow - use airflow.cfg for webserver port

https://gerrit.wikimedia.org/r/696607

Change 696607 merged by Ottomata:

[operations/puppet@production] airflow - use airflow.cfg for webserver port

https://gerrit.wikimedia.org/r/696607

Maintenance_bot removed a project: Patch-For-Review.May 27 2021, 7:11 PM

Change 697600 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add airflow/dags/hello_world.py

https://gerrit.wikimedia.org/r/697600

Change 697600 merged by Ottomata:

[analytics/refinery@master] Add airflow/dags/hello_world.py

https://gerrit.wikimedia.org/r/697600

Mentioned in SAL (#wikimedia-operations) [2021-06-01T13:45:48Z] <otto@deploy1002> Started deploy [analytics/refinery@c0a02e5] (hadoop-test): deploy to an-test-coord1001 to get airflow/dags/hello_world.py - T272973

Mentioned in SAL (#wikimedia-operations) [2021-06-01T13:48:46Z] <otto@deploy1002> Finished deploy [analytics/refinery@c0a02e5] (hadoop-test): deploy to an-test-coord1001 to get airflow/dags/hello_world.py - T272973 (duration: 02m 58s)

Change 697603 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow test - ensure analytics instance is absent, add analytics-test

https://gerrit.wikimedia.org/r/697603

Change 697603 merged by Ottomata:

[operations/puppet@production] airflow test - ensure analytics instance is absent, add analytics-test

https://gerrit.wikimedia.org/r/697603

Maintenance_bot removed a project: Patch-For-Review.Jun 1 2021, 2:11 PM

Change 697607 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow-analytics-test - set db_user

https://gerrit.wikimedia.org/r/697607

gerritbot added a project: Patch-For-Review.Jun 1 2021, 2:19 PM

Change 697607 merged by Ottomata:

[operations/puppet@production] airflow-analytics-test - set db_user

https://gerrit.wikimedia.org/r/697607

Ottomata claimed this task.Jun 1 2021, 3:09 PM

Change 697615 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow - webserver host default to localhost, Admin for public role

https://gerrit.wikimedia.org/r/697615

Change 697615 merged by Ottomata:

[operations/puppet@production] airflow - webserver host default to localhost, Admin for public role

https://gerrit.wikimedia.org/r/697615

Change 697617 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Subscribe airflow-webserver to webserver_config.py

https://gerrit.wikimedia.org/r/697617

Change 697617 merged by Ottomata:

[operations/puppet@production] Subscribe airflow-webserver to webserver_config.py

https://gerrit.wikimedia.org/r/697617

Change 697618 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] mariadb::instance - allow passing extra configs from hiera using default $template

https://gerrit.wikimedia.org/r/697618

FYI T283856: Airflow filled disks after losing connection to sql server is a bug we will likely encounter in Airflow 2, whatever fix happens there should be applied here too.

Change 697643 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow - add clean_logs wrapper script

https://gerrit.wikimedia.org/r/697643

Change 697643 merged by Ottomata:

[operations/puppet@production] airflow - add clean_logs wrapper script

https://gerrit.wikimedia.org/r/697643

Change 697644 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow - add clean_logs.sh script

https://gerrit.wikimedia.org/r/697644

Change 697644 merged by Ottomata:

[operations/puppet@production] airflow - add clean_logs.sh script

https://gerrit.wikimedia.org/r/697644

Change 697653 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Set up airflow-analytics on an-launcher1002

https://gerrit.wikimedia.org/r/697653

Change 697841 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow-analytics-test - set dags folder to /srv/airflow-analytics-test-dags

https://gerrit.wikimedia.org/r/697841

Change 697841 merged by Ottomata:

[operations/puppet@production] airflow-analytics-test - set dags folder to /srv/airflow-analytics-test-dags

https://gerrit.wikimedia.org/r/697841

Mentioned in SAL (#wikimedia-analytics) [2021-06-03T15:20:26Z] <ottomata> created airflow_analytics database and user on an-coord1001 analytics-meta instance - T272973

Change 697992 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] dumps-eqiad-analytics_meta.sql.erb - add grants for new airflow_analytics database

https://gerrit.wikimedia.org/r/697992

Change 697653 merged by Ottomata:

[operations/puppet@production] Set up airflow-analytics on an-launcher1002

https://gerrit.wikimedia.org/r/697653

Change 697999 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow - Expose admin details by default

https://gerrit.wikimedia.org/r/697999

Change 697999 merged by Ottomata:

[operations/puppet@production] airflow - Expose admin details by default

https://gerrit.wikimedia.org/r/697999

We might be able to run Dask on Yarn and use it for remote Airflow executors

@JAllemandou I just tried this and got very close, but no luck. It would be possible to automate setting up a Dask in Yarn cluster and configuring airflow scheduler to use it. However, the issue is distributing the dags and airflow configs. For any remote executors, the dags and configs must be in sync. Unless there is a way to set AIRFLOW_HOME to hdfs and sync the airflow instance's configs and dags to HDFS, I don't see a clean way of doing remote executors. This would be a problem with celery too.
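One way to check the "in sync" requirement, whatever mechanism ends up distributing the files, is to compare a content digest of the DAG folder on each host. The sketch below is stdlib-only and purely illustrative; the function name is hypothetical and the temporary folders stand in for a scheduler host and a remote worker.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root):
    """Return a SHA-256 hex digest over the relative paths and contents of files under root."""
    digest = hashlib.sha256()
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renames change the digest too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as scheduler_dir, tempfile.TemporaryDirectory() as worker_dir:
    for folder in (scheduler_dir, worker_dir):
        Path(folder, "hello_world.py").write_text("print('hello')\n")
    in_sync = tree_digest(scheduler_dir) == tree_digest(worker_dir)

    # Changing a file on one side makes the digests diverge.
    Path(worker_dir, "hello_world.py").write_text("print('changed')\n")
    out_of_sync = tree_digest(scheduler_dir) != tree_digest(worker_dir)
```

This only detects drift; it does not solve the distribution problem described above.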

Will look into either local celery or local dask executors.

@mforns I just tried your pyarrow + hdfs + multiprocessing test using the version of pyarrow that comes with airflow now, and it works!

I also tried with the newer (non deprecated) pyarrow.fs.HadoopFileSystem, and it works great too!

from pyarrow.fs import HadoopFileSystem
import time
import random

if __name__ == '__main__':
    time.sleep(random.randint(1, 10))
    hdfs = HadoopFileSystem.from_uri('hdfs://analytics-test-hadoop/')
    time.sleep(random.randint(1, 10))
    f = hdfs.get_file_info('/user/mforns')
    print(f)

I think LocalExecutor will work if this works, right?

Oh, @Ottomata, I think that was the code that we wrote to try and reproduce the error.
We could not, meaning this test was passing fine while Airflow was failing.
That led us to believe that the problem was Airflow serialization, which is the only part the test script does not reproduce.
When we met the Airflow committers, they acknowledged the problem, attributed it to the LocalExecutor, and recommended using Celery.
We can try to reproduce it within Airflow now, and even try to fix it if it's still failing, though?!

I'd like to try to reproduce with LocalExecutor now if we can, just to be sure.

Change 697618 merged by Ottomata:

[operations/puppet@production] mariadb::instance - allow passing extra configs from hiera

https://gerrit.wikimedia.org/r/697618

Mentioned in SAL (#wikimedia-sre) [2021-06-07T16:49:50Z] <ottomata> restarting mysqld analytics-meta replica on db1108 to apply config change - T272973

Mentioned in SAL (#wikimedia-analytics) [2021-06-07T16:50:50Z] <ottomata> restarting mysqld analytics-meta replica on db1108 to apply config change - T272973

Ottomata moved this task from Next Up to Ready to Deploy on the Analytics-Kanban board.Jun 7 2021, 5:09 PM

Things to figure out update:

Airflow database

Done. No HA at this time.

DAG dir and distribution

We'll need to set a directory in which airflow scheduler will look for DAG files. Perhaps we can just add an airflow/dags directory in refinery and configure airflow scheduler to look there?

This will be determined per instance. For now we are using refinery/airflow/dags for the analytics instance.

Integrations

Included in the .deb, still to test them all.

Kerberos

Done, running airflow-kerberos with each instance.

Logging

Logs should be available locally in the logs_folder, and also viewable using the web UI.

Metrics

I installed epoch8/airflow-exporter with the .deb, and some metrics are exposed, but I don't see any airflow specific ones yet. TBD.

Alerting

Set up the health check via puppet, but haven't enabled monitoring anywhere else to check yet.

Dask executors?

Not doing for now.

Data governance?

Still TBD, but outside of scope for this task.

Hey @Ottomata

DAG dir and distribution

We'll need to set a directory in which airflow scheduler will look for DAG files. Perhaps we can just add an airflow/dags directory in refinery and configure airflow scheduler to look there?
This will be determined per instance. For now we are using refinery/airflow/dags for the analytics instance.

Look forward to experimenting with this, and happy you left room for instance specific implementation details.

Logging

Logs should be available locally in the logs_folder, and also viewable using the web UI.

Do you plan on setting up a log shipper to ELK? Or would you rather keep local as default, and let log handling/forwarding up to instance owners?

Metrics

I installed epoch8/airflow-exporter with the .deb, and some metrics are exposed, but I don't see any airflow specific ones yet. TBD.

This is really cool.

Do you plan on setting up a log shipper to ELK?

I had not planned on it, but I suppose we could! All logs will be on one node for now, so it didn't seem to be a pressing need. If we ever do parallelize with remote executors somehow, then logstash would be more useful for sure.

Oh, another TODO: I think we'll need to puppetize a Local Filesystem Secrets Backend and define connections there. That way we can manage them with Puppet rather than via the web UI.
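For context, Airflow's LocalFilesystemBackend is enabled in the [secrets] section of airflow.cfg (backend = airflow.secrets.local_filesystem.LocalFilesystemBackend, with backend_kwargs holding a connections_file_path) and can read a JSON file mapping connection ids to connection URIs. A minimal sketch of rendering such a file follows; the function name, connection id, and URI are placeholders, and how Puppet would actually template the file is not shown.

```python
import json


def render_connections_file(connections):
    """Serialize {conn_id: connection URI} to JSON text, sorted for stable diffs."""
    return json.dumps(connections, indent=2, sort_keys=True) + "\n"


# Hypothetical connection; the id and URI are placeholders, not real hosts.
connections = {"example_mysql": "mysql://db.example.org:3306/example"}
rendered = render_connections_file(connections)
```

Because the file is read from disk rather than stored in the Airflow database, connections defined this way live in configuration management and are not created or edited through the web UI.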

Change 698808 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow - Add support for configuring connections using LocalFilesystemBackend

https://gerrit.wikimedia.org/r/698808

Ottomata closed subtask T277012: Create a debian package for Apache Airflow as Resolved.Jun 9 2021, 5:23 PM

Change 699968 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] airflow::instance - allow access to API by default

https://gerrit.wikimedia.org/r/699968

Change 699968 merged by Ottomata:

[operations/puppet@production] airflow::instance - allow access to API by default

https://gerrit.wikimedia.org/r/699968

Ottomata moved this task from Ready to Deploy to Done on the Analytics-Kanban board.Jun 21 2021, 3:06 PM

mforns moved this task from Operational Excellence to Airflow on the Analytics board.Jun 25 2021, 3:49 PM