
Create airflow instances for Platform Engineering and Research
Closed, ResolvedPublic

Description

In today's Analytics Systems hangtime meeting, we talked with @fkaelin and @gmodena about work they want to do with Airflow. I told them that Airflow is basically ready for testing, and we could create instances for them now if they liked. They do like! They understand that we are still iterating and figuring it out for ourselves too. It will do us all good to be able to work out best practices together.

Let's create instances for them now. We haven't done this before, so we'll likely need to formalize this process. It will be something like:

  1. Create new Ganeti VMs: an-airflow1002 (research), an-airflow1003 (platform eng).
  2. Create new system users: analytics-research, analytics-platform-eng. Declare these system users in profile::analytics::cluster::users. These users should also be added to admin data.yaml, but commented out until T231067 is complete (as other system users are). See analytics-search as an example.
  3. Create new user groups analytics-research-users and analytics-platform-eng-users, with the relevant users in members and the system user in system_members. Members of these groups should have sudo privileges to their system user. Also include the system users in the analytics-privatedata-users group. See analytics-search-users as an example. We'll also need to make sure users in these groups can manage airflow services; see airflow-search-admins for an example. Q: should these be -admins groups instead of -users groups? Perhaps if they need sudo privs they should be.
  4. Create kerberos principals and keytabs for these system users @ their airflow VM hostname, following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos#Create_a_keytab_for_a_service
  5. Create the airflow instances on the VMs, following https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#Creating_a_new_Airflow_Instance
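Sketched out, the admin data.yaml group entry from step 3 could look something like the following. This is a hypothetical fragment modeled on analytics-search-users; the GID and sudo rule are placeholders, not real values.

```yaml
# Hypothetical data.yaml fragment, modeled on analytics-search-users.
# GID and privileges are placeholders and would need real values.
analytics-research-users:
  gid: 1234
  description: Users allowed to run Airflow jobs as the analytics-research system user.
  members: []
  system_members: [analytics-research]
  privileges:
    - 'ALL = (analytics-research) NOPASSWD: ALL'
```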

Event Timeline

@Ottomata what is the plan for the related databases?

@elukey I think the thing to do for now is create them in analytics meta mariadb instance, and then refactor everything as part of T284150: Bring an-mariadb100[12] into service when we have new hardware.

We could consider hosting the mariadb instances on the ganeti VMs themselves. I think I'd rather have them on dedicated hardware in the same place, especially since replication setup is manual; I think it will be easier to manage that in one place.

In this case make sure to check the max-conns + innodb buffer for meta, I am pretty sure that there may be some tuning to do :)
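For the tuning check, standard MariaDB introspection queries on the meta instance would show the current settings, e.g.:

```sql
-- Current limits on the analytics meta instance
SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';
-- High-water mark for connections since the last restart
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
```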

Ottomata updated the task description.
Ottomata added a project: Analytics-Kanban.

@razzi, I think we can go ahead and create an-airflow1002 for platform eng. Could you make that happen? :) TY!

Oh right, there is more to do than just https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow#Creating_a_new_Airflow_Instance. I'll see if I can figure out some of the earlier system user steps now.

Change 701112 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] DRY profile::analytics::cluster::users

https://gerrit.wikimedia.org/r/701112

Change 701112 merged by Ottomata:

[operations/puppet@production] DRY profile::analytics::cluster::users

https://gerrit.wikimedia.org/r/701112

Hey yall, quick update: We haven't been working on this as the Gobblin migration has been taking a lot of my time (should be done soon), but also I'm hoping to wait until T278423 is 100% done (should be early next week). Once everything is on Buster, some of the system user account creation can be improved and made much easier from our side. We'll have to do that to make your airflow system users.

So, I hope to get to work on this again in the next couple of weeks.

Change 708159 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add system users and groups for Airflow for Research and Platform Eng

https://gerrit.wikimedia.org/r/708159

Change 708159 merged by Ottomata:

[operations/puppet@production] Add system users and groups for Airflow for Research and Platform Eng

https://gerrit.wikimedia.org/r/708159

Created kerberos principals and keytabs:

[@krb1001:/home/otto] $ cat airflow-keytabs.list
an-airflow1002.eqiad.wmnet,create_princ,analytics-research
an-airflow1002.eqiad.wmnet,create_keytab,analytics-research
an-airflow1003.eqiad.wmnet,create_princ,analytics-platform-eng
an-airflow1003.eqiad.wmnet,create_keytab,analytics-platform-eng

[@krb1001:/home/otto] $ sudo generate_keytabs.py --realm WIKIMEDIA airflow-keytabs.list
analytics-research/an-airflow1002.eqiad.wmnet@WIKIMEDIA

Created airflow databases on an-coord1001:

CREATE DATABASE airflow_research;
CREATE USER 'airflow_research' IDENTIFIED BY 'xxxxxx';
GRANT ALL PRIVILEGES ON airflow_research.* TO 'airflow_research';

CREATE DATABASE airflow_platform_eng;
CREATE USER 'airflow_platform_eng' IDENTIFIED BY 'xxxxxx';
GRANT ALL PRIVILEGES ON airflow_platform_eng.* TO 'airflow_platform_eng';
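The new databases and grants can be sanity-checked with standard MariaDB statements:

```sql
-- Confirm the databases and grants from above
SHOW DATABASES LIKE 'airflow%';
SHOW GRANTS FOR 'airflow_research';
SHOW GRANTS FOR 'airflow_platform_eng';
```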

Change 708583 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Install hadoop client on an-airflow1002

https://gerrit.wikimedia.org/r/708583

Change 708590 had a related patch set uploaded (by Ottomata; author: Ottomata):

[labs/private@master] Add dummy keytabs for analytics-research and analytics-platform-eng airflow

https://gerrit.wikimedia.org/r/708590

Change 708590 merged by Ottomata:

[labs/private@master] Add dummy keytabs for analytics-research and analytics-platform-eng airflow

https://gerrit.wikimedia.org/r/708590

Change 708583 merged by Ottomata:

[operations/puppet@production] Set up airflow-research instance on an-airflow1002

https://gerrit.wikimedia.org/r/708583

Change 708609 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Set up airflow@platform_eng instance on an-airflow1003

https://gerrit.wikimedia.org/r/708609

Change 708609 merged by Ottomata:

[operations/puppet@production] Set up airflow@platform_eng instance on an-airflow1003

https://gerrit.wikimedia.org/r/708609

Ok! @fkaelin @gmodena @Clarakosi I've set up airflow instances for you all! Consider them still a bit WIP and don't start to rely on them yet, but you can experiment and develop for sure!

We still need lots more documentation, but instructions for accessing your instances are here:

All the users in the platform-engineering posix group can log into the airflow1003.eqiad.wmnet instance.

@fkaelin, for now, I only added you in the new analytics-research-admins group. Let me know who else to add!

I need to follow up and make sure database backups and replication work properly, but I think I will do that as part of T284150: Bring an-mariadb100[12] into service once we get hardware.

@Ottomata - FYI I spotted this on an-test-coord1001 this morning.

Warning: /Stage[main]/Profile::Airflow/Airflow::Instance[analytics-test]/File[/srv/airflow-analytics-test]: Could not back up file of type directory
Notice: /Stage[main]/Profile::Airflow/Airflow::Instance[analytics-test]/File[/srv/airflow-analytics-test]: Not removing directory; use 'force' to override
Notice: /Stage[main]/Profile::Airflow/Airflow::Instance[analytics-test]/File[/srv/airflow-analytics-test]/ensure: removed (corrective)

I guess it's just trying to do an ensure absent on a directory here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/airflow/manifests/instance.pp#191

We could fix it for now by manually removing the directory, but I thought you might like to know about it for future reference.
I think it's probably OK just to add a force => true to the file resource as well.

Oh thanks, cool, that's from when we moved this instance over to an-test-client for kerberos reasons. Hm yeah let's add force => true. doing.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/709044
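For reference, the fix amounts to adding force => true to the file resource in modules/airflow/manifests/instance.pp. An abridged sketch (everything other than the force parameter is illustrative):

```puppet
# Abridged sketch: let puppet remove the old instance directory.
# Only the force parameter is the new part.
file { "/srv/airflow-${title}":
  ensure => 'absent',
  force  => true,
}
```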

Hey @Ottomata,

Many thanks for this! Just wanted to give an ack that login on the host worked.

Terrific!

All the users in the platform-engineering posix group can log into the airflow1003.eqiad.wmnet instance.

Minor thing; I guess the instance host is an-airflow1003.eqiad.wmnet (just for future reference).

Submitting a Spark job from an airflow instance results in a Hadoop/HDFS permission error: AccessControlException: Permission denied: user=analytics-research, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x.

The example job doesn't read or write from HDFS; presumably it is Spark attempting to copy temp files. Is this likely a submit configuration issue, or does this user need additional permissions and/or a home directory to run Spark/Hadoop jobs?

I'm a little fuzzy here, but I do know this is because there's no /user/analytics-research directory in HDFS, and that we can create such a thing, as we did for analytics-search for example. I'm not sure what the process is, so I made T290918.
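For reference, creating an HDFS home directory for a system user typically comes down to a couple of commands run with HDFS superuser rights. This is a hedged sketch: the group ownership here is a guess, and the actual WMF process is what T290918 is for.

```
# Sketch: create an HDFS home directory for the analytics-research
# system user. Group ownership is illustrative.
sudo -u hdfs hdfs dfs -mkdir /user/analytics-research
sudo -u hdfs hdfs dfs -chown analytics-research:analytics-research /user/analytics-research
```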

Change 721601 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add analytics-research and analytics-platform-eng to analytics-privatedata-users

https://gerrit.wikimedia.org/r/721601

Change 721601 merged by Ottomata:

[operations/puppet@production] Add analytics-research and analytics-platform-eng to analytics-privatedata-users

https://gerrit.wikimedia.org/r/721601