Ensure that the filesystem permissions are correctly configured when Airflow jobs interact with Hadoop and HDFS
Closed, Resolved · Public

Description

Our current Airflow instances each run as a particular system user, as specified in Puppet:

| Airflow instance | User name | User ID | Group name | Group ID |
|------------------|-----------|---------|------------|----------|
| analytics | analytics | 906 | analytics | 906 |
| analytics_test | analytics | 906 | analytics | 906 |
| search | analytics-search | 911 | analytics-search | 911 |
| research | analytics-research | 912 | analytics-research | 912 |
| platform_eng | analytics-platform-eng | 913 | analytics-platform-eng | 913 |
| analytics_product | analytics-product | 910 | analytics-product | 910 |
| wmde | analytics-wmde | 927 | analytics-wmde | 927 |

When the DAGs running on these instances interact with files on HDFS, with or without Hive or Presto as an intermediary, they do so as the configured system user.

This model of running Airflow ensures that any HDFS files generated are owned by that system user.
The group ownership of any file created is the same as the group of its parent directory; in most cases this group is analytics-privatedata-users. For example, a file written by the analytics user into a directory group-owned by analytics-privatedata-users ends up owned by analytics:analytics-privatedata-users.
See https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Overview for reference on this group inheritance:

When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule).

The permissions of the generated files are governed by the umask of the process. We have a default umask of 027 set in /etc/hadoop/conf/core-site.xml:

<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>

This means that files are created with mode 640: readable, but not writable, by other members of the group, and inaccessible to everyone else, which is the intended behaviour.
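As a quick sanity check on what 027 yields, here is a minimal Python sketch of the umask arithmetic:

```python
# New HDFS files start from 666 (no execute bit) and directories from 777;
# fs.permissions.umask-mode = 027 masks both.
UMASK = 0o027

file_mode = 0o666 & ~UMASK  # 0o640 -> rw-r-----: owner rw, group r, others nothing
dir_mode = 0o777 & ~UMASK   # 0o750 -> rwxr-x---: owner rwx, group rx, others nothing

print(oct(file_mode), oct(dir_mode))  # 0o640 0o750
```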

In addition to this, each Airflow instance has its own corresponding Kerberos principal, as follows:

btullis@krb1001:~$ sudo kadmin.local list_principals|egrep 'airflow|analytics/an-launcher1002'
analytics-platform-eng/an-airflow1004.eqiad.wmnet@WIKIMEDIA
analytics-product/an-airflow1006.eqiad.wmnet@WIKIMEDIA
analytics-research/an-airflow1002.eqiad.wmnet@WIKIMEDIA
analytics-search/an-airflow1001.eqiad.wmnet@WIKIMEDIA
analytics-search/an-airflow1005.eqiad.wmnet@WIKIMEDIA
analytics-wmde/an-airflow1007.eqiad.wmnet@WIKIMEDIA
analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA

When migrating Airflow to Kubernetes, we need to ensure that we maintain compatibility with this system, so that DAGs can continue to interact with Hadoop and HDFS in the same way that they do now.

We will be using the KubernetesExecutor, which means that only the pods launched by this executor need to have the correct credentials configured.
The scheduler itself will not need to run as the system user, since it will not interact with HDFS or any other Hadoop services directly.

This task is about deciding on a suitable method and implementing it.

There are some useful resources to consider:

This work will cross over with T375871: Integrate Airflow with Kerberos and its sub-tasks.

Event Timeline

BTullis triaged this task as High priority.
BTullis updated the task description.

I've been giving this a good think, and I believe I have a workable way forward.

One of the limiting factors that we have at the moment is the fact that Blubber does not provide a facility for creating multiple POSIX users within our containers.
By default, applications are installed as somebody and processes are executed by runuser.
These two usernames can be overridden by the .lives.as and .runs.as configuration keys, and we can set their numerical uid/gid values, but that's it: we can't easily run adduser and create multiple arbitrary users per container.
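For reference, these are the only per-user knobs Blubber exposes; this is a minimal sketch assuming the v4 syntax, with illustrative values (the analytics uid/gid from the table above), not our actual image configuration:

```yaml
# Illustrative Blubber (v4) fragment. The base image and install path
# are assumptions for the sketch.
version: v4
base: docker-registry.wikimedia.org/bullseye
lives:
  in: /srv/airflow
  as: somebody   # .lives.as: the user that owns the installed files
  uid: 1000
  gid: 1000
runs:
  as: runuser    # .runs.as: the user that executes the entrypoint
  uid: 906       # e.g. the analytics uid from the table above
  gid: 906
```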

My suggestion has two elements to it:

  1. We create multiple Airflow images to be used for the executor pods, one for each of the system users that runs an instance.

We can override .runs.as for each of these images, so we would have a specific image used by default for each instance's task pods.

  2. We update the base image so that .runs.as is airflow, and we grant this user permission to impersonate certain users in the Hadoop NameNode configuration (see the sketch below).
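For reference, Hadoop's standard impersonation mechanism is configured with proxyuser properties in core-site.xml on the NameNode. A hypothetical fragment follows; the airflow user name, group list, and host wildcard are assumptions, not our actual configuration:

```xml
<!-- Hypothetical: allow the airflow user to impersonate members of the
     listed groups from any host. All values are illustrative only. -->
<property>
  <name>hadoop.proxyuser.airflow.groups</name>
  <value>analytics,analytics-search,analytics-research</value>
</property>
<property>
  <name>hadoop.proxyuser.airflow.hosts</name>
  <value>*</value>
</property>
```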

Within a DAG we have the option to override the image that is used, so when we want to start experimenting with user impersonation, we can do so.
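As an illustration of that per-task override, the KubernetesExecutor honours an executor_config carrying a pod_override; the DAG id, schedule, and image name below are assumptions, not our actual values:

```python
# Sketch of a per-task image override under the KubernetesExecutor.
# The container must be named "base" so that it replaces the
# executor's main container.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def write_to_hdfs():
    print("This runs with the uid/gid baked into the overridden image.")


with DAG(dag_id="image_override_example", start_date=datetime(2024, 10, 1), schedule=None):
    PythonOperator(
        task_id="write_to_hdfs",
        python_callable=write_to_hdfs,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            # Hypothetical per-instance image variant:
                            image="docker-registry.wikimedia.org/airflow-analytics:latest",
                        )
                    ]
                )
            )
        },
    )
```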

In the meantime, the combination of these two measures gives us a relatively easy way forward that does not depend on user impersonation.

Change #1076764 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow overriding the airflow executor pod image

https://gerrit.wikimedia.org/r/1076764

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/17

Build multiple airflow images with specific uid/gid values for task runners

Change #1076764 merged by jenkins-bot:

[operations/deployment-charts@master] Allow overriding the airflow executor pod image

https://gerrit.wikimedia.org/r/1076764

Change #1076789 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use the latest airflow images by default

https://gerrit.wikimedia.org/r/1076789

Change #1076793 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Airflow: update the kerberosexecutor settings with specified image tags

https://gerrit.wikimedia.org/r/1076793

Change #1076793 merged by jenkins-bot:

[operations/deployment-charts@master] Airflow: update the kerberosexecutor settings with specified image tags

https://gerrit.wikimedia.org/r/1076793

Change #1076789 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use the latest airflow images by default

https://gerrit.wikimedia.org/r/1076789

Change #1076832 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use latest image

https://gerrit.wikimedia.org/r/1076832

Change #1076832 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use latest image

https://gerrit.wikimedia.org/r/1076832

Change #1076844 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: update PYTHONPATH and set executor_pod_image_version

https://gerrit.wikimedia.org/r/1076844

Change #1076844 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: update PYTHONPATH and set executor_pod_image_version

https://gerrit.wikimedia.org/r/1076844

Change #1077037 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use the analytics user for task executor pods

https://gerrit.wikimedia.org/r/1077037

Change #1077037 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use the analytics user for task executor pods

https://gerrit.wikimedia.org/r/1077037

This seems to be working so far.
I have overridden the image in the pod template for the test-k8s instance to use the analytics variant, and the Unix user id is reported correctly. The DAGs are still running.

[screenshot: image.png, 145 KB]

It might be beneficial to change the owner field of the DAGs too, but we will see.
[screenshot: image.png, 117 KB]

I'll mark this task as waiting, pending further work on T375871: Integrate Airflow with Kerberos, after which we should be able to start testing Skein jobs.

The Kerberos token renewer has now been deployed to our Airflow instances. We now need Spark/Skein support in order to test this; cf. T377928.

I think that we have done enough to say that we can resolve this ticket.
We have HDFS interaction working from an Airflow scheduler pod with the right uid/gid values, as demonstrated here: T377602#10295327.

If we discover any more issues with file system ownership, we can address them in subsequent tickets.