Ensure that the filesystem permissions are correctly configured when Airflow jobs interact with Hadoop and HDFS
Closed, Resolved · Public

Description

Our current Airflow instances each run as a particular system user, as specified in Puppet:

| Airflow instance | User name | User ID | Group name | Group ID |
|------------------|-----------|---------|------------|----------|
| analytics | analytics | 906 | analytics | 906 |
| analytics_test | analytics | 906 | analytics | 906 |
| search | analytics-search | 911 | analytics-search | 911 |
| research | analytics-research | 912 | analytics-research | 912 |
| platform_eng | analytics-platform-eng | 913 | analytics-platform-eng | 913 |
| analytics_product | analytics-product | 910 | analytics-product | 910 |
| wmde | analytics-wmde | 927 | analytics-wmde | 927 |

When the DAGs running on these instances interact with files on HDFS, with or without Hive or Presto as an intermediary, they do so as the configured system user.

This model of running Airflow ensures that any HDFS files generated are owned by that system user.
The group ownership of any file created is the same as the group of its parent directory; in most cases this group is analytics-privatedata-users. For example, a file written by the analytics user into a directory group-owned by analytics-privatedata-users ends up owned by analytics:analytics-privatedata-users.
See https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Overview for reference on this group inheritance:

When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule).

The permissions of the generated files are governed by the umask of the process. We have a default umask of 027 set in /etc/hadoop/conf/core-site.xml:

<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>

This means that files are created with mode 640: readable, but not writable, by other members of the group, and inaccessible to everyone else, which is the intended behaviour.
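As a quick sanity check on what 027 yields, here is a minimal Python sketch of the umask arithmetic:

```python
# New HDFS files start from 666 (no execute bit) and directories from 777;
# fs.permissions.umask-mode = 027 masks both.
UMASK = 0o027

file_mode = 0o666 & ~UMASK  # 0o640 -> rw-r-----: owner rw, group r, others nothing
dir_mode = 0o777 & ~UMASK   # 0o750 -> rwxr-x---: owner rwx, group rx, others nothing

print(oct(file_mode), oct(dir_mode))  # 0o640 0o750
```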

In addition to this, each Airflow instance has its own corresponding Kerberos principal, as follows:

btullis@krb1001:~$ sudo kadmin.local list_principals|egrep 'airflow|analytics/an-launcher1002'
analytics-platform-eng/an-airflow1004.eqiad.wmnet@WIKIMEDIA
analytics-product/an-airflow1006.eqiad.wmnet@WIKIMEDIA
analytics-research/an-airflow1002.eqiad.wmnet@WIKIMEDIA
analytics-search/an-airflow1001.eqiad.wmnet@WIKIMEDIA
analytics-search/an-airflow1005.eqiad.wmnet@WIKIMEDIA
analytics-wmde/an-airflow1007.eqiad.wmnet@WIKIMEDIA
analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA

When migrating Airflow to Kubernetes, we need to ensure that we maintain compatibility with this system, so that DAGs can continue to interact with Hadoop and HDFS in the same way that they do now.

We will be using the KubernetesExecutor, which means that only the pods launched by this executor need to have the correct credentials configured.
The scheduler itself will not need to run as the system user, since it will not interact with HDFS or any other Hadoop services directly.

This task is about deciding on a suitable method and implementing it.

There are some useful resources to consider:

This work will cross over with T375871: Integrate Airflow with Kerberos and its sub-tasks.

Event Timeline

BTullis triaged this task as High priority.
BTullis updated the task description.

I've been giving this a good think, and I believe I have a workable way forward.

One of the limiting factors that we have at the moment is the fact that Blubber does not provide a facility for creating multiple POSIX users within our containers.
By default, applications are installed as somebody and processes are executed by runuser.
These two usernames can be overridden by the .lives.as and .runs.as configuration keys, and we can set their numerical uid/gid values, but that's it: we can't easily run adduser and create multiple arbitrary users per container.
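For reference, these are the only per-user knobs Blubber exposes; this is a minimal sketch assuming the v4 syntax, with illustrative values (the analytics uid/gid from the table above), not our actual image configuration:

```yaml
# Illustrative Blubber (v4) fragment. The base image and install path
# are assumptions for the sketch.
version: v4
base: docker-registry.wikimedia.org/bullseye
lives:
  in: /srv/airflow
  as: somebody   # .lives.as: the user that owns the installed files
  uid: 1000
  gid: 1000
runs:
  as: runuser    # .runs.as: the user that executes the entrypoint
  uid: 906       # e.g. the analytics uid from the table above
  gid: 906
```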

My suggestion has two elements to it:

  1. We create multiple Airflow images to be used for the executor pods, one for each of the system users that runs an instance.

We can override .runs.as for each of these images, so we would have a specific image used by default for each instance's task pods.

  2. We update the base image so that .runs.as is airflow, and we grant this user permission to impersonate certain users in the Hadoop NameNode configuration (see the sketch below).
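For reference, Hadoop's standard impersonation mechanism is configured with proxyuser properties in core-site.xml on the NameNode. A hypothetical fragment follows; the airflow user name, group list, and host wildcard are assumptions, not our actual configuration:

```xml
<!-- Hypothetical: allow the airflow user to impersonate members of the
     listed groups from any host. All values are illustrative only. -->
<property>
  <name>hadoop.proxyuser.airflow.groups</name>
  <value>analytics,analytics-search,analytics-research</value>
</property>
<property>
  <name>hadoop.proxyuser.airflow.hosts</name>
  <value>*</value>
</property>
```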

Within a DAG we have the option to override the image that is used, so when we want to start experimenting with user impersonation, we can do so.
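As an illustration of that per-task override, the KubernetesExecutor honours an executor_config carrying a pod_override; the DAG id, schedule, and image name below are assumptions, not our actual values:

```python
# Sketch of a per-task image override under the KubernetesExecutor.
# The container must be named "base" so that it replaces the
# executor's main container.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kubernetes.client import models as k8s


def write_to_hdfs():
    print("This runs with the uid/gid baked into the overridden image.")


with DAG(dag_id="image_override_example", start_date=datetime(2024, 10, 1), schedule=None):
    PythonOperator(
        task_id="write_to_hdfs",
        python_callable=write_to_hdfs,
        executor_config={
            "pod_override": k8s.V1Pod(
                spec=k8s.V1PodSpec(
                    containers=[
                        k8s.V1Container(
                            name="base",
                            # Hypothetical per-instance image variant:
                            image="docker-registry.wikimedia.org/airflow-analytics:latest",
                        )
                    ]
                )
            )
        },
    )
```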

In the meantime, the combination of these two measures gives us a relatively easy way forward that does not depend on user impersonation.

Change #1076764 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Allow overriding the airflow executor pod image

https://gerrit.wikimedia.org/r/1076764

btullis merged https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/17

Build multiple airflow images with specific uid/gid values for task runners

Change #1076764 merged by jenkins-bot:

[operations/deployment-charts@master] Allow overriding the airflow executor pod image

https://gerrit.wikimedia.org/r/1076764

Change #1076789 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use the latest airflow images by default

https://gerrit.wikimedia.org/r/1076789

Change #1076793 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Airflow: update the kerberosexecutor settings with specified image tags

https://gerrit.wikimedia.org/r/1076793

Change #1076793 merged by jenkins-bot:

[operations/deployment-charts@master] Airflow: update the kerberosexecutor settings with specified image tags

https://gerrit.wikimedia.org/r/1076793

Change #1076789 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use the latest airflow images by default

https://gerrit.wikimedia.org/r/1076789

Change #1076832 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use latest image

https://gerrit.wikimedia.org/r/1076832

Change #1076832 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use latest image

https://gerrit.wikimedia.org/r/1076832

Change #1076844 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: update PYTHONPATH and set executor_pod_image_version

https://gerrit.wikimedia.org/r/1076844

Change #1076844 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: update PYTHONPATH and set executor_pod_image_version

https://gerrit.wikimedia.org/r/1076844

Change #1077037 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use the analytics user for task executor pods

https://gerrit.wikimedia.org/r/1077037

Change #1077037 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use the analytics user for task executor pods

https://gerrit.wikimedia.org/r/1077037

This seems to be working so far.
I have overridden the image in the pod template for the test-k8s instance to use the analytics variant, and the Unix user id is reported correctly. The DAGs are still running.

[screenshot: image.png, 145 KB]

It might be beneficial to change the owner field of the DAGs too, but we will see.
[screenshot: image.png, 117 KB]

I'll mark this task as waiting, pending further work on T375871: Integrate Airflow with Kerberos, after which we should be able to start testing Skein jobs.

The Kerberos token renewer has now been deployed to our Airflow instances. We now need Spark/Skein support in order to test this; cf. T377928.

I think that we have done enough to say that we can resolve this ticket.
We have HDFS interaction working from an Airflow scheduler pod with the right uid/gid values, as demonstrated here: T377602#10295327.

If we discover any more issues with file system ownership, we can address them in subsequent tickets.