Our current Airflow instances all run as particular system users, as specified in Puppet:
| Airflow instance | user name | user id | group name | group id |
| --- | --- | --- | --- | --- |
| analytics | analytics | 906 | analytics | 906 |
| analytics_test | analytics | 906 | analytics | 906 |
| search | analytics-search | 911 | analytics-search | 911 |
| research | analytics-research | 912 | analytics-research | 912 |
| platform_eng | analytics-platform-eng | 913 | analytics-platform-eng | 913 |
| analytics_product | analytics-product | 910 | analytics-product | 910 |
| wmde | analytics-wmde | 927 | analytics-wmde | 927 |
When the DAGs running on these instances interact with files on HDFS, with or without Hive or Presto as an intermediary, they do so as the configured system user.
This model of running Airflow ensures that any HDFS files generated are owned by that system user.
The group ownership of any file created is the same as that of the parent directory. In most cases, this group is analytics-privatedata-users.
See https://hadoop.apache.org/docs/r2.10.2/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Overview for reference on this group inheritance:
> When a file or directory is created, its owner is the user identity of the client process, and its group is the group of the parent directory (the BSD rule).
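As an illustration of that rule, here is a minimal sketch using the hdfs CLI from Python. The /tmp/group-inheritance-demo path is hypothetical, and it assumes a shell with a valid Kerberos ticket:

```python
# Sketch only: demonstrates HDFS group inheritance (the "BSD rule") using the
# hdfs CLI via subprocess. The /tmp/group-inheritance-demo path is hypothetical.
import subprocess

def hdfs(*args: str) -> str:
    """Run an hdfs dfs subcommand and return its stdout."""
    result = subprocess.run(
        ["hdfs", "dfs", *args], check=True, capture_output=True, text=True
    )
    return result.stdout

# Show the owner/group of the parent directory (in most of our cases the
# group is analytics-privatedata-users).
print(hdfs("-ls", "-d", "/tmp"))

# Create a new directory; its group is inherited from the parent directory,
# regardless of the primary group of the creating user.
hdfs("-mkdir", "/tmp/group-inheritance-demo")
print(hdfs("-ls", "-d", "/tmp/group-inheritance-demo"))

# Clean up.
hdfs("-rm", "-r", "-skipTrash", "/tmp/group-inheritance-demo")
```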
The permissions of the generated files are governed by the umask of the process. We have a default umask of 027 set in /etc/hadoop/conf/core-site.xml:

<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>
This means that generated files are readable by other members of the group, but not by anyone else, which is the intended behaviour.
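For reference, the effective modes that a 027 umask produces can be checked with a quick calculation:

```python
# Worked example: effective permissions under fs.permissions.umask-mode=027.
UMASK = 0o027

file_mode = 0o666 & ~UMASK   # files are created from a 666 base
dir_mode = 0o777 & ~UMASK    # directories are created from a 777 base

print(oct(file_mode))  # 0o640 -> rw-r----- : group members can read, others cannot
print(oct(dir_mode))   # 0o750 -> rwxr-x--- : group members can list and traverse
```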
In addition to this, each Airflow instance has its own Kerberos principal, as follows:
btullis@krb1001:~$ sudo kadmin.local list_principals|egrep 'airflow|analytics/an-launcher1002'
analytics-platform-eng/an-airflow1004.eqiad.wmnet@WIKIMEDIA
analytics-product/an-airflow1006.eqiad.wmnet@WIKIMEDIA
analytics-research/an-airflow1002.eqiad.wmnet@WIKIMEDIA
analytics-search/an-airflow1001.eqiad.wmnet@WIKIMEDIA
analytics-search/an-airflow1005.eqiad.wmnet@WIKIMEDIA
analytics-wmde/an-airflow1007.eqiad.wmnet@WIKIMEDIA
analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA
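For context, ticket acquisition from one of these principals looks roughly like the following. This is a sketch only: the keytab path is hypothetical, and in practice the `airflow kerberos` ticket renewer (configured via the [kerberos] section of airflow.cfg) would keep the credential cache fresh rather than tasks calling kinit themselves.

```python
# Sketch only: how a worker obtains a Kerberos ticket from its instance keytab.
# The keytab path below is hypothetical.
import subprocess

PRINCIPAL = "analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA"
KEYTAB = "/etc/security/keytabs/analytics/analytics.keytab"  # hypothetical path

subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)
subprocess.run(["klist"], check=True)  # show the acquired ticket
```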
When migrating Airflow to Kubernetes, we need to ensure that we maintain compatibility with this system, so that DAGs can continue to interact with Hadoop and HDFS in the same way that they do now.
We will be using the KubernetesExecutor, which means that only the worker pods launched by this executor need to have the correct credentials configured.
The scheduler itself will not need to be running as the system user, since it will not be interacting with HDFS or any Hadoop services directly.
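One possible approach (a sketch, not a decision) is to set the pod-level security context per task via the KubernetesExecutor pod_override mechanism, so that each task pod runs with the owning instance's uid/gid, e.g. 906/906 for the analytics instance:

```python
# Sketch only: forcing a task pod to run as the analytics system user (906:906)
# via the KubernetesExecutor pod_override. Task id and command are illustrative.
from kubernetes.client import models as k8s

analytics_pod_override = k8s.V1Pod(
    spec=k8s.V1PodSpec(
        security_context=k8s.V1PodSecurityContext(
            run_as_user=906,   # analytics uid
            run_as_group=906,  # analytics gid
            fs_group=906,      # so mounted volumes (e.g. keytabs) are group-readable
        ),
        containers=[
            k8s.V1Container(name="base"),  # "base" is the task container's name
        ],
    )
)

# Usage on a task:
# some_task = BashOperator(
#     task_id="write_to_hdfs",
#     bash_command="hdfs dfs -ls /wmf/data",
#     executor_config={"pod_override": analytics_pod_override},
# )
```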
This task is about deciding on a suitable method and implementing it.
There are some useful resources to consider:
- https://airflow.apache.org/docs/apache-airflow/stable/security/workload.html#impersonation (see the run_as_user sketch after this list)
- https://airflow.apache.org/docs/apache-airflow/stable/security/kerberos.html#hadoop
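For comparison, the impersonation mechanism linked above amounts to setting run_as_user on a task, so that the worker switches to that system user before executing it. A minimal sketch, with illustrative DAG id and command:

```python
# Sketch only: Airflow task-level impersonation via run_as_user, as described in
# the workload security documentation linked above. The DAG id, schedule and
# command are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="impersonation_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="whoami_as_analytics",
        bash_command="id && hdfs dfs -ls /wmf/data",
        run_as_user="analytics",  # the worker switches to this system user for the task
    )
```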
This work will overlap with T375871: Integrate Airflow with Kerberos and its sub-tasks.