We've come to realize that the UNIX user in the Airflow containers does _not_ matter in terms of HDFS access and permissions. What matters is the Kerberos principal.
The following is a message sent by @BTullis:
We get a shell on the test-k8s scheduler container:

    btullis@deploy2002:~$ kubectl exec -it airflow-scheduler-64f556f665-9qxmb -c airflow-production -- bash
    airflow@airflow-scheduler-64f556f665-9qxmb:/opt/airflow$ id
    uid=900(airflow) gid=900(airflow) groups=900(airflow)
We list our HDFS home directory, which is `/user/analytics`:

    airflow@airflow-scheduler-64f556f665-9qxmb:/opt/airflow$ hdfs dfs -ls|head -n 3
    Found 18 items
    drwx------ - analytics analytics 0 2024-11-07 00:00 .Trash
    drwx------ - analytics analytics 0 2023-01-25 17:28 .flink
We write a file to it:

    airflow@airflow-scheduler-64f556f665-9qxmb:/opt/airflow$ echo "Hello World" | hdfs dfs -put - hello.txt
It's owned by `analytics`:

    airflow@airflow-scheduler-64f556f665-9qxmb:/opt/airflow$ hdfs dfs -ls hello.txt
    -rw-r----- 3 analytics analytics 12 2024-11-07 12:39 hello.txt
We run a `spark3-submit` job as this user and it works:

    airflow@airflow-scheduler-64f556f665-9qxmb:/opt/airflow$ spark3-submit --master yarn --deploy-mode cluster /usr/local/lib/python3.9/site-packages/pyspark/examples/src/main/python/pi.py 10
It's all based on the default principal in the keytab:

    airflow@airflow-scheduler-64f556f665-9qxmb:/opt/airflow$ klist|head -n2
    Ticket cache: FILE:/tmp/airflow_krb5_ccache/krb5cc
    Default principal: analytics/airflow-test-k8s.discovery.wmnet@WIKIMEDIA
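To check which principals the keytab actually holds, we can list it directly (the keytab path below is a placeholder; substitute wherever the keytab is mounted in the container):

    # List every principal (and key version) stored in the keytab.
    klist -kt /path/to/airflow.keytab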
I bet that if we had multiple principals in the keytab (or multiple keytabs), we could run as different users by passing the `spark.kerberos.principal` parameter to Spark:
https://spark.apache.org/docs/3.5.2/running-on-yarn.html#yarn-specific-kerberos-configuration
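A minimal sketch of what that might look like (untested; the second principal and keytab path are assumptions, not things that exist today):

    # Same example job, but authenticating as a different (hypothetical) principal
    # read from a separate (hypothetical) keytab, via the Spark Kerberos settings.
    spark3-submit --master yarn --deploy-mode cluster \
      --conf spark.kerberos.principal=analytics-other/airflow-test-k8s.discovery.wmnet@WIKIMEDIA \
      --conf spark.kerberos.keytab=/path/to/analytics-other.keytab \
      /usr/local/lib/python3.9/site-packages/pyspark/examples/src/main/python/pi.py 10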
So, in light of that information, let's:
- only publish a single image in https://gitlab.wikimedia.org/repos/data-engineering/airflow/
- rewrite https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#UNIX_user_impersonation to mention the kerberos keytab and not the UNIX user
- deploy the repos/data-engineering/airflow image in all instances
- remove all `fsGroup: 900` instances from charts/airflow if we can (see the grep sketch after this list)
- remove the obsolete docker images
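To locate the `fsGroup: 900` occurrences mentioned above, a simple grep over the chart should be enough (assuming a local checkout; the path is as referenced above):

    # Locate every fsGroup: 900 setting in the airflow chart.
    grep -rn 'fsGroup: 900' charts/airflow/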