
Enable the airflow task/scheduler pods to communicate with our Hadoop clusters
Closed, Resolved · Public

Description

We are currently starting to experiment with running jobs on Yarn via Skein in our Kubernetes Airflow instances (https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/887).

Skein will need the core-site.xml, hdfs-site.xml and yarn-site.xml configuration files to be deployed locally in order to reach Hadoop, and appropriate network policies will need to be put in place.
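As a rough sketch of what this could look like (all names and values below are hypothetical illustrations, not the actual deployment-charts implementation), the three Hadoop client configuration files can be shipped to the pods as a ConfigMap mounted at the location the Hadoop/Skein clients read via HADOOP_CONF_DIR:

```yaml
# Hypothetical sketch: render the Hadoop client configs into a ConfigMap.
# The ConfigMap name and the fs.defaultFS value are assumptions for
# illustration only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-client-config   # hypothetical name
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://analytics-hadoop</value><!-- assumed HDFS nameservice -->
      </property>
    </configuration>
  hdfs-site.xml: |
    <configuration><!-- HDFS client settings go here --></configuration>
  yarn-site.xml: |
    <configuration><!-- YARN client settings go here --></configuration>
```

The airflow containers would then mount this ConfigMap (e.g. at /etc/hadoop/conf) and set HADOOP_CONF_DIR to that path, while a NetworkPolicy would allow egress from the scheduler/task pods to the Hadoop NameNode, DataNode and ResourceManager ports.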

Event Timeline

airflow@airflow-scheduler-5757cb9dfc-sdz9x:/opt/airflow$ hdfs dfs -ls /
Found 5 items
drwxr-xr-x   - hdfs hadoop          0 2024-11-06 06:15 /system
drwxrwxrwt   - hdfs hdfs            0 2024-11-06 13:15 /tmp
drwxrwxr-x   - hdfs hadoop          0 2024-10-30 11:38 /user
drwxr-xr-x   - hdfs hdfs            0 2014-07-11 21:47 /var
drwxrwxr-x   - hdfs hadoop          0 2022-02-10 15:29 /wmf

Change #1087903 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: render the spark/hadoop/hdfs/yarn configuration files

https://gerrit.wikimedia.org/r/1087903

Change #1087903 merged by Brouberol:

[operations/deployment-charts@master] airflow: render the spark/hadoop/hdfs/yarn configuration files

https://gerrit.wikimedia.org/r/1087903