Migrate the airflow scheduler components to Kubernetes
Closed, Resolved, Public

Description

As part of "Migrate Airflow to the dse-k8s cluster" - T362788

We will migrate the webserver components first, then the schedulers once we have carried out more testing.

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Don't propagate the container SPARK_HOME to the hadoop workers | repos/data-engineering/airflow-dags!929 | brouberol | T364389 | main
Ensure SPARK_HOME=/usr/lib/spark3 | repos/data-engineering/airflow!43 | brouberol | T364389 | main
Install missing libhdfs | repos/data-engineering/airflow!42 | brouberol | T364389 | main
Install libsasl libraries (take 3) | repos/data-engineering/airflow!41 | brouberol | T364389 | main
Install libsasl libraries (take 2) | repos/data-engineering/airflow!40 | brouberol | T364389 | main
Install libsasl libraries | repos/data-engineering/airflow!39 | brouberol | T364389 | main
Add missing dependencies | repos/data-engineering/airflow!38 | brouberol | T364389 | main
Cleanup scheduler logs as part of the purge_old_logs_from_s3 DAG | repos/data-engineering/airflow-dags!928 | brouberol | cleanup-scheduler-logs | main

Related Objects

Status | Assigned
Resolved | brouberol
Resolved | brouberol
Resolved | brouberol
Resolved | BTullis
Resolved | brouberol
Resolved | brouberol
Resolved | brouberol
Declined | brouberol
Resolved | brouberol
Resolved | brouberol
Resolved | brouberol
Resolved | brouberol
Resolved | amastilovic
Resolved | brouberol
Resolved | BTullis
Resolved | bking
Resolved | dcausse
Resolved | brouberol
Resolved | brouberol
Resolved | Stevemunene
Resolved | mfossati
Resolved | brouberol
Resolved | brouberol
Resolved | brouberol
Resolved | brouberol
Resolved | gmodena
Resolved | BTullis

Event Timeline

Gehel triaged this task as High priority. May 7 2024, 1:15 PM
Gehel moved this task from Incoming to Epics on the Data-Platform-SRE board.

Change #1075165 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] WIP: enable Kubernetes executor

https://gerrit.wikimedia.org/r/1075165

As the Airflow scheduler will need to be able to create/get/list/delete/watch Pods as part of the task lifecycle, it will need to have the associated permissions through a dedicated ServiceAccount. However, experience has shown that we can't create Role or ClusterRole resources within a chart, as the deploy role cannot manage them. These resources must be defined in admin_ng.

Having talked to @JMeybohm, it appears that a middle-ground solution could be:

  • declare an airflow-scheduler ClusterRole in admin_ng
  • declare an airflow-<instance-name>-scheduler ServiceAccount and a RoleBinding in the airflow chart, with the RoleBinding binding the airflow-scheduler ClusterRole to the airflow-<instance-name>-scheduler ServiceAccount
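For illustration, the proposed layout can be sketched as plain Kubernetes manifests built in Python. This is only a sketch of the idea, not the content of the actual chart or of admin_ng: the instance name "test" and the namespace "airflow-test" are hypothetical, and the rule set is limited to the Pod verbs listed in the comment above.

```python
import json

# Verbs the scheduler needs on Pods, as listed in the comment above.
POD_VERBS = ["create", "get", "list", "delete", "watch"]


def scheduler_name(instance):
    """Name shared by the ServiceAccount and the RoleBinding of one instance."""
    return f"airflow-{instance}-scheduler"


def cluster_role(name="airflow-scheduler"):
    """Cluster-wide role; in the proposal this is the part declared in admin_ng."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        # Pods belong to the core API group, written as the empty string.
        "rules": [{"apiGroups": [""], "resources": ["pods"], "verbs": POD_VERBS}],
    }


def service_account(instance, namespace):
    """Per-instance identity; in the proposal this is declared in the chart."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": scheduler_name(instance), "namespace": namespace},
    }


def role_binding(instance, namespace, cluster_role_name="airflow-scheduler"):
    """Namespaced binding of the shared ClusterRole to the instance's ServiceAccount."""
    name = scheduler_name(instance)
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": name, "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role_name,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": name, "namespace": namespace}
        ],
    }


if __name__ == "__main__":
    # Hypothetical instance and namespace, for illustration only.
    manifests = [
        cluster_role(),
        service_account("test", "airflow-test"),
        role_binding("test", "airflow-test"),
    ]
    print(json.dumps(manifests, indent=2))
```

Because the binding is a RoleBinding rather than a ClusterRoleBinding, the permissions of the shared ClusterRole only apply inside the namespace of the instance that declares the binding.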

Change #1075508 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Deploy an airflow-scheduler ClusteRole to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1075508

After some testing, it appears that the deploy user cannot handle RoleBinding resources either, making the previously stated solution a dead end.

Change #1075508 merged by Brouberol:

[operations/deployment-charts@master] Specify a custom deploy clusterrole for airflow namespaces in dse

https://gerrit.wikimedia.org/r/1075508

We have been able to run our first task on Kubernetes! While the task is still running, its logs are tailed via the Kubernetes API; the logs are then uploaded to S3.

Screenshot 2024-09-25 at 16.15.21.png (2×2 px, 551 KB)

Screenshot 2024-09-25 at 16.15.15.png (1×2 px, 420 KB)

Change #1075165 merged by Brouberol:

[operations/deployment-charts@master] Enable the usage of the Kubernetes executor

https://gerrit.wikimedia.org/r/1075165

The work needed to let us migrate from LocalExecutor to KubernetesExecutor is done. I'm going to send this task back to our backlog, as the actual migration won't happen for a while.
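As a rough sketch of what the switch amounts to on the Airflow side, the executor and the remote log upload can be expressed with Airflow's standard AIRFLOW__SECTION__KEY environment variables. The actual settings live in the chart in operations/deployment-charts and are not reproduced here; the bucket URL and the connection id below are hypothetical placeholders.

```python
def airflow_env(section, key):
    """Build the environment variable name Airflow reads for [section] key."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def kubernetes_executor_env(remote_log_folder, remote_log_conn_id):
    """Environment for a scheduler using KubernetesExecutor with remote logs.

    Both arguments are placeholders supplied by the caller; they are not
    the values used in the real deployment.
    """
    return {
        # Switch from LocalExecutor to KubernetesExecutor.
        airflow_env("core", "executor"): "KubernetesExecutor",
        # Upload task logs to remote storage once they are written.
        airflow_env("logging", "remote_logging"): "True",
        airflow_env("logging", "remote_base_log_folder"): remote_log_folder,
        airflow_env("logging", "remote_log_conn_id"): remote_log_conn_id,
    }


if __name__ == "__main__":
    # Hypothetical bucket and connection id, for illustration only.
    env = kubernetes_executor_env("s3://example-bucket/logs", "example_s3_conn")
    for name, value in sorted(env.items()):
        print(f"{name}={value}")
```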

BTullis renamed this task from Migrate the LocalExecutor to KubernetesExecutor to Migrate the airflow scheduler components to Kubernetes. Sep 27 2024, 3:37 PM
BTullis updated the task description.

Change #1097424 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: open port 8600 (webserver) for airflow services

https://gerrit.wikimedia.org/r/1097424

Change #1097424 merged by Brouberol:

[operations/puppet@production] global_config: open port 8600 (webserver) for airflow services

https://gerrit.wikimedia.org/r/1097424

Change #1097430 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] airflow: enable port 8600 to be reached from Kubernetes

https://gerrit.wikimedia.org/r/1097430

Change #1097430 merged by Brouberol:

[operations/puppet@production] airflow: enable port 8600 to be reached from Kubernetes

https://gerrit.wikimedia.org/r/1097430

brouberol claimed this task.
brouberol moved this task from Quarterly Goals to Done on the Data-Platform-SRE board.