Migrate Airflow to the dse-k8s cluster
Closed, Resolved (Public)

Description

We have decided that we would like to migrate Airflow to Kubernetes.

The project proposal is here: Airflow: High-Availability Strategy (currently a Google Doc).

This epic ticket is intended to track the progress of the project. There are several phases or streams of work involved.

Ensure that the cephosd cluster is production-ready

T364386: Validate postgres operator and Ceph integration

  • Update ceph package versions to the most recent stable version T362993
  • Complete the Ceph container storage integration with the dse-k8s cluster T327259
  • Decide on which postgresql operator to use T362999 - We have decided on https://cloudnative-pg.io/
  • Build the required container images for the cloudnativepg postgresql operator T364795
  • Create a helm chart for the cloudnativepg postgresql operator T364797
  • Deploy the cloudnativepg postgresql operator
  • Validate that postgresql clusters can be created as required
  • Test and validate the performance characteristics of postgresql clusters created by the operator
  • Test and validate the resilience and recovery characteristics of postgresql clusters created by the operator

T364387: Adapt Airflow auth and DAG deployment method

  • Create an airflow container image using blubber/kokkuri T363000
  • Create an airflow chart that is appropriate to our needs T363001
  • Create an LDAP group to use for testing Airflow on k8s T363003
  • Create a git-sync container image to be used with airflow T368757
  • Configure airflow authentication with OIDC
  • Decide how airflow DAG deployment will work under Kubernetes T368033
  • Create DNS aliases for the new airflow instances
  • Allow inbound web traffic to the new airflow instances

T379267: Migrate the airflow webservers to Kubernetes

We will migrate the Airflow webserver services first, whilst leaving the schedulers running on their existing hosts.

  • Deploy a new test-k8s instance to dse-k8s
  • Migrate analytics_test airflow webserver to dse-k8s T374948
  • Migrate analytics airflow webserver to dse-k8s T378439
  • Migrate platform_eng airflow webserver to dse-k8s T378443
  • Migrate search airflow webserver to dse-k8s T378441
  • Migrate research airflow webserver to dse-k8s T378442
  • Migrate product analytics airflow webserver to dse-k8s T378440
  • Migrate wmde airflow webserver to dse-k8s T378438

T375871: Integrate Airflow with Kerberos

  • Deploy the kerberos token renewer in the Airflow chart T375875
  • Ensure that the filesystem permissions are correctly configured when Airflow jobs interact with Hadoop and HDFS T375895
  • Ensure that we can submit spark jobs via spark3-submit from airflow T377928
  • Validate that we can submit Spark jobs with Skein in Kubernetes T377602

T364389: Migrate the airflow scheduler components to Kubernetes

  • Decide whether to use the KubernetesExecutor, the LocalKubernetesExecutor, or both
  • Add the airflow scheduler to its helm chart T368737
  • Migrate all instances to the new executor model

Event Timeline

Change #1081082 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: reflect recent changes in MX server hostnames

https://gerrit.wikimedia.org/r/1081082

Change #1081082 merged by Brouberol:

[operations/deployment-charts@master] airflow: reflect recent changes in MX server hostnames

https://gerrit.wikimedia.org/r/1081082

brouberol changed the status of subtask Restricted Task from Open to In Progress. Nov 8 2024, 10:29 AM
brouberol closed subtask Restricted Task as Resolved. Nov 12 2024, 4:55 PM
brouberol updated the task description.

Change #1102312 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: enable the support of multiple executors

https://gerrit.wikimedia.org/r/1102312

Change #1102312 merged by Brouberol:

[operations/deployment-charts@master] airflow: enable the support of multiple executors

https://gerrit.wikimedia.org/r/1102312

Change #1134283 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Reduce the verbosity of pgbouncer logs in airflow deployments

https://gerrit.wikimedia.org/r/1134283

Change #1134283 merged by jenkins-bot:

[operations/deployment-charts@master] Reduce the verbosity of pgbouncer logs in airflow deployments

https://gerrit.wikimedia.org/r/1134283

brouberol moved this task from Done to Quarterly Goals on the Data-Platform-SRE board.

We're so close to closing this epic. How can we best get over the line?
Shall we move some tickets out to a follow-up tracking ticket, or can we just close this with some tasks still open?