We have decided to migrate Airflow to Kubernetes.
The project proposal is here: Airflow: High-Availability Strategy (currently a Google Doc).
This epic ticket tracks the progress of the project, which involves several phases or streams of work.
Ensure that the cephosd cluster is production-ready
- T369582 Enable prometheus metrics on the cephosd cluster
- T369583 Configure availability and health monitoring for the cephosd cluster
- T330152 Deploy ceph radosgw processes to data-engineering cluster
- T330153 Configure load-balancing appropriate for ceph radosgw services on the data-engineering cluster
- T372783 Verify that cephosd* server reimages work without adversely affecting cluster availability
T364386: Validate postgres operator and Ceph integration
- Update ceph package versions to the most recent stable version T362993
- Complete the Ceph container storage integration with the dse-k8s cluster T327259
- Decide on which postgresql operator to use T362999 - We have decided on https://cloudnative-pg.io/
- Build the required container images for the cloudnativepg postgresql operator T364795
- Create a helm chart for the cloudnativepg postgresql operator T364797
- Deploy the cloudnativepg postgresql operator
- Validate that postgresql clusters can be created as required
- Test and validate the performance characteristics of postgresql clusters created by the operator
- Test and validate the resilience and recovery characteristics of postgresql clusters created by the operator
T364387: Adapt Airflow auth and DAG deployment method
- Create an airflow container image using blubber/kokkuri T363000
- Create an airflow chart that is appropriate to our needs T363001
- Create an LDAP group to use for testing Airflow on k8s T363003
- Create a git-sync container image to be used with airflow T368757
- Configure airflow authentication with OIDC
- Decide how airflow DAG deployment will work under Kubernetes T368033
- Create DNS aliases for the new airflow instances
- Allow inbound web traffic to the new airflow instances
T379267: Migrate the airflow webservers to Kubernetes
We will migrate the Airflow webserver services first, whilst leaving the schedulers running on their existing hosts.
- Deploy a new test-k8s instance to dse-k8s
- Migrate analytics_test airflow webserver to dse-k8s T374948
- Migrate analytics airflow webserver to dse-k8s T378439
- Migrate platform_eng airflow webserver to dse-k8s T378443
- Migrate search airflow webserver to dse-k8s T378441
- Migrate research airflow webserver to dse-k8s T378442
- Migrate product analytics airflow webserver to dse-k8s T378440
- Migrate wmde airflow webserver to dse-k8s T378438
T375871: Integrate Airflow with Kerberos
- Deploy the kerberos token renewer in the Airflow chart T375875
- Ensure that the filesystem permissions are correctly configured when Airflow jobs interact with Hadoop and HDFS T375895
- Ensure that we can submit spark jobs via spark3-submit from airflow T377928
- Validate that we can submit Spark jobs with Skein in Kubernetes T377602
T364389: Migrate the airflow scheduler components to Kubernetes
- Decide whether to use the KubernetesExecutor, the LocalKubernetesExecutor, or both
- Add the airflow scheduler to its helm chart T368737
- Migrate all instances to the new executor model