Migrate airflow-analytics-test webserver to Kubernetes
Closed, Resolved, Public

Description

We can start delivering value to the teams even if we haven't figured everything out. By deploying the webserver pods to Kubernetes and integrating them with the an-db1001 database, we provide everyone with a public wikimedia.org subdomain and OIDC authentication, instead of relying on SSH tunneling to access the Airflow UI.

For each instance:

  • Allow ingress traffic to the airflow instances coming from the DSE_KUBEPODS subnet
  • Create the k8s namespaces
  • Create the k8s user kubeconfigs
  • Create the wikimedia.org public subdomains
  • Create the OIDC/CAS configuration
  • Create the config section in the private repo
  • Deploy the application
    • modify modules/profile/manifests/airflow.pp to support an optional secret_key secret and, if it is set, populate the webserver.secret_key config with it
    • add the secret key already found in /etc/helmfile-defaults/private/dse-k8s_services/airflow-analytics-test/dse-k8s-eqiad.yaml on the deployment secret to /srv/git/private/hieradata
  • Enable ATS traffic redirection and caching

Instances:

  • airflow-analytics-test

[ ] airflow-analytics
[ ] airflow-analytics-product
[ ] airflow-search
[ ] airflow-research
[ ] airflow-platform-eng
[ ] airflow-wmde

Update: I've reduced the scope to the initial test instance and will split out the production instances to their own tasks (or logically associate multiple instances in the same task, if that makes sense).

Procedure to follow: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Migrating_an_existing_instance

Details

Other Assignee
brouberol
Repo | Branch | Lines +/-
operations/puppet | production | +16 -6
labs/private | master | +3 -2
operations/deployment-charts | master | +4 -2
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -1
operations/puppet | production | +5 -0
operations/deployment-charts | master | +1 -3
operations/deployment-charts | master | +1 -28
operations/deployment-charts | master | +0 -24
operations/puppet | production | +3 -51
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +19 -15
operations/deployment-charts | master | +8 -2
operations/puppet | production | +8 -0
labs/private | master | +2 -0
operations/deployment-charts | master | +100 -0
operations/puppet | production | +18 -1
operations/deployment-charts | master | +29 -0

Event Timeline

Here's a brain dump of the things I think we need to do once:

  • Allow ingress traffic to the airflow instances coming from the DSE_KUBEPODS subnet

Here are the things we need to do for each airflow instance.

  • Create the k8s namespaces
  • Create the k8s user kubeconfigs
  • Create the wikimedia.org public subdomains
  • Create the OIDC/CAS configuration
  • Create the config section in the private repo
dse-k8s:
    ...
    airflow-<instance-name>:
      dse-k8s-eqiad:
        config:
          private:
            airflow__core__fernet_key: <random 64 characters>
            airflow__webserver__secret_key: <random 64 characters>
          airflow:
            postgresqlPass: <PG password>
          oidc:
            client_secret: <OIDC client secret>
  • Create the airflow helmfile/values by taking inspiration from airflow-test-k8s. We need to specify the following values:
config:
  airflow:
    dbName: '<name of DB on an-db1001>'
    dbUser: '<user used on an-db1001>'
    dags_folder: <dags folder that should be cloned from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/>
    instance_name: <dash separated name without the 'airflow-' prefix. Ex: analytics-product>

scheduler: 
  enabled: false # we don't currently run the scheduler in Kubernetes
  remote_host: <ip of the airflow instance>
  remote_port: <port of the airflow scheduler on the instance>

postgresql:
  cloudnative: false  # we don't currently run PG in Kubernetes

kerberos:
  enabled: false  # we don't currently run the Kerberos token renewer in Kubernetes. This config does not do anything atm, but will be useful in the future, when the chart supports it

external_services:
  postgresql: [analytics]

ingress:
  gatewayHosts:
    default: "airflow-<instance-name>"
    extraFQDNs:
    - airflow-<instance-name>.wikimedia.org

oidc:
  client_id: <OIDC client id>

Feel free to copy the helmfile from helmfile.d/dse-k8s-services/airflow-test-k8s/ and adapt the namespace / instance

  • Deploy the application
  • Enable ATS traffic redirection and caching
NOTE: Instructions to create a new Airflow instance from scratch are available at https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Creating_a_new_instance, but these are slightly different because we're migrating an existing instance.
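
The naming conventions in the values above (instance_name without the 'airflow-' prefix, the gateway host, the public FQDN) all follow from one name, so they can be derived rather than typed by hand. Here is a minimal Python sketch of that, assuming the namespace is called airflow-<instance-name>. The function names and dictionary keys are hypothetical and not part of the chart. The random_secret helper produces 64 random characters, which suits the webserver secret key placeholder; whether the fernet key needs a specific encoding is not covered by this sketch.

```python
import secrets

DOMAIN = "wikimedia.org"
PREFIX = "airflow-"


def instance_values(namespace: str) -> dict:
    """Derive the naming-related values for one instance from its namespace.

    Assumes the namespace looks like 'airflow-<instance-name>'.
    """
    if not namespace.startswith(PREFIX) or len(namespace) == len(PREFIX):
        raise ValueError(
            f"expected a name like 'airflow-<instance-name>', got {namespace!r}"
        )
    return {
        # Dash-separated name without the 'airflow-' prefix.
        "instance_name": namespace[len(PREFIX):],
        # Value used for ingress.gatewayHosts.default.
        "gateway_host": namespace,
        # Public subdomain listed under ingress.gatewayHosts.extraFQDNs.
        "extra_fqdn": f"{namespace}.{DOMAIN}",
    }


def random_secret(length: int = 64) -> str:
    """Return `length` random hexadecimal characters (length must be even)."""
    if length % 2:
        raise ValueError("length must be even")
    return secrets.token_hex(length // 2)


values = instance_values("airflow-analytics-test")
print(values["instance_name"])  # analytics-test
print(values["extra_fqdn"])     # airflow-analytics-test.wikimedia.org
print(len(random_secret()))     # 64
```

The secret comes from the secrets module rather than random, because it is meant to be unpredictable.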

We also need to think about how we assign Airflow roles, and to whom. This would work by creating LDAP groups and mapping them to Airflow roles via the config.airflow.auth.role_mappings value.
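
To make the group-to-role idea concrete, here is a small Python sketch of how such a mapping could be resolved for a user. The exact format of config.airflow.auth.role_mappings is not shown in this task, so the dictionary shape, the airflow-analytics-test-admins group name and the choice of roles are all assumptions for illustration only. Admin and Viewer are standard Airflow role names.

```python
# Hypothetical mapping from LDAP group name to Airflow roles.
# The real value format of config.airflow.auth.role_mappings may differ.
ROLE_MAPPINGS = {
    "airflow-analytics-test-admins": ["Admin"],  # hypothetical per-instance admin group
    "wmf": ["Viewer"],
    "nda": ["Viewer"],
}


def resolve_roles(user_groups, role_mappings):
    """Return the sorted, de-duplicated Airflow roles granted by a user's LDAP groups."""
    roles = set()
    for group in user_groups:
        # Groups without a mapping grant no role.
        roles.update(role_mappings.get(group, []))
    return sorted(roles)


print(resolve_roles(["wmf", "airflow-analytics-test-admins"], ROLE_MAPPINGS))
print(resolve_roles(["some-other-group"], ROLE_MAPPINGS))
```

With a mapping like this, a user in several groups gets the union of the mapped roles, and a user in no mapped group gets none.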

Do we want to create an admin LDAP group for each airflow instance, and assign a couple of people to that group for each instance? Do we want to only have SRE be admins? Do we want anyone from nda and wmf to be able to access every Airflow instance?

I think this needs to be figured out as well before we can call the task done.

Gehel triaged this task as High priority. Sep 20 2024, 7:53 AM

Change #1075278 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] admin-ng: add airflow namespaces to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1075278

Change #1075321 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] airflow: allow traffic to webserver port from dse-k8s pods

https://gerrit.wikimedia.org/r/1075321

Change #1075278 merged by jenkins-bot:

[operations/deployment-charts@master] admin-ng: add airflow namespaces to dse-k8s-eqiad

https://gerrit.wikimedia.org/r/1075278

Change #1075321 merged by Bking:

[operations/puppet@production] airflow: allow traffic to webserver port from dse-k8s pods

https://gerrit.wikimedia.org/r/1075321

Change #1076775 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] dse-k8s-services: add airflow helmfile directory

https://gerrit.wikimedia.org/r/1076775

Change #1076775 abandoned by Bking:

[operations/deployment-charts@master] dse-k8s-services: add airflow helmfile directory

Reason:

will start over with recommended approach

https://gerrit.wikimedia.org/r/1076775

Change #1079292 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] idp.yaml: Add airflow service

https://gerrit.wikimedia.org/r/1079292

Change #1079296 had a related patch set uploaded (by Brouberol; author: Brouberol):

[labs/private@master] idp: add dummy client secret for aitflow_analytics_test

https://gerrit.wikimedia.org/r/1079296

Change #1079296 merged by Bking:

[labs/private@master] idp: add dummy client secret for aitflow_analytics_test

https://gerrit.wikimedia.org/r/1079296

Change #1079292 merged by Bking:

[operations/puppet@production] idp.yaml: Add airflow service

https://gerrit.wikimedia.org/r/1079292

Change #1079361 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] ATS: add mapping for airflow-analytics-test

https://gerrit.wikimedia.org/r/1079361

Change #1080273 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflo-analytics-test: define admin LDAP group -> role mapping

https://gerrit.wikimedia.org/r/1080273

Change #1080273 merged by Brouberol:

[operations/deployment-charts@master] airflow-analytics-test: define admin LDAP group -> role mapping

https://gerrit.wikimedia.org/r/1080273

Change #1080742 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytic-test: comment out the postgresql deployment

https://gerrit.wikimedia.org/r/1080742

Change #1080742 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-analytic-test: comment out the postgresql deployment

https://gerrit.wikimedia.org/r/1080742

Change #1081152 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] airflow-analytics-test: correct oidc mapping

https://gerrit.wikimedia.org/r/1081152

Change #1081152 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-analytics-test: correct oidc mapping

https://gerrit.wikimedia.org/r/1081152

Change #1081157 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Remove unused airflow kubernetes user credentials

https://gerrit.wikimedia.org/r/1081157

Change #1081161 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] admin_ng (dse-k8s): remove unused namespaces

https://gerrit.wikimedia.org/r/1081161

Change #1081157 merged by Bking:

[operations/puppet@production] Remove unused airflow kubernetes user credentials

https://gerrit.wikimedia.org/r/1081157

Change #1081161 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng (dse-k8s): remove unused namespaces

https://gerrit.wikimedia.org/r/1081161

Change #1081178 had a related patch set uploaded (by Bking; author: Bking):

[operations/deployment-charts@master] airflow: disable network egress by default

https://gerrit.wikimedia.org/r/1081178

Change #1081200 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: remove custom network policy when the scheduler is running outside Kubernetes

https://gerrit.wikimedia.org/r/1081200

Change #1081200 merged by Brouberol:

[operations/deployment-charts@master] airflow: remove custom network policy when the scheduler is running outside Kubernetes

https://gerrit.wikimedia.org/r/1081200

Change #1081210 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: fix missing configmap when the scheduler isn't deployed

https://gerrit.wikimedia.org/r/1081210

Change #1081210 merged by Brouberol:

[operations/deployment-charts@master] airflow: fix missing configmap when the scheduler isn't deployed

https://gerrit.wikimedia.org/r/1081210

Change #1079361 merged by Brouberol:

[operations/puppet@production] ATS: add mapping for airflow-analytics-test

https://gerrit.wikimedia.org/r/1079361

Change #1081228 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytic-test: fix OIDC client id

https://gerrit.wikimedia.org/r/1081228

Change #1081178 abandoned by Bking:

[operations/deployment-charts@master] airflow: disable network egress by default

Reason:

no longer needed

https://gerrit.wikimedia.org/r/1081178

Change #1081228 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-analytic-test: fix OIDC client id

https://gerrit.wikimedia.org/r/1081228

Change #1081230 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytic-test: disable remote logging

https://gerrit.wikimedia.org/r/1081230

Change #1081230 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-analytic-test: disable remote logging

https://gerrit.wikimedia.org/r/1081230

Change #1081261 had a related patch set uploaded (by Bking; author: Bking):

[labs/private@master] analytics_test_cluster: add secret

https://gerrit.wikimedia.org/r/1081261

Change #1081268 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] airflow: make 'secret_key' configurable

https://gerrit.wikimedia.org/r/1081268

Change #1081261 merged by Bking:

[labs/private@master] analytics_test_cluster: add secret

https://gerrit.wikimedia.org/r/1081261

Change #1081268 merged by Brouberol:

[operations/puppet@production] airflow: make 'secret_key' configurable

https://gerrit.wikimedia.org/r/1081268

The airflow-analytics-test webserver migration is now complete! The UI is reachable (and OIDC-authenticated) at https://airflow-analytics-test.wikimedia.org.

We have also documented the migration process (see the procedure linked in the task description) to make it easier to migrate the other instances.

brouberol updated the task description.
brouberol updated the task description.

I think migrating the test instance is a good acceptance criterion (AC) for this task; we can create one or more new tasks for migrating the production instances. Closing...

brouberol renamed this task from "Migrate airflow webservers to Kubernetes" to "Migrate airflow-analytics-test webserver to Kubernetes". Thu, Nov 7, 3:22 PM