Migrate analytics Airflow DAGs to k8s Airflow deployment
Closed, ResolvedPublic

Description

This Phab task describes proposed approaches to migrating the analytics Airflow DAGs to the analytics Airflow instance deployed in Kubernetes.

Terminology

an-launcher1002 - The existing production Airflow instance, managed by Puppet.
airflow.wikimedia.org - The main Kubernetes Airflow deployment running analytics DAGs; the target of this migration.
main-k8s - Equivalent to airflow.wikimedia.org.
test-k8s - The Kubernetes Airflow deployment we can use for testing.

Known issues

(Some of these issues have been identified during prior testing of search DAGs and documented here)

  • Hive CLI is not available. A DAG that uses a BashOperator to run a Hive command line (e.g. to run CREATE TABLE commands) will not work in the Kubernetes deployment of Airflow.
  • The refinery-drop-older-than script is not available.
  • All access to external URLs has to go through an Envoy sidecar host/port that runs alongside the Airflow Kubernetes pods. Because of the network security rules and firewalls that Kubernetes pods run behind, accessing external URLs is prohibited by default. Airflow DAGs need to reach external URLs in a number of scenarios (sourcing HQL files hosted on gitlab.wikimedia.org, accessing meta.wikimedia.org web services, etc.), so such DAGs will need their URLs modified to reference a pre-determined http://envoy:port endpoint in place of the original hostname/port combination.

This might require an additional Airflow library utility that automatically converts a URL depending on whether Airflow is running within a Kubernetes context or not. Note that we already have a function that provides that information, so the implementation should be relatively simple.
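
A minimal sketch of what such a URL-conversion utility could look like follows. Everything in it is an assumption made for illustration: the function names, the "envoy" host name, the port numbers in the mapping, and the use of the KUBERNETES_SERVICE_HOST environment variable (which Kubernetes sets in pods) as a stand-in for the existing detection function mentioned above.

```python
import os
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from external hostnames to Envoy listener ports.
# The real host name and ports are defined by the deployment, not here.
ENVOY_HOST = "envoy"
ENVOY_PORTS = {
    "gitlab.wikimedia.org": 6001,
    "meta.wikimedia.org": 6002,
}


def running_in_kubernetes(env=None):
    """Stand-in for the existing detection function (assumption).

    Kubernetes injects KUBERNETES_SERVICE_HOST into every pod.
    """
    env = os.environ if env is None else env
    return "KUBERNETES_SERVICE_HOST" in env


def to_envoy_url(url, in_kubernetes=None):
    """Rewrite an external URL to its Envoy endpoint when running in Kubernetes.

    Outside Kubernetes the URL is returned unchanged. Inside Kubernetes an
    unmapped hostname raises ValueError, so a missing mapping is reported
    immediately instead of surfacing later as a blocked connection.
    """
    if in_kubernetes is None:
        in_kubernetes = running_in_kubernetes()
    if not in_kubernetes:
        return url

    parts = urlsplit(url)
    port = ENVOY_PORTS.get(parts.hostname)
    if port is None:
        raise ValueError(f"No Envoy listener configured for {parts.hostname!r}")

    # Keep path, query and fragment; swap scheme and host for the Envoy endpoint.
    return urlunsplit(
        ("http", f"{ENVOY_HOST}:{port}", parts.path, parts.query, parts.fragment)
    )
```

With such a helper, a DAG would keep the original URL in its definition and wrap it in to_envoy_url(...), so the same DAG file could run unchanged on both an-launcher1002 and the Kubernetes instances.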

Testing in test-k8s

When testing DAGs in test-k8s, take special care that every output is written to a designated temporary namespace/table/DB, so that nothing written during a test run can interfere with production data or results.

  • If a DAG is running a SQL query, ensure that the destination_table parameter is replaced with some pre-determined temporary table
  • If a DAG is exporting to Apache Druid, ensure that the target druid_datasource is a temporary one
  • Etc.

Most of these configuration settings are available as DAG properties/variables, and so the goal should be to limit changes to these properties and refrain from modifying the Airflow operators or HQL files directly. If that's not possible, then we should take the opportunity to fix the issue and extract the destination paths into a DAG variable.
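
The guard below is a rough sketch of how the "override properties only" rule could be enforced before a test run. The function name, the shape of the properties dictionary and the example table/datasource names are hypothetical; only destination_table and druid_datasource come from the list above.

```python
# Property names treated as output destinations. The two entries come from
# the checklist above; a real list would likely be longer (assumption).
OUTPUT_KEYS = ("destination_table", "druid_datasource")


def apply_test_overrides(props, overrides, output_keys=OUTPUT_KEYS):
    """Return a copy of the DAG properties with test destinations applied.

    Raises ValueError if any output property present in `props` is not
    overridden, or is overridden with its production value, so a test run
    cannot silently write to a production destination.
    """
    missing = [k for k in output_keys if k in props and k not in overrides]
    if missing:
        raise ValueError(f"Output properties not overridden: {missing}")

    unchanged = [k for k in output_keys if k in props and overrides[k] == props[k]]
    if unchanged:
        raise ValueError(f"Overrides still point at production: {unchanged}")

    # The original dictionary is left untouched; non-output properties pass through.
    return {**props, **overrides}


# Example with made-up values:
production_props = {
    "destination_table": "wmf.example_table",
    "druid_datasource": "example_datasource",
    "hql_path": "example.hql",
}
test_props = apply_test_overrides(
    production_props,
    {
        "destination_table": "tmp_test.example_table",
        "druid_datasource": "test_example_datasource",
    },
)
```

Keeping the overrides in one dictionary per DAG leaves the operators and HQL files untouched, which is the goal stated above.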

General approach

While the tests performed so far on a Kubernetes-provided Airflow instance have been fairly extensive and have covered most of the existing usage patterns, we should nevertheless ensure that each DAG we migrate has completed at least one successful test run before we commit to migrating it to production.

The sheer number of DAGs that need to be put through a test run in a Kubernetes environment dictates the migration strategies we could choose from:

1. Keep an-launcher1002 and airflow.wikimedia.org (main-k8s) running in parallel, migrate DAGs one-by-one

Steps to perform:

  1. Migrate the Airflow DB and logs to Kubernetes
  2. For each DAG, one at a time: test it in test-k8s, apply the necessary modifications, then migrate it to main-k8s.
  3. Once all DAGs have been migrated, turn an-launcher1002 off.

Pro

  • Migration is not overwhelming, can be stretched over a longer period of time
  • SRE involved mostly in the beginning of the effort, later work can be done by DE alone

Contra

  • The longer this migration takes, the greater the log and Airflow DB discrepancy between an-launcher1002 and main-k8s. Logs and DB changes generated after the initial migration would be lost.
  • Risk of migration becoming only partially successful (i.e. some DAGs migrated, some DAGs encounter a blocker)

1a. Keep an-launcher1002 and main-k8s running in parallel, start main-k8s from scratch, migrate DAGs one-by-one

Steps to perform:

  1. Deploy main-k8s to Kubernetes from scratch: the Airflow DB and logs are initialized empty, so no job history is carried over.
  2. For each DAG, one at a time: test it in test-k8s, apply the necessary modifications, then migrate it to main-k8s.
  3. Once all DAGs have been migrated, keep an-launcher1002 alive but with all migrated DAGs paused/turned off.

Pro

  • Migration is not overwhelming, can be stretched over a longer period of time
  • SRE involved mostly in the beginning of the effort, later work can be done by DE alone
  • DAG run history is preserved, but split into two realms: "before migration" on the dormant an-launcher1002 and "after migration" on main-k8s

Contra

  • Risk of migration becoming only partially successful (i.e. some DAGs migrated, some DAGs encounter a blocker)

NOTE: Migration to main-k8s also includes migrating away from the analytics/dags directory to the main/dags directory.

2. Prepare all DAGs for migration, turn an-launcher1002 off, migrate, turn main-k8s on

Steps to perform:

  1. Test all DAGs in test-k8s. Do not advance migration until all blocking issues are resolved.
  2. Turn an-launcher1002 off (stop production).
  3. Migrate Airflow DB and job logs to main-k8s.
  4. Migrate tested DAGs from test-k8s into main-k8s.

Pro

  • All tests would be performed, and any related issues identified, beforehand, reducing the risk of a partially successful migration
  • No discontinuity of Airflow DB and job logs

Contra

  • Developer effort concentrated into a much smaller timeframe
  • Requires both SRE and DE folks to be present, especially for the work that starts when the existing Airflow instance is turned off and for the tasks that follow. We should therefore prefer folks in close-enough time zones, and pick a day/time when all of them are available.
  • Unlikely, but if some issues arise only after migrating to airflow-analytics.k8s, there would be a much greater urgency to fix them, since production is stopped

Details

Other Assignee
mforns
Related Changes in Gerrit:
Repo | Branch | Lines +/-
operations/deployment-charts | master | +1 -1
operations/deployment-charts | master | +1 -2
operations/deployment-charts | master | +0 -8
operations/deployment-charts | master | +3 -47
operations/deployment-charts | master | +6 -7
operations/deployment-charts | master | +14 -0
operations/deployment-charts | master | +9 -0
operations/deployment-charts | master | +3 -0
operations/puppet | production | +1 -1
operations/deployment-charts | master | +2 -0
operations/deployment-charts | master | +9 -0
operations/deployment-charts | master | +15 -1
operations/deployment-charts | master | +71 -5
operations/deployment-charts | master | +19 -0
operations/deployment-charts | master | +19 -0
operations/deployment-charts | master | +4 -2
operations/deployment-charts | master | +3 -1
operations/deployment-charts | master | +16 -2
operations/deployment-charts | master | +4 -2
operations/puppet | production | +5 -0
operations/puppet | production | +8 -0
operations/puppet | production | +8 -0
operations/deployment-charts | master | +107 -0
operations/dns | master | +3 -0
operations/deployment-charts | master | +3 -0
operations/deployment-charts | master | +4 -0
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Migrate MediaWiki Content DAGs to Kubernetes Main instance | repos/data-engineering/airflow-dags!1163 | mforns | k8s-migration-mw-content-dags | main
Adjust start date for webrequest DAGs for K8s migration | repos/data-engineering/airflow-dags!1162 | mforns | k8s-migration-adapt-start-dates-for-webrequest-dags | main
Migrating cassandra_load_* DAGs to Kubernetes | repos/data-engineering/airflow-dags!1158 | amastilovic | feature/Migrate-Cassandra-Load-To-Kubernetes | main
Adjust migration start dates for druid loading dags | repos/data-engineering/airflow-dags!1157 | mforns | adapt-start-dates-for-druid-load-dags-k8s-migration | main
Kubernetes migration: Fix end dates for webrequest DAGs in analytics folder | repos/data-engineering/airflow-dags!1154 | mforns | fix-webrequest-end-dates | main
Migrate Wikidata item page link DAG to Kubernetes | repos/data-engineering/airflow-dags!1149 | amastilovic | feature/Migrate-Wikidata-Item-Page-Link | main
Migrate druid_load_* DAGs to Kubernetes | repos/data-engineering/airflow-dags!1147 | amastilovic | feature/Migrate-Druid-Load | main
remove webrequest metrics analyzer from main | repos/data-engineering/airflow-dags!1146 | mforns | remove-webrequest-actor-metrics-analyzer-from-main | main
Migrate canary_events DAG to Kubernetes | repos/data-engineering/airflow-dags!1143 | amastilovic | feature/Migrate-Canary-Events | main
Migrate webrequest DAGs to Kubernetes | repos/data-engineering/airflow-dags!1141 | amastilovic | feature/Migrate-Webrequest | main
Migrate mediawiki DAGs to Kubernetes | repos/data-engineering/airflow-dags!1139 | amastilovic | feature/Migrate-Mediawiki-DAGs | main
Migrate mediawiki_history_load_dag to Kubernetes | repos/data-engineering/airflow-dags!1138 | amastilovic | feature/Migrate-Mediawiki-History-Load | main
Migrate WMCS DAG to Kubernetes | repos/data-engineering/airflow-dags!1136 | amastilovic | feature/Migrate-WMCS | main
Migrate webrequest_frontend DAGs to Kubernetes | repos/data-engineering/airflow-dags!1135 | amastilovic | feature/Migrate-Webrequest-Frontend-DAG | main
Migrate unique_devices DAGs to Kubernetes | repos/data-engineering/airflow-dags!1133 | amastilovic | feature/Migrate-Unique-Devices-DAG | main
Migrate datahub/ingestion DAG to Kubernetes | repos/data-engineering/airflow-dags!1132 | amastilovic | feature/Migrate-Datahub-Ingestion-DAG | main
Migrate commons DAGs to Kubernetes | repos/data-engineering/airflow-dags!1131 | amastilovic | feature/Migrate-Commons-DAGs | main
Migrate the browser-weekly DAG to Kubernetes | repos/data-engineering/airflow-dags!1129 | amastilovic | feature/Migrate-Browser-Weekly-DAG | main
Migrate APIs DAG to Kubernetes | repos/data-engineering/airflow-dags!1127 | amastilovic | feature/Migrate-APIs-DAG | main
Migrate AQS DAG to Kubernetes | repos/data-engineering/airflow-dags!1126 | amastilovic | feature/Migrate-AQS | main

Event Timeline

Change #1126655 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: fix datahub connection host values

https://gerrit.wikimedia.org/r/1126655

Change #1126655 merged by Brouberol:

[operations/deployment-charts@master] airflow: fix datahub connection host values

https://gerrit.wikimedia.org/r/1126655

Change #1127417 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s: render /etc/refinery/event_intake_service_urls.yaml in task pods

https://gerrit.wikimedia.org/r/1127417

Change #1127418 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-main: render /etc/refinery/event_intake_service_urls.yaml in task pods

https://gerrit.wikimedia.org/r/1127418

Change #1127417 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-test-k8s: render /etc/refinery/event_intake_service_urls.yaml in task pods

https://gerrit.wikimedia.org/r/1127417

Change #1127418 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-main: render /etc/refinery/event_intake_service_urls.yaml in task pods

https://gerrit.wikimedia.org/r/1127418

Change #1127800 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas

https://gerrit.wikimedia.org/r/1127800

Change #1127800 merged by Brouberol:

[operations/deployment-charts@master] mediawiki-dumps-legacy: define networkpolicies to allow egress to the wikireplicas

https://gerrit.wikimedia.org/r/1127800

Change #1123527 abandoned by Brouberol:

[operations/deployment-charts@master] airflow: mount the hadoop configuration in the webserver and scheduler pods

Reason:

Superseded by https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124042?usp=search

https://gerrit.wikimedia.org/r/1123527

Change #1128368 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s: display a custom message explaining the migration status

https://gerrit.wikimedia.org/r/1128368

Change #1128368 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-test-k8s: display a custom message explaining the migration status

https://gerrit.wikimedia.org/r/1128368

mforns updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1154

Kubernetes migration: Fix end dates for webrequest DAGs in analytics folder

Change #1128889 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] an-druid: allow k8s pods to hit the coordinator API

https://gerrit.wikimedia.org/r/1128889

Change #1128890 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: grant an-druid access to the analytics profiles

https://gerrit.wikimedia.org/r/1128890

Change #1128890 merged by Brouberol:

[operations/deployment-charts@master] airflow: grant an-druid access to the analytics profiles

https://gerrit.wikimedia.org/r/1128890

Change #1128889 merged by Brouberol:

[operations/puppet@production] an-druid: allow k8s pods to hit the coordinator API

https://gerrit.wikimedia.org/r/1128889

Change #1128906 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] aiflow-research: set a temporary network policy to egress to an-laucher1002:8600

https://gerrit.wikimedia.org/r/1128906

Change #1128906 merged by Brouberol:

[operations/deployment-charts@master] aiflow-research: set a temporary network policy to egress to an-laucher1002:8600

https://gerrit.wikimedia.org/r/1128906

Change #1129190 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-main: increase the scheduler resources

https://gerrit.wikimedia.org/r/1129190

Change #1129191 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad

https://gerrit.wikimedia.org/r/1129191

Change #1129190 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-main: increase the scheduler resources

https://gerrit.wikimedia.org/r/1129190

Change #1129191 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-test-k8s/main: allow egress to cassandra-analytics-query-service-storage-a-eqiad

https://gerrit.wikimedia.org/r/1129191

Change #1129850 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] Fix typo in values

https://gerrit.wikimedia.org/r/1129850

Change #1129850 merged by Brouberol:

[operations/deployment-charts@master] Fix typo in values

https://gerrit.wikimedia.org/r/1129850

Change #1130079 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s: restore the initial instance settings

https://gerrit.wikimedia.org/r/1130079

Change #1130080 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-main: drop the migration info message from the UI

https://gerrit.wikimedia.org/r/1130080

Change #1130079 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-test-k8s: restore the initial instance settings

https://gerrit.wikimedia.org/r/1130079

Change #1130080 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-main: drop the migration info message from the UI

https://gerrit.wikimedia.org/r/1130080

Change #1130088 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s: restore analytics-test values

https://gerrit.wikimedia.org/r/1130088

Change #1130088 merged by Brouberol:

[operations/deployment-charts@master] airflow-test-k8s: restore analytics-test values

https://gerrit.wikimedia.org/r/1130088

Change #1130091 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] mediawiki-dumps-legacy: dumps DAGs are now going to run on airflow-test-k8s

https://gerrit.wikimedia.org/r/1130091

Change #1130091 merged by Brouberol:

[operations/deployment-charts@master] mediawiki-dumps-legacy: dumps DAGs are now going to run on airflow-test-k8s

https://gerrit.wikimedia.org/r/1130091