This Phab task describes proposed approaches to migrating the analytics Airflow DAGs to the analytics Airflow instance deployed in Kubernetes.
Terminology
- an-launcher1002 - the existing production Airflow instance, managed by Puppet.
- airflow.wikimedia.org - the main Kubernetes Airflow deployment running analytics DAGs; the target of this migration.
- main-k8s - equivalent to airflow.wikimedia.org.
- test-k8s - a Kubernetes Airflow deployment we can use for testing.
Known issues
(Some of these issues have been identified during prior testing of search DAGs and documented here)
- The Hive CLI is not available. If a DAG uses a BashOperator to run a Hive command line (e.g. to run CREATE TABLE commands), it will not work in the Kubernetes deployment of Airflow.
- The refinery-drop-older-than script is not available.
- Access to all external links must go through an Envoy sidecar host/port running alongside the Airflow Kubernetes pods. In short, due to the networking security and firewalls that Kubernetes pods run behind, accessing external URLs is prohibited by default. Since there are a number of scenarios where Airflow DAGs need to reach external URLs (sourcing HQL files hosted on gitlab.wikimedia.org, accessing meta.wikimedia.org web services, etc.), such DAGs will require modification so that their URLs reference a pre-determined http://envoy:port endpoint instead of the original hostname/port combination.
This might require an additional Airflow library utility that would automatically convert a URL depending on whether Airflow is running within a Kubernetes context or not. Note that we already have a function that provides that information, so the implementation should be relatively trivial.
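Such a utility could look something like the sketch below. The Envoy port mapping, function names, and the environment-based Kubernetes check are illustrative assumptions only; the real implementation would reuse the existing context-detection function and the actual Envoy sidecar listener configuration.

```python
import os
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping of external hosts to local Envoy listener ports;
# the real ports would come from the Envoy sidecar configuration.
ENVOY_PORTS = {
    "gitlab.wikimedia.org": 6012,
    "meta.wikimedia.org": 6013,
}

def running_in_kubernetes() -> bool:
    # Stand-in for the existing helper mentioned above. Kubernetes sets
    # KUBERNETES_SERVICE_HOST inside every pod by default.
    return "KUBERNETES_SERVICE_HOST" in os.environ

def envoy_url(url: str) -> str:
    """Rewrite an external URL to its Envoy sidecar endpoint when running
    in Kubernetes; return it unchanged otherwise."""
    if not running_in_kubernetes():
        return url
    parts = urlsplit(url)
    port = ENVOY_PORTS.get(parts.hostname)
    if port is None:
        return url  # no Envoy listener configured for this host
    return urlunsplit(("http", f"envoy:{port}", parts.path, parts.query, parts.fragment))
```

Outside Kubernetes, `envoy_url("https://gitlab.wikimedia.org/some/file.hql")` would pass through unchanged; inside a pod, it would be rewritten to `http://envoy:6012/some/file.hql`.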
Testing in test-k8s
One important consideration when testing DAGs in test-k8s is to ensure that all outputs go to a designated temporary namespace/table/DB, so that nothing written during testing interferes with production data or results.
- If a DAG runs a SQL query, ensure that the destination_table parameter is replaced with a pre-determined temporary table
- If a DAG is exporting to Apache Druid, ensure that the target druid_datasource is a temporary one
- Etc.
Most of these configuration settings are available as DAG properties/variables, and so the goal should be to limit changes to these properties and refrain from modifying the Airflow operators or HQL files directly. If that's not possible, then we should take the opportunity to fix the issue and extract the destination paths into a DAG variable.
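As a minimal sketch of that pattern (all names and values here are illustrative assumptions, not the actual DAG configuration; the real DAGs would read these through their existing DAG properties/variables), the destination settings can be selected per instance so that test-k8s never writes to production locations:

```python
import os

# Illustrative production vs. test output settings; a plain dict stands in
# for the DAG properties/variables the real code would use.
PROD_OUTPUTS = {
    "destination_table": "wmf.example_metrics",
    "druid_datasource": "example_metrics",
}
TEST_OUTPUTS = {
    "destination_table": "tmp_test.example_metrics",
    "druid_datasource": "test_example_metrics",
}

def dag_outputs() -> dict:
    """Return test output locations when running on the test-k8s instance.

    AIRFLOW_INSTANCE_NAME is a hypothetical environment marker; the real
    deployment would use whatever instance-identification mechanism exists.
    """
    if os.environ.get("AIRFLOW_INSTANCE_NAME") == "test-k8s":
        return TEST_OUTPUTS
    return PROD_OUTPUTS
```

Keeping the switch in one place like this means operators and HQL files stay untouched, which matches the goal stated above.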
General approach
While tests performed on a Kubernetes-provided Airflow instance so far have been fairly extensive and have covered most of the existing usage patterns, we should nevertheless ensure that each DAG we are migrating has successfully run at least once before we commit to migrating it to production.
The sheer number of DAGs that need to be put through a test run in a Kubernetes environment dictates the migration strategies we could choose from:
1. Keep an-launcher1002 and airflow.wikimedia.org (main-k8s) running in parallel, migrate DAGs one-by-one
Steps to perform:
- Migrate the Airflow DB and logs to Kubernetes
- For each DAG, one by one: test it in test-k8s, apply the necessary modifications, then migrate it to main-k8s.
- Once all DAGs have been migrated, turn an-launcher1002 off.
Pro
- Migration is not overwhelming, can be stretched over a longer period of time
- SRE involved mostly in the beginning of the effort, later work can be done by DE alone
Contra
- The longer this migration takes, the greater the log and Airflow DB discrepancy between an-launcher1002 and main-k8s. Logs and DB changes generated after the initial migration would be lost.
- Risk of migration becoming only partially successful (i.e. some DAGs migrated, some DAGs encounter a blocker)
1a. Keep an-launcher1002 and main-k8s running in parallel, start main-k8s from scratch, migrate DAGs one-by-one
Steps to perform:
- Deploy main-k8s to Kubernetes from scratch - the Airflow DB and logs are initialized empty and all job history is lost.
- For each DAG, one by one: test it in test-k8s, apply the necessary modifications, then migrate it to main-k8s.
- Once all DAGs have been migrated, keep an-launcher1002 alive but with all migrated DAGs paused/turned off.
Pro
- Migration is not overwhelming, can be stretched over a longer period of time
- SRE involved mostly in the beginning of the effort, later work can be done by DE alone
- DAG run history is preserved, though split into two realms: "before migration" on the dormant an-launcher1002 and "after migration" on main-k8s
Contra
- Risk of migration becoming only partially successful (i.e. some DAGs migrated, some DAGs encounter a blocker)
2. Prepare all DAGs for migration, turn an-launcher1002 off, migrate, turn main-k8s on
Steps to perform:
- Test all DAGs in test-k8s. Do not advance migration until all blocking issues are resolved.
- Turn an-launcher1002 off (stop production).
- Migrate Airflow DB and job logs to main-k8s.
- Migrate tested DAGs from test-k8s into main-k8s.
Pro
- All testing would be performed, and any related issues identified, beforehand, reducing the risk of a partially successful migration
- No discontinuity of Airflow DB and job logs
Contra
- Developer effort concentrated into a much smaller timeframe
- Requires the combined presence of both SRE and DE folks, especially for the work that starts at the point of turning the existing Airflow instance off and for the subsequent tasks. We should therefore prefer folks in close-enough time zones, and pick a day/time when they are all available.
- Unlikely, but if some issues only surface after migrating to main-k8s, there would be much greater urgency to fix them, since production would be stopped