Migrate the airflow-analytics scheduler to Kubernetes
Closed, Resolved, Public

Event Timeline

Gehel triaged this task as High priority. Nov 25 2024, 1:33 PM

Migration notes here: https://etherpad.wikimedia.org/p/airflow-analytics-migration

We have created a list of all un-paused jobs with:

curl -X 'GET' 'http://localhost:8600/api/v1/dags?limit=200&only_active=true&paused=false' -H 'accept: application/json' | jq -r '.dags[].dag_id' > all_unpaused_dags_T380619.txt
curl -X 'GET' 'http://localhost:8600/api/v1/dags?offset=100&only_active=true&paused=false' -H 'accept: application/json' | jq -r '.dags[].dag_id' > all_unpaused_dags_T380619_2.txt
cat all_unpaused_dags_T380619.txt all_unpaused_dags_T380619_2.txt > all_unpaused_dags_T380619_combined.txt

For some reason the API wasn't working with the limit=200 parameter, so I had to use offset=100 to get the second part of the list, then concatenate the two files.
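
One likely explanation for the limit=200 behaviour, though it was not confirmed in this thread, is that the Airflow REST API caps limit at the [api] maximum_page_limit setting, which is 100 by default. A minimal sketch of paging through the results instead of issuing two manual requests is below. It assumes the same unauthenticated local endpoint as the curl commands; fetch_page and collect_dag_ids are hypothetical names chosen for illustration, not part of Airflow.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, List

# Assumption: same local API endpoint as in the curl commands above.
BASE_URL = "http://localhost:8600/api/v1/dags"


def fetch_page(offset: int, limit: int) -> Dict:
    """Fetch one page of active, un-paused DAGs from the Airflow REST API."""
    query = urllib.parse.urlencode(
        {"limit": limit, "offset": offset, "only_active": "true", "paused": "false"}
    )
    request = urllib.request.Request(
        f"{BASE_URL}?{query}", headers={"accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_dag_ids(
    fetcher: Callable[[int, int], Dict], page_size: int = 100
) -> List[str]:
    """Walk the collection page by page until a short or empty page comes back."""
    dag_ids: List[str] = []
    offset = 0
    while True:
        dags = fetcher(offset, page_size).get("dags", [])
        dag_ids.extend(dag["dag_id"] for dag in dags)
        if len(dags) < page_size:
            # A page smaller than page_size means there is nothing left to fetch.
            return dag_ids
        offset += page_size
```

Calling collect_dag_ids(fetch_page) on the scheduler host and writing the result one ID per line would give the same content as the combined file, without needing to know the number of DAGs in advance.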

I also un-paused the canary events DAG, because we want to keep this going as much as possible.

Change #1113101 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable gobblin timers on an-launcher1002

https://gerrit.wikimedia.org/r/1113101

Change #1113101 merged by Btullis:

[operations/puppet@production] Temporarily disable gobblin timers on an-launcher1002

https://gerrit.wikimedia.org/r/1113101

Change #1113108 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytics: migrate scheduler and database to Kubernetes

https://gerrit.wikimedia.org/r/1113108

Change #1113108 merged by Brouberol:

[operations/deployment-charts@master] airflow-analytics: migrate scheduler and database to Kubernetes

https://gerrit.wikimedia.org/r/1113108

Icinga downtime and Alertmanager silence (ID=9f1d8f2e-4415-45fe-b65f-85692fbd29f5) set by btullis@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: Migrating to kubernetes

an-launcher1002.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=758003f6-c030-40a2-8737-def8016b0655) set by btullis@cumin1002 for 4:00:00 on 1 host(s) and their services with reason: Migrating to kubernetes

an-launcher1002.eqiad.wmnet

Mentioned in SAL (#wikimedia-analytics) [2025-01-21T13:24:04Z] <btullis> stopped airflow services on an-launcher1002 for T380619

Change #1113136 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytics: fix DB cluster size

https://gerrit.wikimedia.org/r/1113136

Change #1113136 merged by jenkins-bot:

[operations/deployment-charts@master] airflow-analytics: fix DB cluster size

https://gerrit.wikimedia.org/r/1113136

Change #1113145 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytics: remove import configuration

https://gerrit.wikimedia.org/r/1113145

Change #1113145 merged by Brouberol:

[operations/deployment-charts@master] airflow-analytics: remove import configuration

https://gerrit.wikimedia.org/r/1113145

Change #1113149 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow-analytics: Allow access to the mw-api via service mesh

https://gerrit.wikimedia.org/r/1113149

Change #1113151 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] global_config: add the IP of the dyna proxy

https://gerrit.wikimedia.org/r/1113151

Change #1113151 merged by Brouberol:

[operations/puppet@production] global_config: add the IP of the dyna proxy

https://gerrit.wikimedia.org/r/1113151

Change #1113159 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytics: allow the egress to ATS for task pods

https://gerrit.wikimedia.org/r/1113159

Change #1113159 merged by Brouberol:

[operations/deployment-charts@master] airflow-analytics: allow the egress to ATS for task pods

https://gerrit.wikimedia.org/r/1113159

brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1040

Draft: airflow-analytics: inject the CLASSPATH env variable into the environment

Change #1113172 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: add missing airflow.worker.extra-config-volumes

https://gerrit.wikimedia.org/r/1113172

Change #1113172 merged by Brouberol:

[operations/deployment-charts@master] airflow: add missing airflow.worker.extra-config-volumes

https://gerrit.wikimedia.org/r/1113172

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1040

airflow-analytics: inject the CLASSPATH env variable into the environment

Change #1113176 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Revert "global_config: add the IP of the dyna proxy"

https://gerrit.wikimedia.org/r/1113176

Change #1113176 merged by Brouberol:

[operations/puppet@production] Revert "global_config: add the IP of the dyna proxy"

https://gerrit.wikimedia.org/r/1113176

Change #1113198 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: DRY extra volume mounts

https://gerrit.wikimedia.org/r/1113198

Follow up from today's migration conversation.

mw_content_reconcile_mw_content_history_daily and other mediawiki_content DAGs currently hit public endpoints like https://noc.wikimedia.org/conf/dblists/open.dblist to generate their dynamic tasks.

Do we know of an internal equivalent for https://noc.wikimedia.org?

Change #1113198 merged by Brouberol:

[operations/deployment-charts@master] airflow: DRY extra volume mounts

https://gerrit.wikimedia.org/r/1113198

Change #1115855 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: mirror datahub configuration from airflow hosts

https://gerrit.wikimedia.org/r/1115855

Change #1115855 merged by Brouberol:

[operations/deployment-charts@master] airflow: mirror datahub configuration from airflow hosts

https://gerrit.wikimedia.org/r/1115855

BTullis subscribed.

I'm removing myself as the assignee of this ticket, as I'll be out on leave for a couple of weeks. Someone else may claim the ticket in the meantime.

Do we know of an internal equivalent for https://noc.wikimedia.org?

Responding for posterity's sake. We have deployed a service mesh envoy proxy pod running alongside airflow. To reach https://noc.wikimedia.org from within Kubernetes, send requests to http://envoy:6509 instead. @amastilovic has defined a helper at https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/c2cc935e38759409ce9a87a77ad5ae25222af09f/wmf_airflow_common/util.py#L201 to help us with the URL mapping.
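
For readers who want a feel for what such a URL mapping might look like, here is a rough sketch. It is not the code in wmf_airflow_common/util.py: the function name to_mesh_url and the MESH_ENDPOINTS table are hypothetical, and the table contains only the single mapping mentioned in this comment. Whether the proxy needs anything beyond the rewritten URL is not covered here.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: only the one mapping mentioned in this thread is listed.
MESH_ENDPOINTS = {"noc.wikimedia.org": "envoy:6509"}


def to_mesh_url(url: str) -> str:
    """Rewrite a public URL to its in-cluster envoy endpoint, if one is known."""
    parts = urlsplit(url)
    mesh_host = MESH_ENDPOINTS.get(parts.hostname or "")
    if mesh_host is None:
        # No mapping known for this host: leave the URL untouched.
        return url
    # The proxy is reached over plain HTTP; path and query are preserved.
    return urlunsplit(("http", mesh_host, parts.path, parts.query, parts.fragment))
```

With this sketch, to_mesh_url("https://noc.wikimedia.org/conf/dblists/open.dblist") yields http://envoy:6509/conf/dblists/open.dblist, while URLs for hosts without a mapping are returned unchanged.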

NICE! Thank you both for this. It helps the dev experience a lot.

Change #1113149 abandoned by Btullis:

[operations/deployment-charts@master] airflow-analytics: Allow access to the mw-api via service mesh

Reason:

No longer required.

https://gerrit.wikimedia.org/r/1113149