
Use the KubernetesPodOperator for tasks that require access to refinery python scripts
Closed, Resolved · Public

Description

Whilst working on T380621: Migrate the airflow-search scheduler to Kubernetes, we observed that certain DAG tasks were attempting to use the BashOperator to call refinery scripts.

The specific case is that drop_old_data_daily was trying to run:

PYTHONPATH=/srv/deployment/analytics/refinery/python /usr/bin/python3 /srv/deployment/analytics/refinery/bin/refinery-drop-older-than

This worked when the scheduler was running on an-airflow1005 and was configured to use the LocalExecutor, because refinery is deployed to an-airflow1005. However, these scripts and their python libraries are not available in our airflow images.

We have decided that we would like to implement this by using the KubernetesPodOperator with a custom refinery image.

The image has been created in T383417: Create a container image for analytics/refinery to be used with Airflow tasks and is now available for use.

We will probably want to use a pod_template_file in the same way as the KubernetesExecutor does, although we will not need the same airflow configuration files.

We will need the following ConfigMap objects to be available in the pods (a sketch of how such a template might be generated follows the list):

  • Hadoop configuration files
  • Kerberos credential cache
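
For illustration only, here is a minimal Python sketch (not the deployed chart) of how such a pod template file could be generated with the kubernetes client. The image tag, ConfigMap names (hadoop-configuration, kerberos-credential-cache), and mount paths are assumptions:

```python
# A minimal sketch, NOT the deployed chart: generate a pod_template_file
# that mounts the two ConfigMaps. Image tag, ConfigMap names, and mount
# paths are illustrative assumptions.
import yaml
from kubernetes.client import ApiClient, models as k8s

pod = k8s.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=k8s.V1ObjectMeta(name="refinery-task"),
    spec=k8s.V1PodSpec(
        containers=[
            k8s.V1Container(
                name="base",  # the KubernetesPodOperator expects a container named "base"
                image="docker-registry.wikimedia.org/repos/data-engineering/refinery:latest",  # assumed tag
                volume_mounts=[
                    k8s.V1VolumeMount(name="hadoop-config", mount_path="/etc/hadoop/conf", read_only=True),
                    k8s.V1VolumeMount(name="krb-ccache", mount_path="/var/run/kerberos", read_only=True),
                ],
            ),
        ],
        volumes=[
            k8s.V1Volume(name="hadoop-config", config_map=k8s.V1ConfigMapVolumeSource(name="hadoop-configuration")),
            k8s.V1Volume(name="krb-ccache", config_map=k8s.V1ConfigMapVolumeSource(name="kerberos-credential-cache")),
        ],
    ),
)

# Serialise the typed model into the YAML file referenced by pod_template_file.
with open("refinery_pod_template.yaml", "w") as f:
    yaml.safe_dump(ApiClient().sanitize_for_serialization(pod), f)
```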

We can start by making a suitable DAG from scratch in the test cluster.

After we have shown that it works, we may wish to modify the python_script_executor, or otherwise modify the DAGs themselves to use the new method.

Note that an alternative to the pod template file is described in the Airflow provider documentation: How to use cluster ConfigMaps, Secrets, and Volumes with Pod.
This option may also be worth exploring.
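
As a rough sketch of that alternative (under the same assumptions about image tag, ConfigMap name, and mount path as above), the cluster objects can be passed straight to the operator's volumes/volume_mounts parameters instead of via a template file:

```python
# Recent versions of the cncf provider expose the operator under operators.pod;
# older versions use operators.kubernetes_pod.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

# Mount a cluster ConfigMap directly through operator parameters,
# rather than referencing a pod_template_file.
hadoop_conf_volume = k8s.V1Volume(
    name="hadoop-config",
    config_map=k8s.V1ConfigMapVolumeSource(name="hadoop-configuration"),  # assumed name
)
hadoop_conf_mount = k8s.V1VolumeMount(
    name="hadoop-config", mount_path="/etc/hadoop/conf", read_only=True
)

drop_old_data = KubernetesPodOperator(
    task_id="drop_old_data",
    name="refinery-drop-older-than",
    image="docker-registry.wikimedia.org/repos/data-engineering/refinery:latest",  # assumed tag
    cmds=["/opt/refinery/bin/refinery-drop-older-than"],
    volumes=[hadoop_conf_volume],
    volume_mounts=[hadoop_conf_mount],
)
```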

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
airflow: add apache-airflow-providers-cncf-kubernetes dependency | repos/data-engineering/airflow-dags!1036 | brouberol | T383430-hotfix | main
kubernetes_pod_operator: rename the default template name | repos/data-engineering/airflow-dags!1029 | brouberol | T383430-rename-default-tpl | main
Execute refinery scripts in a k8s pod. | repos/data-engineering/airflow-dags!1028 | gmodena | data-retention-with-k8s-operator | main
Set the WORKDIR to a directory writeable by the user (2) | repos/data-engineering/refinery!6 | brouberol | T383430 | main
Set the WORKDIR to a directory writeable by the user | repos/data-engineering/refinery!5 | brouberol | T383430 | main
test_k8s: define DAG that displays partitions that would be deleted from HDFS | repos/data-engineering/airflow-dags!1024 | brouberol | T383430 | main
install ps in the docker image | repos/data-engineering/refinery!4 | brouberol | T383430 | main
Define a helper class for starting a pod on Kubernetes | repos/data-engineering/airflow-dags!1021 | btullis | kubernetes_pod_operator_test | main
Add an operator label to the KubernetesPodOperator task | repos/data-engineering/airflow-dags!1018 | btullis | add_label_refine_test_kpo | main
Remove schedule for refine_test_kpo DAG | repos/data-engineering/airflow-dags!1017 | btullis | refinery_test_fix_start_date | main
Add start_date to refinery_test_kpo DAG | repos/data-engineering/airflow-dags!1016 | btullis | refinery_test_add_start_date | main
Move the refinery test DAG to the correct instance | repos/data-engineering/airflow-dags!1014 | btullis | move_test_refine_dag | main
Add a test DAG for the KubernetesPodOperator and refinery | repos/data-engineering/airflow-dags!1013 | btullis | kpo_analytics_test | main

Event Timeline

BTullis triaged this task as High priority.
BTullis moved this task from Incoming to Quarterly Goals on the Data-Platform-SRE board.

I have created a test DAG to try running /opt/refinery/bin/refinery-drop-older-than --older 90 --allowed-interval 1 -v --database=wmf --tables=webrequest with the KubernetesPodOperator.
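
For reference, a minimal sketch of what such a test task might look like. The DAG id, image tag, and scheduling arguments are assumptions, and the import path assumes a recent version of the cncf provider:

```python
import pendulum
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="refine_test_kpo",  # assumed id
    start_date=pendulum.datetime(2025, 1, 1, tz="UTC"),
    schedule=None,  # triggered manually while testing
    catchup=False,
) as dag:
    drop_old_webrequest = KubernetesPodOperator(
        task_id="drop_old_webrequest",
        name="refinery-drop-older-than",
        image="docker-registry.wikimedia.org/repos/data-engineering/refinery:latest",  # assumed tag
        cmds=["/opt/refinery/bin/refinery-drop-older-than"],
        arguments=[
            "--older", "90",
            "--allowed-interval", "1",
            "-v",
            "--database=wmf",
            "--tables=webrequest",
        ],
    )
```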

I would like to see what pod spec is generated for the task. My understanding from the documentation is that the task will inherit the pod spec of the KubernetesExecutor, with everything preserved apart from those parameters that have been overridden.

So in this case it should have the required kerberos credential cache already mounted.
However, if this assumption is correct, then it may also have the airflow configuration files, which would be unnecessary for this operation.

I will check whether my assumption is correct and what the pod spec looks like, before working out what else should be trimmed.
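
One way to check this without launching anything, assuming a recent cncf provider and the drop_old_webrequest task object from the sketch above, is build_pod_request_obj(), which renders the pod the operator would submit:

```python
# build_pod_request_obj() renders the V1Pod the operator would create,
# including anything merged in from a pod template, without running it.
pod = drop_old_webrequest.build_pod_request_obj()
print(pod.to_str())
```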

Change #1110883 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Allow specific task pods to access the kube-api

https://gerrit.wikimedia.org/r/1110883

Change #1110883 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Allow specific task pods to access the kube-api

https://gerrit.wikimedia.org/r/1110883

Change #1111206 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use the existing labels for kubernetes and spark operators

https://gerrit.wikimedia.org/r/1111206

Change #1111206 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use the existing labels for kubernetes and spark operators

https://gerrit.wikimedia.org/r/1111206

Change #1111278 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Add a separate networkpolicy for task-pods to access k8s API

https://gerrit.wikimedia.org/r/1111278

Change #1111278 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Add a separate networkpolicy for task-pods to access k8s API

https://gerrit.wikimedia.org/r/1111278

Change #1111613 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] airflow: Use an operator label to identify k8s client task pods

https://gerrit.wikimedia.org/r/1111613

Change #1111619 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: define a pod template to be used by the KubernetesPodOperator

https://gerrit.wikimedia.org/r/1111619

Change #1111613 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Use an operator label to identify k8s client task pods

https://gerrit.wikimedia.org/r/1111613

Change #1111619 merged by Brouberol:

[operations/deployment-charts@master] airflow: define pod templates enabling creating Pods from a task

https://gerrit.wikimedia.org/r/1111619

Reassigning to @brouberol to close the loop on this. We now have a working implementation of the KubernetesPodOperator loading the refinery image.
So we are now close to being able to run many kinds of refinery jobs with this operator.

brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1024

Draft: analytics_test: define DAG that displays partitions that would be deleted from HDFS

Change #1111938 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: move the serviceAccountName directly under pod.spec

https://gerrit.wikimedia.org/r/1111938

Change #1111938 merged by Brouberol:

[operations/deployment-charts@master] airflow: move the serviceAccountName directly under pod.spec

https://gerrit.wikimedia.org/r/1111938

> After we have shown that it works, we may wish to modify the python_script_executor, or otherwise modify the DAGs themselves to use the new method.

Just checked in with @brouberol; I'll take a stab at this while the ops side of the work is ongoing, so we can integrate soon-ish. I'll keep you in the loop with changes.

Change #1112053 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: refactor/DRY the volume/volumeMounts accross containers

https://gerrit.wikimedia.org/r/1112053

Change #1112057 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: define a K8sPodOperator pod template for pods needing access to hadoop

https://gerrit.wikimedia.org/r/1112057

Change #1112052 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: deploy hive config under both hadoop and spark config dirs

https://gerrit.wikimedia.org/r/1112052

Change #1112052 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: deploy hive config under both hadoop and spark config dirs

https://gerrit.wikimedia.org/r/1112052

Change #1112053 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: refactor/DRY the volume/volumeMounts accross containers

https://gerrit.wikimedia.org/r/1112053

Change #1112057 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: define a K8sPodOperator pod template for pods needing access to hadoop

https://gerrit.wikimedia.org/r/1112057

Change #1112210 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: hotfix: do not render empty volumes/volumeMounts blocks

https://gerrit.wikimedia.org/r/1112210

Change #1112210 merged by Brouberol:

[operations/deployment-charts@master] airflow: hotfix: do not render empty volumes/volumeMounts blocks

https://gerrit.wikimedia.org/r/1112210

Change #1112213 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: hotfix: fix broken indentation in the pod template configmaps

https://gerrit.wikimedia.org/r/1112213

Change #1112213 merged by Brouberol:

[operations/deployment-charts@master] airflow: hotfix: fix broken indentation in the pod template configmaps

https://gerrit.wikimedia.org/r/1112213

Change #1112723 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: de-indent environment variables

https://gerrit.wikimedia.org/r/1112723

Change #1112724 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: refactor env injection

https://gerrit.wikimedia.org/r/1112724

Change #1112723 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: de-indent environment variables

https://gerrit.wikimedia.org/r/1112723

Change #1112724 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: refactor env injection

https://gerrit.wikimedia.org/r/1112724

brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1024

test_k8s: define DAG that displays partitions that would be deleted from HDFS

All done! We can now run refinery commands in a container using KubernetesPodOperator. See example.

Change #1114745 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] airflow: Update the default package version

https://gerrit.wikimedia.org/r/1114745

Change #1114745 merged by Btullis:

[operations/puppet@production] airflow: Update the default package version

https://gerrit.wikimedia.org/r/1114745