Whilst working on T380621: Migrate the airflow-search scheduler to Kubernetes, we observed that certain DAG tasks were using the BashOperator to call refinery scripts.
The specific case is that drop_old_data_daily was trying to run:
PYTHONPATH=/srv/deployment/analytics/refinery/python /usr/bin/python3 /srv/deployment/analytics/refinery/bin/refinery-drop-older-than
This worked when the scheduler was running on an-airflow1005 with the LocalExecutor, because refinery is deployed to that host. However, these scripts and their Python libraries are not available in our Airflow images.
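For reference, the current task boils down to a BashOperator call along these lines (the task_id and surrounding DAG definition are reconstructions; only the command itself is taken from the DAG):

```
from airflow.operators.bash import BashOperator

# Hypothetical reconstruction of the existing task; only the command is known.
drop_old_data = BashOperator(
    task_id="drop_old_data_daily",
    bash_command=(
        "PYTHONPATH=/srv/deployment/analytics/refinery/python "
        "/usr/bin/python3 /srv/deployment/analytics/refinery/bin/refinery-drop-older-than"
    ),
)
```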
We have decided that we would like to implement this by using the KubernetesPodOperator with a custom refinery image.
The image has been created in T383417: Create a container image for analytics/refinery to be used with Airflow tasks and is now available for use.
We will probably want to use a pod_template_file in the same way as the executor does, although we will not need the same Airflow configuration files.
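As a first sketch of what the replacement task could look like (the namespace, pod template path, in-image script locations, and image reference are all placeholders to be confirmed against the T383417 image and our cluster configuration; the import path also varies with the cncf-kubernetes provider version):

```
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

drop_old_data = KubernetesPodOperator(
    task_id="drop_old_data_daily",
    name="refinery-drop-older-than",
    namespace="airflow-search",  # placeholder namespace
    image="<refinery image from T383417>",  # placeholder image reference
    pod_template_file="/etc/airflow/pod_templates/refinery.yaml",  # placeholder path
    cmds=["/usr/bin/python3"],
    arguments=["/srv/deployment/analytics/refinery/bin/refinery-drop-older-than"],
    env_vars={"PYTHONPATH": "/srv/deployment/analytics/refinery/python"},
)
```

The pod_template_file would carry the pod-level settings (volumes, service account, resource defaults), while the per-task details stay in the DAG.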
We will need the following ConfigMap objects to be available in the pods (a mounting sketch follows this list):
- Hadoop configuration files
- Kerberos credential cache
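A minimal sketch of how those two ConfigMaps could be wired in, assuming they are named hadoop-config and kerberos-ccache and mounted at conventional paths (all names, paths, and the KRB5CCNAME value are placeholders). The same volume definitions could equally live in the pod_template_file rather than being passed to the operator:

```
from kubernetes.client import models as k8s

# Placeholder ConfigMap names and mount paths; the real objects will be
# defined as part of our Kubernetes deployment.
volumes = [
    k8s.V1Volume(
        name="hadoop-config",
        config_map=k8s.V1ConfigMapVolumeSource(name="hadoop-config"),
    ),
    k8s.V1Volume(
        name="kerberos-ccache",
        config_map=k8s.V1ConfigMapVolumeSource(name="kerberos-ccache"),
    ),
]

volume_mounts = [
    k8s.V1VolumeMount(name="hadoop-config", mount_path="/etc/hadoop/conf", read_only=True),
    k8s.V1VolumeMount(name="kerberos-ccache", mount_path="/var/run/kerberos", read_only=True),
]

# These would then be passed to the task, e.g.
# KubernetesPodOperator(..., volumes=volumes, volume_mounts=volume_mounts,
#                       env_vars={"HADOOP_CONF_DIR": "/etc/hadoop/conf",
#                                 "KRB5CCNAME": "FILE:/var/run/kerberos/krb5cc"})
```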
We can start by making a suitable DAG from scratch in the test cluster.
After we have shown that it works, we may wish to modify the python_script_executor, or otherwise modify the DAGs themselves, to use the new method.
Note that an alternative to the pod template file, passing cluster ConfigMaps, Secrets, and Volumes directly to the operator (as in the sketch above), is described in "How to use cluster ConfigMaps, Secrets, and Volumes with Pod". This option may also be useful to explore.