
Create a container image for analytics/refinery to be used with Airflow tasks
Closed, Resolved (Public)

Description

During the work on T362788: Migrate Airflow to the dse-k8s cluster, we discovered that a number of tasks currently using the BashOperator attempt to run scripts that are part of analytics/refinery.

These tasks do not work in the Kubernetes environment, because refinery is deployed only to certain target hosts (1, 2, 3) and to HDFS.

One option is to create a refinery job artifact using conda and WMF Workflow Utils.

Another option available to us is to create a refinery container image. This will contain all of the scripts and libraries required, as well as the underlying CLI utilities such as hdfs, hive, yarn and mysql.

We will then be able to launch this using the KubernetesPodOperator from within a DAG.
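
As a rough illustration of launching a refinery script with the KubernetesPodOperator, the sketch below builds the operator arguments for one script. It is not taken from an existing DAG: the helper names (refinery_pod_kwargs, make_refinery_task), the task id and the pod naming rule are hypothetical, and the import path assumes a recent apache-airflow-providers-cncf-kubernetes release. The image reference is the tag shown in the docker run transcript in this task.

```python
# Sketch only: helper names, task id and pod naming are illustrative assumptions.

# Image reference as published on the registry (tag from the docker run transcript).
REFINERY_IMAGE = (
    "docker-registry.wikimedia.org/repos/data-engineering/refinery:"
    "2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5"
)


def refinery_pod_kwargs(task_id, command):
    """Build KubernetesPodOperator keyword arguments for one refinery script."""
    if not command:
        raise ValueError("command must contain at least the script name")
    return {
        "task_id": task_id,
        # Kubernetes object names may not contain underscores.
        "name": task_id.replace("_", "-"),
        "image": REFINERY_IMAGE,
        # First element is the executable, the rest are its arguments.
        "cmds": [command[0]],
        "arguments": list(command[1:]),
        "get_logs": True,
    }


def make_refinery_task(task_id, command, dag=None):
    """Create the operator. Airflow is imported lazily so this file loads without it."""
    # Import path assumed for recent provider versions; older ones differ.
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    return KubernetesPodOperator(dag=dag, **refinery_pod_kwargs(task_id, command))


# Arguments for a pod that runs one of the refinery scripts.
example = refinery_pod_kwargs(
    "drop_old_partitions", ["refinery-drop-older-than", "--help"]
)
```

Inside a DAG file, make_refinery_task("drop_old_partitions", [...], dag=dag) would stand in for the equivalent BashOperator task. A real deployment would probably also need a namespace, credentials and Hadoop configuration, none of which are covered here.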

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
Fix the refinery python module path | repos/data-engineering/refinery!3 | btullis | fix_refinery_path | main
Add the PYTHONPATH and update the PATH variables | repos/data-engineering/refinery!2 | btullis | add_env_vars | main
Add data-engineering/refinery to the trusted runners | repos/releng/gitlab-trusted-runner!106 | btullis | add_refinery | main
Add data-engineering/refinery to the trusted runners | repos/releng/gitlab-trusted-runner!104 | btullis | main | main

Event Timeline

BTullis triaged this task as High priority.

This image is now published and usable.

btullis@marlin:~$ docker run -it docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
Unable to find image 'docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5' locally
2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5: Pulling from repos/data-engineering/refinery
7b45c6d330c8: Already exists 
ef2ebb48f9ce: Already exists 
526e23257365: Already exists 
a8d6e7c24a3f: Already exists 
c2665232a772: Already exists 
4f4fb700ef54: Already exists 
56c4fcf0234b: Pull complete 
b3542690eb1a: Pull complete 
Digest: sha256:2952e9d4eb2ab6e7c49c1f0cec5a6fe77fc30af012a1f8ed52942954fec4b9c0
Status: Downloaded newer image for docker-registry.wikimedia.org/repos/data-engineering/refinery:2025-01-10-165424-6efbd5adbcf50d11d3be1cd20542308cad6e7ae5
runuser@6e65fef54633:/opt/refinery$ refinery-drop-older-than --help
Drops Hive partitions and removes data directories older than a threshold.

Usage: refinery-drop-older-than [options]