We have been using the Airflow build process documented at https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Upgrading to build the base images for our Airflow instances for a while now. However, with the poetry refactor of the airflow-dags repository, we want to test the functionality of images built with this new method as we plan future operations.
For this test we shall build the image from the current repository and deploy it on airflow-test-k8s to check the functionality of the image and of Airflow overall.
Event Timeline
Change #1199819 had a related patch set uploaded (by Stevemunene; author: Stevemunene):
[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build
Change #1199819 merged by jenkins-bot:
[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build
Deployed this, but the kerberos pod is failing with an error:
stevemunene@deploy2002:~$ kubectl logs -f airflow-kerberos-776ff68846-6gcc9
Traceback (most recent call last):
File "/home/app/.local/bin/airflow", line 5, in <module>
from airflow.__main__ import main
ModuleNotFoundError: No module named 'airflow'
airflow-kerberos-776ff68846-6gcc9 0/1 CrashLoopBackOff 4 (14s ago) 2m30s
airflow-webserver-854cb66975-v66ml 0/2 Init:CrashLoopBackOff 4 (17s ago) 2m39s
airflow-kerberos-776ff68846-6gcc9 0/1 Error 5 (92s ago) 3m48s
airflow-webserver-854cb66975-v66ml 0/2 Init:Error 5 (89s ago) 3m51s
airflow-scheduler-d77cb95b7-hhmd5 0/1 Error 5 (87s ago) 3m51s
airflow-scheduler-d77cb95b7-hhmd5 0/1 CrashLoopBackOff 5 (7s ago) 3m58s
airflow-kerberos-776ff68846-6gcc9 0/1 CrashLoopBackOff 5 (13s ago) 3m59s
airflow-webserver-854cb66975-v66ml 0/2 Init:CrashLoopBackOff 5 (15s ago) 4m
Looking into this...
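The ModuleNotFoundError above usually means the interpreter's sys.path (seeded from the PYTHONPATH env var at startup) does not include the directory where the airflow package lives. A minimal, self-contained sketch of the mechanism (the mypkg name and temp directory are illustrative, not from the deployment):

```python
import os
import sys
import tempfile

# Create a throwaway package to stand in for an installed package.
pkg_root = tempfile.mkdtemp()
os.makedirs(os.path.join(pkg_root, "mypkg"))
with open(os.path.join(pkg_root, "mypkg", "__init__.py"), "w") as f:
    f.write("main = lambda: 'ok'\n")

# Without the package root on sys.path, the import fails,
# just like `from airflow.__main__ import main` did in the pod.
try:
    import mypkg  # noqa: F401
except ModuleNotFoundError as e:
    print(f"before: {e}")

# Adding the directory (which is what PYTHONPATH does at startup) fixes it.
sys.path.insert(0, pkg_root)
import mypkg
print(f"after: {mypkg.main()}")  # → after: ok
```

This is the direction the pythonpath chart fix takes: make sure the paths baked into the image line up with what the entrypoint expects.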
Change #1200339 had a related patch set uploaded (by Stevemunene; author: Stevemunene):
[operations/deployment-charts@master] airflow: Update the pythonpath
Change #1200339 merged by jenkins-bot:
[operations/deployment-charts@master] airflow: Update the pythonpath
Since we still have some dumps running, we shall continue the tests in a devenv and proceed on the test-k8s instance once everything is verified.
Created a devenv to test the recent flask-appbuilder upgrade, which should solve some of the initial issues we had accessing the connection/list/ and variable/list/ pages.
To confirm, these pages are currently accessible using the default image, as shown below.
To test the changes I created a devenv with the latest image tag containing them, specifying image, executor_pod_image, version, and executor_pod_image_version:
stevemunene@deploy2002:~$ airflow-devenv create --dags-folder test_k8s \
  --branch=test_airflow_dags_build \
  --set app.image=repos/data-engineering/airflow-dags \
  --set app.version=airflow-2.10.5-py3.11-2025-11-04-144729-15a48fbbf17c764c194e0acf87df0e389226d59d@sha256:de985e6946b2a66a49a567700e92bdbf5e1a2b744d84cfa14d65e6c7f8f6cb69 \
  --set app.executor_pod_image=repos/data-engineering/airflow-dags \
  --set app.executor_pod_image_version=airflow-2.10.5-py3.11-2025-11-04-144729-15a48fbbf17c764c194e0acf87df0e389226d59d@sha256:de985e6946b2a66a49a567700e92bdbf5e1a2b744d84cfa14d65e6c7f8f6cb69
Airflow development environment provisioning parameters:
- instance name: dev-stevemunene
- pulled git branch: test_airflow_dags_build
- airflow-dags folder: test_k8s
Installation progress:
- Creating PG database dev_stevemunene ✅
- Installing airflow dev environment dev-stevemunene ✅
- Waiting for the kerberos pod to start .......... ✅
Password for stevemunene@WIKIMEDIA:
- Kerberos credentials successfully setup ✅
- Waiting for the scheduler pod to be ready .......✅
- Forcing DAG serialization ✅
- Waiting for the webserver pod to be ready ✅
- Creating admin user (username: admin, password: admin) ✅
Your airflow development environment is fully set up. You can now execute `airflow-devenv expose dev-stevemunene` to expose the UI on your local development host.
Then exposed it to my local workstation:
stevemunene@deploy2002:~$ airflow-devenv expose dev-stevemunene
On your workstation, run the following command:
ssh -N deployment.eqiad.wmnet -L 127.0.0.1:8080:127.0.0.1:60587
then open http://localhost:8080
Forwarding from 127.0.0.1:60587 -> 8080
However, this oddly gives connection refused errors; looking into the logs to find out why.
➜ ~ ssh -N deploy2002.codfw.wmnet -L 127.0.0.1:8080:127.0.0.1:34331
channel 1: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
The issue was with the port; re-exposed and accessed it successfully:
stevemunene@deploy2002:~$ airflow-devenv expose dev-stevemunene
On your workstation, run the following command:
ssh -N deployment.eqiad.wmnet -L 127.0.0.1:8080:127.0.0.1:35731
then open http://localhost:8080
Forwarding from 127.0.0.1:35731 -> 8080
On the local host:
➜ ~ ssh -N deploy2002.codfw.wmnet -L 127.0.0.1:8080:127.0.0.1:35731
The airflow image is running as expected, with the previously inaccessible pages now reachable: http://localhost:8080/connection/list/ and http://localhost:8080/variable/list/.
Ran the DAG test_email_notification (a DAG meant to fail, to test email alerting), which also worked as expected.
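The idea behind such a deliberately-failing DAG can be sketched without Airflow: a task that always raises, wired to a failure callback that stands in for the email alert. All names below are illustrative, not the actual DAG code:

```python
alerts = []

def on_failure_callback(context):
    """Stand-in for Airflow's email notification on task failure."""
    alerts.append(f"task failed: {context['exception']}")

def failing_task():
    # Deliberately raise, like the test DAG, to exercise the alert path.
    raise RuntimeError("intentional failure")

try:
    failing_task()
except RuntimeError as exc:
    on_failure_callback({"exception": exc})

print(alerts)  # → ["task failed: intentional failure"]
```

In a real Airflow DAG the equivalent wiring is the operator's email/on-failure settings; the point of the test is that the failure path, not the happy path, produces the notification.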
We should be okay to move testing to airflow-test-k8s once more.
Change #1202106 had a related patch set uploaded (by Stevemunene; author: Stevemunene):
[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build
Change #1202106 merged by Brouberol:
[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build
I've deployed the new image to airflow-test-k8s and manually kicked off 5 DAGs, all of which ran successfully. I'm going to let things simmer during the weekend, and roll the change out on Monday if everything ran fine.
Change #1203379 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: migrate to the image defined in the airflow-dags repo
Change #1203380 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: deploy the image tested on test-k8s to all instances
Change #1203379 merged by jenkins-bot:
[operations/deployment-charts@master] airflow: migrate to the image defined in the airflow-dags repo
Change #1203380 merged by jenkins-bot:
[operations/deployment-charts@master] airflow: deploy the image tested on test-k8s to all instances
Change #1203387 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-analytics-test: use the common airflow image
Change #1203387 merged by Brouberol:
[operations/deployment-charts@master] airflow-analytics-test: use the common airflow image
I've redeployed the new airflow-dags image to every production airflow instance:
root@deploy2002:~# kubectl get pod -A -l app=airflow,component=scheduler -o json | jq -r '.items[].spec.containers[0].image' | sort | uniq -c
10 docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-11-04-144729-15a48fbbf17c764c194e0acf87df0e389226d59d@sha256:de985e6946b2a66a49a567700e92bdbf5e1a2b744d84cfa14d65e6c7f8f6cb69
brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1791
Add airflow dags and root airflow directories to the PYTHONPATH
Change #1203451 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: assume the PYTHONPATH env var is defined in the airflow image
I had to roll back to the previous image, as all NamedHivePartitionSensor tasks started failing with:
[2025-11-10, 13:31:04 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
return ExecutionCallableRunner(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/sensors/base.py", line 309, in execute
raise e
File "/home/app/.local/lib/python3.11/site-packages/airflow/sensors/base.py", line 289, in execute
poke_return = self.poke(context)
^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/sensors/named_hive_partition.py", line 102, in poke
if not self.poke_partition(self.partition_names[self.next_index_to_poke]):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/sensors/named_hive_partition.py", line 95, in poke_partition
return self.hook.check_for_named_partition(schema, table, partition)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 649, in check_for_named_partition
return client.check_for_named_partition(schema, table, partition_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/app/.local/lib/python3.11/site-packages/hmsclient/hmsclient.py", line 152, in check_for_named_partition
self.get_partition_by_name(db_name, table_name, partition)
File "/home/app/.local/lib/python3.11/site-packages/hmsclient/genthrift/hive_metastore/ThriftHiveMetastore.py", line 3285, in get_partition_by_name
self.send_get_partition_by_name(db_name, tbl_name, part_name)
File "/home/app/.local/lib/python3.11/site-packages/hmsclient/genthrift/hive_metastore/ThriftHiveMetastore.py", line 3296, in send_get_partition_by_name
self._oprot.trans.flush()
File "/home/app/.local/lib/python3.11/site-packages/thrift_sasl/__init__.py", line 132, in flush
raise TTransportException(type=TTransportException.UNKNOWN,
thrift.transport.TTransport.TTransportException: (('Invalid token was supplied', 589824), ('Token header is malformed or corrupt', -2045022964))
brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1792
Add the missing sasl library
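The TTransportException points at the SASL layer of the Thrift transport, consistent with the sasl package missing from the image. A quick, hedged way to check which Hive client dependencies are actually importable in a given image (the module names below are the ones appearing in the traceback above):

```python
import importlib.util

def check_deps(names):
    """Return which of the given modules can be resolved in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Modules from the failing NamedHivePartitionSensor traceback.
status = check_deps(["thrift", "thrift_sasl", "sasl", "hmsclient"])
for name, present in status.items():
    print(f"{name}: {'present' if present else 'MISSING'}")
```

Running this inside the candidate image before deploying would have flagged the missing sasl module without needing a live Hive metastore connection.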
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1792
Update pyhive to make it compatible with python3.11
brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1798
Draft: Build the sasl library from source
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1798
Build the sasl library from source
brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1800
test_k8s: add a hive partition sensor DAG
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1800
test_k8s: add a hive partition sensor DAG
Change #1204610 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow-test-k8s: test the new image with sasl compiled for python3.11
Change #1204610 merged by Brouberol:
[operations/deployment-charts@master] airflow-test-k8s: test the new image with sasl compiled for python3.11
Change #1204864 had a related patch set uploaded (by Brouberol; author: Brouberol):
[operations/deployment-charts@master] airflow: release new image
Change #1204864 merged by Brouberol:
[operations/deployment-charts@master] airflow: release new image
We've built a new image with the sasl package compiled for python3.11, and it seems to solve the Hive connection issue we've been seeing, as shown by the hive_partition_sensor DAG on airflow-test-k8s.wikimedia.org. I've deployed the image to airflow-analytics-test and will let it simmer until Monday.
brouberol opened https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/134
Remove the repos/data-engineering/airflow repository
brouberol opened https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/63
Mention repo archiving
brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow/-/merge_requests/63
Mention repo archiving
All airflow instances have been redeployed with the new image. NamedHivePartitionSensor tasks are working as expected.
I've archived https://gitlab.wikimedia.org/repos/data-engineering/airflow now that it's no longer required.
gehel merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1791
Add airflow dags and root airflow directories to the PYTHONPATH
Change #1203451 merged by Gehel:
[operations/deployment-charts@master] airflow: assume the PYTHONPATH env var is defined in the airflow image
Change #1206380 had a related patch set uploaded (by Gehel; author: Gehel):
[operations/deployment-charts@master] airflow: update base image
Change #1206380 merged by Gehel:
[operations/deployment-charts@master] airflow: update base image
Change #1206399 had a related patch set uploaded (by Gehel; author: Gehel):
[operations/deployment-charts@master] Airflow: bump image version
Change #1206399 merged by Gehel:
[operations/deployment-charts@master] Airflow: bump image version
dancy merged https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/merge_requests/134
Remove the repos/data-engineering/airflow repository
I'll now redeploy a new chart without the PYTHONPATH env var definition, as well as a new image with PYTHONPATH defined in the blubberfile. That should get us to a situation in which a Python version change in the image will not require a chart change. After that, we should be able to close this task.
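Once PYTHONPATH is baked into the image, a sanity check like the following (run inside a container from the new image) can confirm the chart no longer needs to define it. This is a generic sketch; the actual PYTHONPATH entries come from the blubberfile, not from anything assumed here:

```python
import os
import sys

# PYTHONPATH should now come from the image itself, not the chart.
pythonpath = os.environ.get("PYTHONPATH", "")
print(f"PYTHONPATH={pythonpath!r}")

# Every entry in PYTHONPATH is prepended to sys.path at interpreter
# startup, so each one should appear there.
for entry in filter(None, pythonpath.split(os.pathsep)):
    print(f"{entry}: {'on sys.path' if entry in sys.path else 'NOT on sys.path'}")
```

If an entry prints as NOT on sys.path, the env var was set after interpreter startup (e.g. only in the chart) rather than in the image environment.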




