Deploy airflow images from airflow-dags repository build
Closed, Resolved, Public

Authored By: Stevemunene, Oct 29 2025, 3:33 PM

Description

For a while now, we have built the base images for our airflow instances using the process documented at https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Upgrading. Now that the airflow-dags repo has been refactored to use poetry, we want to test the images built from that repository before relying on them for future operations.
For this test, we shall build the image from the current repository and deploy it on airflow-test-k8s to check both the image itself and overall airflow functionality.

Event Timeline

Change #1199819 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build

https://gerrit.wikimedia.org/r/1199819

Change #1199819 merged by jenkins-bot:

[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build

https://gerrit.wikimedia.org/r/1199819

Deployed this, but getting an error on kerberos:

stevemunene@deploy2002:~$ kubectl logs -f airflow-kerberos-776ff68846-6gcc9
Traceback (most recent call last):
  File "/home/app/.local/bin/airflow", line 5, in <module>
    from airflow.__main__ import main
ModuleNotFoundError: No module named 'airflow'

airflow-kerberos-776ff68846-6gcc9                                 0/1     CrashLoopBackOff        4 (14s ago)     2m30s
airflow-webserver-854cb66975-v66ml                                0/2     Init:CrashLoopBackOff   4 (17s ago)     2m39s
airflow-kerberos-776ff68846-6gcc9                                 0/1     Error                   5 (92s ago)     3m48s
airflow-webserver-854cb66975-v66ml                                0/2     Init:Error              5 (89s ago)     3m51s
airflow-scheduler-d77cb95b7-hhmd5                                 0/1     Error                   5 (87s ago)     3m51s
airflow-scheduler-d77cb95b7-hhmd5                                 0/1     CrashLoopBackOff        5 (7s ago)      3m58s
airflow-kerberos-776ff68846-6gcc9                                 0/1     CrashLoopBackOff        5 (13s ago)     3m59s
airflow-webserver-854cb66975-v66ml                                0/2     Init:CrashLoopBackOff   5 (15s ago)     4m

Looking into this.
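
The traceback above shows the `airflow` entry point starting and then failing to import the `airflow` package, which typically means the interpreter's module search path does not include the directory the package was installed into. A minimal sketch of how one might check this from inside the container, assuming a Python interpreter is available there; the helper name `describe_module` is hypothetical and not part of the airflow tooling:

```python
import importlib.util
import os
import sys


def describe_module(name: str) -> dict:
    """Report whether `name` can be found on the current module search path."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    return {
        "module": name,
        "found": spec is not None,
        # File the module would be loaded from, or None if it was not found.
        "origin": getattr(spec, "origin", None),
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "sys_path": list(sys.path),
    }


report = describe_module("airflow")
print(f"airflow importable: {report['found']} (origin: {report['origin']})")
print(f"PYTHONPATH={report['pythonpath']!r}")
for entry in report["sys_path"]:
    print(f"  sys.path entry: {entry}")
```

If `found` is false while the package exists on disk, comparing the printed `sys.path` entries with the install location should show which directory is missing.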

Change #1200339 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] airflow: Update the pythonpath

https://gerrit.wikimedia.org/r/1200339

Change #1200339 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: Update the pythonpath

https://gerrit.wikimedia.org/r/1200339

Since we still have some dumps running, we shall continue the tests on a devenv and proceed on the test-k8s instance once all is verified.

Created a devenv to test the recent upgrade of flask-appbuilder, which is meant to solve some of the initial challenges we had accessing the connection/list/ and variable/list/ pages.
To confirm, these are currently accessible using the default image, as shown below:

image.png (342×743 px, 28 KB)

image.png (393×743 px, 38 KB)

To test the changes, I created a devenv with the latest image tag containing these changes, specifying image, executor_pod_image, version and executor_pod_image_version:

stevemunene@deploy2002:~$ airflow-devenv create --dags-folder test_k8s --branch=test_airflow_dags_build --set app.image=repos/data-engineering/airflow-dags --set app.version=airflow-2.10.5-py3.11-2025-11-04-144729-15a48fbbf17c764c194e0acf87df0e389226d59d@sha256:de985e6946b2a66a49a567700e92bdbf5e1a2b744d84cfa14d65e6c7f8f6cb69 --set app.executor_pod_image=repos/data-engineering/airflow-dags --set app.executor_pod_image_version=airflow-2.10.5-py3.11-2025-11-04-144729-15a48fbbf17c764c194e0acf87df0e389226d59d@sha256:de985e6946b2a66a49a567700e92bdbf5e1a2b744d84cfa14d65e6c7f8f6cb69
Airflow development environment provisioning parameters:
- instance name: dev-stevemunene
- pulled git branch: test_airflow_dags_build
- airflow-dags folder: test_k8s

Installation progress:
- Creating PG database dev_stevemunene ✅
- Installing airflow dev environment dev-stevemunene ✅
- Waiting for the kerberos pod to start .......... ✅
Password for stevemunene@WIKIMEDIA: 
- Kerberos credentials successfully setup ✅
- Waiting for the scheduler pod to be ready .......✅
- Forcing DAG serialization ✅
- Waiting for the webserver pod to be ready ✅
- Creating admin user (username: admin, password: admin) ✅

Your airflow development environment is fully set up. You can now execute `airflow-devenv expose dev-stevemunene` to expose the UI on your local development host.
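
The create command above passes the same image and version twice, once for the airflow components and once for the executor pods. A minimal sketch of building that command from a single pair of values so the two cannot drift apart; the function name `devenv_create_command` is hypothetical, only the flags shown in the command above are used, and the version string in the example call is a placeholder rather than a real tag:

```python
import shlex


def devenv_create_command(dags_folder: str, branch: str, image: str, version: str) -> str:
    """Build an `airflow-devenv create` command that uses one image and
    version for both the airflow components and the executor pods."""
    overrides = {
        "app.image": image,
        "app.version": version,
        "app.executor_pod_image": image,
        "app.executor_pod_image_version": version,
    }
    args = ["airflow-devenv", "create", "--dags-folder", dags_folder, f"--branch={branch}"]
    for key, value in overrides.items():
        args += ["--set", f"{key}={value}"]
    # shlex.join quotes any argument that needs quoting for a POSIX shell.
    return shlex.join(args)


print(
    devenv_create_command(
        dags_folder="test_k8s",
        branch="test_airflow_dags_build",
        image="repos/data-engineering/airflow-dags",
        version="example-tag@sha256:0000",  # placeholder, not a real tag or digest
    )
)
```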

Then I exposed it to my local workstation:

stevemunene@deploy2002:~$ airflow-devenv expose dev-stevemunene
On your workstation, run the following command:
ssh -N deployment.eqiad.wmnet -L 127.0.0.1:8080:127.0.0.1:60587
then open http://localhost:8080

Forwarding from 127.0.0.1:60587 -> 8080

However, this is oddly giving connection refused errors. Looking into the logs to see why this might be the case.

➜  ~ ssh -N deploy2002.codfw.wmnet -L 127.0.0.1:8080:127.0.0.1:34331
channel 1: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
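
A `Connection refused` on a forwarded channel generally means nothing is listening on the port at the far end of the tunnel. A minimal sketch for checking, on the deployment host, whether a given local port accepts TCP connections before starting the tunnel; the helper name `port_is_open` is hypothetical, and the two ports are simply the ones that appear in the transcripts above:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# 60587 is the port printed by `airflow-devenv expose` above;
# 34331 is the port used in the failing ssh command.
for port in (60587, 34331):
    state = "open" if port_is_open("127.0.0.1", port) else "closed"
    print(f"127.0.0.1:{port} is {state}")
```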

The issue was with the port. I re-exposed the devenv and was able to access it:

stevemunene@deploy2002:~$ airflow-devenv expose dev-stevemunene
On your workstation, run the following command:
ssh -N deployment.eqiad.wmnet -L 127.0.0.1:8080:127.0.0.1:35731
then open http://localhost:8080

Forwarding from 127.0.0.1:35731 -> 8080

On the local host:

➜  ~ ssh -N deploy2002.codfw.wmnet  -L 127.0.0.1:8080:127.0.0.1:35731

The airflow image is running as expected

image.png (625×1 px, 94 KB)

with the previously inaccessible pages now accessible: http://localhost:8080/connection/list/
image.png (625×1 px, 40 KB)

and http://localhost:8080/variable/list/
image.png (625×1 px, 51 KB)

I ran the test_email_notification DAG, a DAG meant to fail in order to test email alerting, which worked as expected as well.
We should be okay to move testing back to airflow-test-k8s.

Change #1202106 had a related patch set uploaded (by Stevemunene; author: Stevemunene):

[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build

https://gerrit.wikimedia.org/r/1202106

Change #1202106 merged by Brouberol:

[operations/deployment-charts@master] Deploy airflow images from airflow-dags repository build

https://gerrit.wikimedia.org/r/1202106

I've deployed the new image to airflow-test-k8s and manually kicked off 5 DAGs, all of which ran successfully. I'm going to let things simmer over the weekend and roll the change out on Monday if everything runs fine.

Change #1203379 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: migrate to the image defined in the airflow-dags repo

https://gerrit.wikimedia.org/r/1203379

Change #1203380 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: deploy the image tested on test-k8s to all instances

https://gerrit.wikimedia.org/r/1203380

Change #1203379 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: migrate to the image defined in the airflow-dags repo

https://gerrit.wikimedia.org/r/1203379

Change #1203380 merged by jenkins-bot:

[operations/deployment-charts@master] airflow: deploy the image tested on test-k8s to all instances

https://gerrit.wikimedia.org/r/1203380

Change #1203387 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-analytics-test: use the common airflow image

https://gerrit.wikimedia.org/r/1203387

Change #1203387 merged by Brouberol:

[operations/deployment-charts@master] airflow-analytics-test: use the common airflow image

https://gerrit.wikimedia.org/r/1203387

I've redeployed the new airflow-dags image to every production airflow instance:

root@deploy2002:~# kubectl get pod -A -l app=airflow,component=scheduler -o json | jq -r '.items[].spec.containers[0].image' | sort | uniq -c
     10 docker-registry.discovery.wmnet/repos/data-engineering/airflow-dags:airflow-2.10.5-py3.11-2025-11-04-144729-15a48fbbf17c764c194e0acf87df0e389226d59d@sha256:de985e6946b2a66a49a567700e92bdbf5e1a2b744d84cfa14d65e6c7f8f6cb69
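
The same count can be done in Python rather than with jq, for example when the result needs to be asserted on in a script. A minimal sketch, assuming the input is the JSON produced by the `kubectl get pod ... -o json` command above; the function name `count_images` and the `registry.example` image names are made up for illustration:

```python
import json
from collections import Counter


def count_images(pod_list_json: str) -> Counter:
    """Count the image of the first container of every pod in a kubectl pod list."""
    pods = json.loads(pod_list_json).get("items", [])
    return Counter(pod["spec"]["containers"][0]["image"] for pod in pods)


# Tiny stand-in for real kubectl output: two pods running the same image.
sample = json.dumps(
    {
        "items": [
            {"spec": {"containers": [{"image": "registry.example/airflow:new"}]}},
            {"spec": {"containers": [{"image": "registry.example/airflow:new"}]}},
        ]
    }
)
counts = count_images(sample)
for image, count in sorted(counts.items()):
    print(f"{count:7d} {image}")
```

A rollout is complete when the counter holds a single key; more than one key means some instances are still on another image.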

Change #1203451 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: assume the PYTHONPATH env var is defined in the airflow image

https://gerrit.wikimedia.org/r/1203451

I had to roll back to the previous image, as all NamedHivePartitionSensor tasks started to fail with:

[2025-11-10, 13:31:04 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/sensors/base.py", line 309, in execute
    raise e
  File "/home/app/.local/lib/python3.11/site-packages/airflow/sensors/base.py", line 289, in execute
    poke_return = self.poke(context)
                  ^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/sensors/named_hive_partition.py", line 102, in poke
    if not self.poke_partition(self.partition_names[self.next_index_to_poke]):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/sensors/named_hive_partition.py", line 95, in poke_partition
    return self.hook.check_for_named_partition(schema, table, partition)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/airflow/providers/apache/hive/hooks/hive.py", line 649, in check_for_named_partition
    return client.check_for_named_partition(schema, table, partition_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/app/.local/lib/python3.11/site-packages/hmsclient/hmsclient.py", line 152, in check_for_named_partition
    self.get_partition_by_name(db_name, table_name, partition)
  File "/home/app/.local/lib/python3.11/site-packages/hmsclient/genthrift/hive_metastore/ThriftHiveMetastore.py", line 3285, in get_partition_by_name
    self.send_get_partition_by_name(db_name, tbl_name, part_name)
  File "/home/app/.local/lib/python3.11/site-packages/hmsclient/genthrift/hive_metastore/ThriftHiveMetastore.py", line 3296, in send_get_partition_by_name
    self._oprot.trans.flush()
  File "/home/app/.local/lib/python3.11/site-packages/thrift_sasl/__init__.py", line 132, in flush
    raise TTransportException(type=TTransportException.UNKNOWN,
thrift.transport.TTransport.TTransportException: (('Invalid token was supplied', 589824), ('Token header is malformed or corrupt', -2045022964))

Change #1204610 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-test-k8s: test the new image with sasl compiled for python3.11

https://gerrit.wikimedia.org/r/1204610

Change #1204610 merged by Brouberol:

[operations/deployment-charts@master] airflow-test-k8s: test the new image with sasl compiled for python3.11

https://gerrit.wikimedia.org/r/1204610

Change #1204864 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow: release new image

https://gerrit.wikimedia.org/r/1204864

Change #1204864 merged by Brouberol:

[operations/deployment-charts@master] airflow: release new image

https://gerrit.wikimedia.org/r/1204864

We've built a new image with the sasl package compiled for python3.11, and it seems to solve the hive connection issue we've been seeing, as shown in airflow-test-k8s.wikimedia.org / hive_partition_sensorl. I've deployed the image to airflow-analytics-test and will let it simmer until Monday.

All airflow instances have been redeployed with the new image. NamedHivePartitionSensor tasks are working as expected.

Change #1203451 merged by Gehel:

[operations/deployment-charts@master] airflow: assume the PYTHONPATH env var is defined in the airflow image

https://gerrit.wikimedia.org/r/1203451

Change #1206380 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/deployment-charts@master] airflow: update base image

https://gerrit.wikimedia.org/r/1206380

Change #1206380 merged by Gehel:

[operations/deployment-charts@master] airflow: update base image

https://gerrit.wikimedia.org/r/1206380

Change #1206399 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/deployment-charts@master] Airflow: bump image version

https://gerrit.wikimedia.org/r/1206399

Change #1206399 merged by Gehel:

[operations/deployment-charts@master] Airflow: bump image version

https://gerrit.wikimedia.org/r/1206399

I'll now deploy a new chart without the PYTHONPATH env var definition, together with a new image in which PYTHONPATH is defined in the blubberfile. After this, a python version change in the image should no longer require a chart change. Once that is done, we should be able to close this task.
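
The tracebacks earlier in this task show packages installed under /home/app/.local/lib/python3.11/site-packages, a path that embeds the python minor version. Assuming the PYTHONPATH value pointed at a directory of that shape (an assumption, since the chart value itself is not shown here), the sketch below illustrates why hard-coding it in the chart couples the chart to the image's python version; the helper name `user_site_packages` is hypothetical:

```python
import sys


def user_site_packages(home: str, version: tuple = sys.version_info[:2]) -> str:
    """Return the per-user site-packages directory for a given python version."""
    major, minor = version
    return f"{home}/.local/lib/python{major}.{minor}/site-packages"


# The value changes with every python minor version bump, so it is easier
# to keep correct when it is defined next to the interpreter, in the image.
print(user_site_packages("/home/app", (3, 11)))
print(user_site_packages("/home/app", (3, 12)))
```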

The new image has been deployed everywhere!