Event Timeline
After setting container_resources to request 1 AMD GPU, the DAG fails, as shown in the Airflow logs further below.
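For context, only the GPU was declared in container_resources for this run, so the namespace LimitRanger filled in its default CPU and memory values (100m / 100Mi, visible in the remote_pod dump in the log). A minimal sketch of that configuration, assuming the standard KubernetesPodOperator-style container_resources argument rather than the exact DAG code:

from kubernetes.client import models as k8s

# Sketch only: just the AMD GPU is requested. CPU and memory are left unset, so the
# namespace LimitRanger defaults (100m CPU / 100Mi memory) end up on the container.
container_resources = k8s.V1ResourceRequirements(
    requests={"amd.com/gpu": "1"},
    limits={"amd.com/gpu": "1"},
)

The failing run's Airflow logs: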
example-ml-training-pipeline-train-model-3t3x2tl3
[2025-09-04, 14:53:04 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-09-04, 14:53:05 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 14:53:05 UTC] {pod.py:1276} INFO - Building pod train-model-1zbbame with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T144543.8750550000-05c084189', 'kubernetes_pod_operator': 'True', 'try_number': '2'}
[2025-09-04, 14:53:05 UTC] {pod.py:573} INFO - Found matching pod train-model-1zbbame with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T144543.8750550000-05c084189', 'task_id': 'train_model', 'try_number': '2'}
[2025-09-04, 14:53:05 UTC] {pod.py:574} INFO - `try_number` of task_instance: 2
[2025-09-04, 14:53:05 UTC] {pod.py:575} INFO - `try_number` of pod: 2
[2025-09-04, 14:53:05 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-1zbbame to dse-k8s-worker1001.eqiad.wmnet from None
[2025-09-04, 14:53:05 UTC] {pod_manager.py:410} ▼ Waiting until 120s to get the POD scheduled...
[2025-09-04, 14:53:05 UTC] {pod_manager.py:431} INFO - Waiting 120s to get the POD running...
[2025-09-04, 14:53:10 UTC] {pod_manager.py:390} INFO - The Pod has an Event: AttachVolume.Attach succeeded for volume "pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116" from None
[2025-09-04, 14:53:15 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Pulling image "docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476" from spec.containers{base}
[2025-09-04, 14:55:06 UTC] {pod_manager.py:434} ▲▲▲ Log group end
[2025-09-04, 14:55:06 UTC] {pod.py:1122} INFO - Deleting pod: train-model-1zbbame
[2025-09-04, 14:55:06 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 680, in execute_sync
self.await_pod_start(pod=self.pod)
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 611, in await_pod_start
loop.run_until_complete(
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 435, in await_pod_start
raise PodLaunchFailedException(
airflow.providers.cncf.kubernetes.utils.pod_manager.PodLaunchFailedException: Pod took too long to start. More than 120s. Check the pod events in kubernetes.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
return ExecutionCallableRunner(
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
return self.func(*args, **kwargs)
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
return func(self, *args, **kwargs)
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
return self.execute_sync(context)
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
self.cleanup(
File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
raise AirflowException(
airflow.exceptions.AirflowException: Pod train-model-1zbbame returned a failure.
remote_pod: {'api_version': None,
'kind': None,
'metadata': {'annotations': {'container.seccomp.security.alpha.kubernetes.io/base': 'runtime/default',
'kubernetes.io/limit-ranger': 'LimitRanger '
'plugin set: cpu, '
'memory request '
'for container '
'base; cpu, memory '
'limit for '
'container base'},
'creation_timestamp': datetime.datetime(2025, 9, 4, 14, 53, 5, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': None,
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'airflow_kpo_in_cluster': 'True',
'airflow_version': '2.10.5',
'app': 'airflow',
'component': 'task-pod',
'dag_id': 'example_ml_training_pipeline',
'kubernetes_pod_operator': 'True',
'release': 'dev-kevinbazira',
'routed_via': 'dev-kevinbazira',
'run_id': 'manual__2025-09-04T144543.8750550000-05c084189',
'task_id': 'train_model',
'try_number': '2'},
'managed_fields': [{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:labels': {'.': {},
'f:airflow_kpo_in_cluster': {},
'f:airflow_version': {},
'f:app': {},
'f:component': {},
'f:dag_id': {},
'f:kubernetes_pod_operator': {},
'f:release': {},
'f:routed_via': {},
'f:run_id': {},
'f:task_id': {},
'f:try_number': {}}},
'f:spec': {'f:affinity': {},
'f:containers': {'k:{"name":"base"}': {'.': {},
'f:args': {},
'f:command': {},
'f:env': {'.': {},
'k:{"name":"AWS_REQUEST_CHECKSUM_CALCULATION"}': {'.': {},
'f:name': {},
'f:value': {}},
'k:{"name":"AWS_RESPONSE_CHECKSUM_VALIDATION"}': {'.': {},
'f:name': {},
'f:value': {}},
'k:{"name":"REQUESTS_CA_BUNDLE"}': {'.': {},
'f:name': {},
'f:value': {}}},
'f:image': {},
'f:imagePullPolicy': {},
'f:name': {},
'f:resources': {'.': {},
'f:limits': {'.': {},
'f:amd.com/gpu': {}},
'f:requests': {'.': {},
'f:amd.com/gpu': {}}},
'f:securityContext': {'.': {},
'f:allowPrivilegeEscalation': {},
'f:capabilities': {'.': {},
'f:drop': {}},
'f:runAsNonRoot': {},
'f:seccompProfile': {'.': {},
'f:type': {}}},
'f:terminationMessagePath': {},
'f:terminationMessagePolicy': {},
'f:volumeMounts': {'.': {},
'k:{"mountPath":"/mnt/model-training"}': {'.': {},
'f:mountPath': {},
'f:name': {}}}}},
'f:dnsPolicy': {},
'f:enableServiceLinks': {},
'f:priorityClassName': {},
'f:restartPolicy': {},
'f:schedulerName': {},
'f:securityContext': {},
'f:terminationGracePeriodSeconds': {},
'f:volumes': {'.': {},
'k:{"name":"airflow-ml-model-training-volume"}': {'.': {},
'f:name': {},
'f:persistentVolumeClaim': {'.': {},
'f:claimName': {}}}}}},
'manager': 'OpenAPI-Generator',
'operation': 'Update',
'subresource': None,
'time': datetime.datetime(2025, 9, 4, 14, 53, 5, tzinfo=tzlocal())}],
'name': 'train-model-1zbbame',
'namespace': 'airflow-dev',
'owner_references': None,
'resource_version': '745273500',
'self_link': None,
'uid': '86c42517-62cc-4dd4-aca9-1e4fd06872f9'},
'spec': {'active_deadline_seconds': None,
'affinity': {'node_affinity': None,
'pod_affinity': None,
'pod_anti_affinity': None},
'automount_service_account_token': None,
'containers': [{'args': ['ls -l /mnt/model-training && python3 '
'training/example/train/src/example/train_model.py '
'--input-path '
'/mnt/model-training/example_etl_output.parquet '
'--output-model-path '
'/mnt/model-training/model.pkl && ls -l '
'/mnt/model-training'],
'command': ['bash', '-c'],
'env': [{'name': 'REQUESTS_CA_BUNDLE',
'value': '/etc/ssl/certs/ca-certificates.crt',
'value_from': None},
{'name': 'AWS_REQUEST_CHECKSUM_CALCULATION',
'value': 'WHEN_REQUIRED',
'value_from': None},
{'name': 'AWS_RESPONSE_CHECKSUM_VALIDATION',
'value': 'WHEN_REQUIRED',
'value_from': None}],
'env_from': None,
'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476',
'image_pull_policy': 'IfNotPresent',
'lifecycle': None,
'liveness_probe': None,
'name': 'base',
'ports': None,
'readiness_probe': None,
'resize_policy': None,
'resources': {'claims': None,
'limits': {'amd.com/gpu': '1',
'cpu': '100m',
'memory': '100Mi'},
'requests': {'amd.com/gpu': '1',
'cpu': '100m',
'memory': '100Mi'}},
'restart_policy': None,
'security_context': {'allow_privilege_escalation': False,
'app_armor_profile': None,
'capabilities': {'add': None,
'drop': ['ALL']},
'privileged': None,
'proc_mount': None,
'read_only_root_filesystem': None,
'run_as_group': None,
'run_as_non_root': True,
'run_as_user': None,
'se_linux_options': None,
'seccomp_profile': {'localhost_profile': None,
'type': 'RuntimeDefault'},
'windows_options': None},
'startup_probe': None,
'stdin': None,
'stdin_once': None,
'termination_message_path': '/dev/termination-log',
'termination_message_policy': 'File',
'tty': None,
'volume_devices': None,
'volume_mounts': [{'mount_path': '/mnt/model-training',
'mount_propagation': None,
'name': 'airflow-ml-model-training-volume',
'read_only': None,
'recursive_read_only': None,
'sub_path': None,
'sub_path_expr': None},
{'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount',
'mount_propagation': None,
'name': 'kube-api-access-clmn4',
'read_only': True,
'recursive_read_only': None,
'sub_path': None,
'sub_path_expr': None}],
'working_dir': None}],
'dns_config': None,
'dns_policy': 'ClusterFirst',
'enable_service_links': True,
'ephemeral_containers': None,
'host_aliases': None,
'host_ipc': None,
'host_network': None,
'host_pid': None,
'host_users': None,
'hostname': None,
'image_pull_secrets': None,
'init_containers': None,
'node_name': 'dse-k8s-worker1001.eqiad.wmnet',
'node_selector': None,
'os': None,
'overhead': None,
'preemption_policy': 'PreemptLowerPriority',
'priority': -100,
'priority_class_name': 'low-priority-pod',
'readiness_gates': None,
'resource_claims': None,
'resources': None,
'restart_policy': 'Never',
'runtime_class_name': None,
'scheduler_name': 'default-scheduler',
'scheduling_gates': None,
'security_context': {'app_armor_profile': None,
'fs_group': None,
'fs_group_change_policy': None,
'run_as_group': None,
'run_as_non_root': None,
'run_as_user': None,
'se_linux_change_policy': None,
'se_linux_options': None,
'seccomp_profile': None,
'supplemental_groups': None,
'supplemental_groups_policy': None,
'sysctls': None,
'windows_options': None},
'service_account': 'default',
'service_account_name': 'default',
'set_hostname_as_fqdn': None,
'share_process_namespace': None,
'subdomain': None,
'termination_grace_period_seconds': 30,
'tolerations': [{'effect': 'NoExecute',
'key': 'node.kubernetes.io/not-ready',
'operator': 'Exists',
'toleration_seconds': 300,
'value': None},
{'effect': 'NoExecute',
'key': 'node.kubernetes.io/unreachable',
'operator': 'Exists',
'toleration_seconds': 300,
'value': None}],
'topology_spread_constraints': None,
'volumes': [{'aws_elastic_block_store': None,
'azure_disk': None,
'azure_file': None,
'cephfs': None,
'cinder': None,
'config_map': None,
'csi': None,
'downward_api': None,
'empty_dir': None,
'ephemeral': None,
'fc': None,
'flex_volume': None,
'flocker': None,
'gce_persistent_disk': None,
'git_repo': None,
'glusterfs': None,
'host_path': None,
'image': None,
'iscsi': None,
'name': 'airflow-ml-model-training-volume',
'nfs': None,
'persistent_volume_claim': {'claim_name': 'airflow-ml-model-training',
'read_only': None},
'photon_persistent_disk': None,
'portworx_volume': None,
'projected': None,
'quobyte': None,
'rbd': None,
'scale_io': None,
'secret': None,
'storageos': None,
'vsphere_volume': None},
{'aws_elastic_block_store': None,
'azure_disk': None,
'azure_file': None,
'cephfs': None,
'cinder': None,
'config_map': None,
'csi': None,
'downward_api': None,
'empty_dir': None,
'ephemeral': None,
'fc': None,
'flex_volume': None,
'flocker': None,
'gce_persistent_disk': None,
'git_repo': None,
'glusterfs': None,
'host_path': None,
'image': None,
'iscsi': None,
'name': 'kube-api-access-clmn4',
'nfs': None,
'persistent_volume_claim': None,
'photon_persistent_disk': None,
'portworx_volume': None,
'projected': {'default_mode': 420,
'sources': [{'cluster_trust_bundle': None,
'config_map': None,
'downward_api': None,
'secret': None,
'service_account_token': {'audience': None,
'expiration_seconds': 3607,
'path': 'token'}},
{'cluster_trust_bundle': None,
'config_map': {'items': [{'key': 'ca.crt',
'mode': None,
'path': 'ca.crt'}],
'name': 'kube-root-ca.crt',
'optional': None},
'downward_api': None,
'secret': None,
'service_account_token': None},
{'cluster_trust_bundle': None,
'config_map': None,
'downward_api': {'items': [{'field_ref': {'api_version': 'v1',
'field_path': 'metadata.namespace'},
'mode': None,
'path': 'namespace',
'resource_field_ref': None}]},
'secret': None,
'service_account_token': None}]},
'quobyte': None,
'rbd': None,
'scale_io': None,
'secret': None,
'storageos': None,
'vsphere_volume': None}]},
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 14, 53, 5, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': None,
'ephemeral_container_statuses': None,
'host_i_ps': None,
'host_ip': None,
'init_container_statuses': None,
'message': None,
'nominated_node_name': None,
'phase': 'Pending',
'pod_i_ps': None,
'pod_ip': None,
'qos_class': 'Guaranteed',
'reason': None,
'resize': None,
'resource_claim_statuses': None,
'start_time': None}}
[2025-09-04, 14:55:06 UTC] {taskinstance.py:1226} INFO - Marking task as UP_FOR_RETRY. dag_id=example_ml_training_pipeline, task_id=train_model, run_id=manual__2025-09-04T14:45:43.875055+00:00, execution_date=20250904T144543, start_date=20250904T145304, end_date=20250904T145506
[2025-09-04, 14:55:06 UTC] {taskinstance.py:341} ▶ Post task execution logs
I started the DAG while watching the pod status changes, and it looks like the WMFKubernetesPodOperator creates the task pod fine, but the pod created by the task then runs into a memory issue and is OOMKilled:
kevinbazira@deploy1003:~$ kube_env airflow-dev dse-k8s-eqiad
kevinbazira@deploy1003:~$ kubectl get pods -w
NAME                                                 READY   STATUS              RESTARTS   AGE
...
example-ml-training-pipeline-train-model-0mkmkk47    0/1     Pending             0          0s
example-ml-training-pipeline-train-model-0mkmkk47    0/1     Pending             0          0s
example-ml-training-pipeline-train-model-0mkmkk47    0/1     ContainerCreating   0          1s
example-ml-training-pipeline-train-model-0mkmkk47    0/1     ContainerCreating   0          13s
example-ml-training-pipeline-train-model-0mkmkk47    1/1     Running             0          13s
train-model-ybumr3o                                   0/1     Pending             0          0s
train-model-ybumr3o                                   0/1     Pending             0          0s
train-model-ybumr3o                                   0/1     ContainerCreating   0          0s
train-model-ybumr3o                                   0/1     ContainerCreating   0          10s
train-model-ybumr3o                                   1/1     Running             0          11s
train-model-ybumr3o                                   0/1     OOMKilled           0          20s
train-model-ybumr3o                                   0/1     OOMKilled           0          21s
train-model-ybumr3o                                   0/1     OOMKilled           0          22s
train-model-ybumr3o                                   0/1     OOMKilled           0          23s
train-model-ybumr3o                                   0/1     Terminating         0          23s
train-model-ybumr3o                                   0/1     Terminating         0          23s
...
After setting container_resources for CPU, memory, and GPU, the DAG succeeded, as shown in the logs further below.
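For comparison, a sketch of the working shape of container_resources, now with explicit CPU and memory alongside the GPU so the LimitRanger defaults no longer apply. The CPU and memory figures here are purely illustrative placeholders; the values actually used for this run are not shown in the logs:

from kubernetes.client import models as k8s

# Sketch only: explicit CPU/memory plus the GPU. The "4" / "8Gi" values are
# hypothetical placeholders, not the values used in the real DAG.
container_resources = k8s.V1ResourceRequirements(
    requests={"cpu": "4", "memory": "8Gi", "amd.com/gpu": "1"},
    limits={"cpu": "4", "memory": "8Gi", "amd.com/gpu": "1"},
)

The successful run's Airflow logs: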
example-ml-training-pipeline-train-model-fnlyouii
[2025-09-04, 16:50:17 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-09-04, 16:50:18 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 16:50:18 UTC] {pod.py:1276} INFO - Building pod train-model-bc4ar60 with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T165010.5003430000-03df633e4', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2025-09-04, 16:50:18 UTC] {pod.py:573} INFO - Found matching pod train-model-bc4ar60 with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T165010.5003430000-03df633e4', 'task_id': 'train_model', 'try_number': '1'}
[2025-09-04, 16:50:18 UTC] {pod.py:574} INFO - `try_number` of task_instance: 1
[2025-09-04, 16:50:18 UTC] {pod.py:575} INFO - `try_number` of pod: 1
[2025-09-04, 16:50:18 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-bc4ar60 to dse-k8s-worker1001.eqiad.wmnet from None
[2025-09-04, 16:50:18 UTC] {pod_manager.py:410} ▼ Waiting until 120s to get the POD scheduled...
[2025-09-04, 16:50:18 UTC] {pod_manager.py:431} INFO - Waiting 120s to get the POD running...
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: AttachVolume.Attach succeeded for volume "pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116" from None
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Container image "docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476" already present on machine from spec.containers{base}
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Created container base from spec.containers{base}
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Started container base from spec.containers{base}
[2025-09-04, 16:50:28 UTC] {pod_manager.py:418} ▲▲▲ Log group end
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Sep 4 09:37 example_etl_output.parquet
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrws--- 2 root runuser 16384 Aug 28 12:59 lost+found
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 85 Sep 4 12:25 model.pkl
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 0 Sep 4 09:37 test_write.txt
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 3 runuser runuser 4096 Aug 28 13:00 training
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] CUDA is available.
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] CUDA device count: 1
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] CUDA device name: AMD Radeon Pro WX 9100
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] Loaded DataFrame with shape: (200, 4)
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] Dummy model artifact written to /mnt/model-training/model.pkl
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Sep 4 09:37 example_etl_output.parquet
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrws--- 2 root runuser 16384 Aug 28 12:59 lost+found
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 85 Sep 4 16:50 model.pkl
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 0 Sep 4 09:37 test_write.txt
[2025-09-04, 16:50:33 UTC] {pod_manager.py:555} INFO - [base] drwxrwsr-x 3 runuser runuser 4096 Aug 28 13:00 training
[2025-09-04, 16:50:33 UTC] {pod.py:1122} INFO - Deleting pod: train-model-bc4ar60
[2025-09-04, 16:50:33 UTC] {taskinstance.py:341} ▶ Post task execution logs

Huge thanks to @brouberol, @klausman, and @elukey for helping with this <3
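As a side note, the "CUDA is available" lines in the successful log presumably come from torch.cuda-style checks in the training script: ROCm builds of PyTorch expose AMD GPUs (here an AMD Radeon Pro WX 9100) through the torch.cuda API. An illustrative sketch of such a check, not the actual train_model.py code:

import torch

# Illustrative only: ROCm builds of PyTorch report AMD GPUs via the torch.cuda namespace.
if torch.cuda.is_available():
    print("CUDA is available.")
    print(f"CUDA device count: {torch.cuda.device_count()}")
    print(f"CUDA device name: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU visible to PyTorch.")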