[2025-09-04, 13:41:54 UTC] {local_task_job_runner.py:123} Pre task execution logs
[2025-09-04, 13:41:54 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 13:41:54 UTC] {pod.py:1276} INFO - Building pod train-model-u726z1o with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2025-09-04, 13:41:54 UTC] {pod.py:573} INFO - Found matching pod train-model-u726z1o with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737', 'task_id': 'train_model', 'try_number': '1'}
[2025-09-04, 13:41:54 UTC] {pod.py:574} INFO - `try_number` of task_instance: 1
[2025-09-04, 13:41:54 UTC] {pod.py:575} INFO - `try_number` of pod: 1
[2025-09-04, 13:41:55 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-u726z1o to dse-k8s-worker1017.eqiad.wmnet from None
[2025-09-04, 13:41:55 UTC] {pod_manager.py:410} Waiting until 120s to get the POD scheduled...
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 40, in <module>
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ROCm (AMD GPU) is available.
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] return get_device_properties(device).name
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] _lazy_init() # will define _get_device_properties
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] torch._C._cuda_init()
[2025-09-04, 13:42:26 UTC] {pod_manager.py:555} INFO - [base] RuntimeError: No HIP GPUs are available
[2025-09-04, 13:42:26 UTC] {pod_manager.py:582} WARNING - Pod train-model-u726z1o log read interrupted but container base still running. Logs generated in the last one second might get duplicated.
[2025-09-04, 13:42:28 UTC] {pod_manager.py:714} INFO - Pod train-model-u726z1o has phase Running
[2025-09-04, 13:42:30 UTC] {pod.py:1122} INFO - Deleting pod: train-model-u726z1o
[2025-09-04, 13:42:30 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
    return self.execute_sync(context)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
    self.cleanup(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod train-model-u726z1o returned a failure.