Paste P82559

Pod started by WMFKubernetesPodOperator has no access to AMD GPUs

Authored by kevinbazira on Sep 4 2025, 1:49 PM.
example-ml-training-pipeline-train-model-h0heesi0
[2025-09-04, 13:41:54 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-09-04, 13:41:54 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 13:41:54 UTC] {pod.py:1276} INFO - Building pod train-model-u726z1o with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2025-09-04, 13:41:54 UTC] {pod.py:573} INFO - Found matching pod train-model-u726z1o with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737', 'task_id': 'train_model', 'try_number': '1'}
[2025-09-04, 13:41:54 UTC] {pod.py:574} INFO - `try_number` of task_instance: 1
[2025-09-04, 13:41:54 UTC] {pod.py:575} INFO - `try_number` of pod: 1
[2025-09-04, 13:41:55 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-u726z1o to dse-k8s-worker1017.eqiad.wmnet from None
[2025-09-04, 13:41:55 UTC] {pod_manager.py:410} ▶ Waiting until 120s to get the POD scheduled...
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Sep 4 09:37 example_etl_output.parquet
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] drwxrws--- 2 root runuser 16384 Aug 28 12:59 lost+found
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 85 Sep 4 12:25 model.pkl
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 0 Sep 4 09:37 test_write.txt
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 3 runuser runuser 4096 Aug 28 13:00 training
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] Traceback (most recent call last):
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 40, in <module>
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ROCm (AMD GPU) is available.
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ROCm device count: 0
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] main()
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 20, in main
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] print(f"ROCm device name: {torch.cuda.get_device_name(0)}")
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] return get_device_properties(device).name
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] _lazy_init() # will define _get_device_properties
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] torch._C._cuda_init()
[2025-09-04, 13:42:26 UTC] {pod_manager.py:555} INFO - [base] RuntimeError: No HIP GPUs are available
[2025-09-04, 13:42:26 UTC] {pod_manager.py:582} WARNING - Pod train-model-u726z1o log read interrupted but container base still running. Logs generated in the last one second might get duplicated.
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] Traceback (most recent call last):
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 40, in <module>
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ROCm (AMD GPU) is available.
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ROCm device count: 0
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] main()
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 20, in main
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] print(f"ROCm device name: {torch.cuda.get_device_name(0)}")
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] return get_device_properties(device).name
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] _lazy_init() # will define _get_device_properties
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] torch._C._cuda_init()
[2025-09-04, 13:42:27 UTC] {pod_manager.py:555} INFO - [base] RuntimeError: No HIP GPUs are available
[2025-09-04, 13:42:28 UTC] {pod_manager.py:714} INFO - Pod train-model-u726z1o has phase Running
[2025-09-04, 13:42:30 UTC] {pod.py:1122} INFO - Deleting pod: train-model-u726z1o
[2025-09-04, 13:42:30 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
    return self.execute_sync(context)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
    self.cleanup(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod train-model-u726z1o returned a failure.
remote_pod: {'api_version': 'v1',
'kind': 'Pod',
'metadata': {'annotations': {'cni.projectcalico.org/containerID': '511bb84917686605961322ad194937cd6d0807ecd75c88b651aa03bbabc5be92',
'cni.projectcalico.org/podIP': '',
'cni.projectcalico.org/podIPs': '',
'container.seccomp.security.alpha.kubernetes.io/base': 'runtime/default'},
'creation_timestamp': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': None,
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'airflow_kpo_in_cluster': 'True',
'airflow_version': '2.10.5',
'app': 'airflow',
'component': 'task-pod',
'dag_id': 'example_ml_training_pipeline',
'kubernetes_pod_operator': 'True',
'release': 'dev-kevinbazira',
'routed_via': 'dev-kevinbazira',
'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737',
'task_id': 'train_model',
'try_number': '1'},
'managed_fields': [{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:labels': {'.': {},
'f:airflow_kpo_in_cluster': {},
'f:airflow_version': {},
'f:app': {},
'f:component': {},
'f:dag_id': {},
'f:kubernetes_pod_operator': {},
'f:release': {},
'f:routed_via': {},
'f:run_id': {},
'f:task_id': {},
'f:try_number': {}}},
'f:spec': {'f:affinity': {},
'f:containers': {'k:{"name":"base"}': {'.': {},
'f:args': {},
'f:command': {},
'f:env': {'.': {},
'k:{"name":"AWS_REQUEST_CHECKSUM_CALCULATION"}': {'.': {},
'f:name': {},
'f:value': {}},
'k:{"name":"AWS_RESPONSE_CHECKSUM_VALIDATION"}': {'.': {},
'f:name': {},
'f:value': {}},
'k:{"name":"REQUESTS_CA_BUNDLE"}': {'.': {},
'f:name': {},
'f:value': {}}},
'f:image': {},
'f:imagePullPolicy': {},
'f:name': {},
'f:resources': {'.': {},
'f:limits': {'.': {},
'f:cpu': {},
'f:memory': {}},
'f:requests': {'.': {},
'f:cpu': {},
'f:memory': {}}},
'f:securityContext': {'.': {},
'f:allowPrivilegeEscalation': {},
'f:capabilities': {'.': {},
'f:drop': {}},
'f:runAsNonRoot': {},
'f:seccompProfile': {'.': {},
'f:type': {}}},
'f:terminationMessagePath': {},
'f:terminationMessagePolicy': {},
'f:volumeMounts': {'.': {},
'k:{"mountPath":"/mnt/model-training"}': {'.': {},
'f:mountPath': {},
'f:name': {}}}}},
'f:dnsPolicy': {},
'f:enableServiceLinks': {},
'f:priorityClassName': {},
'f:restartPolicy': {},
'f:schedulerName': {},
'f:securityContext': {},
'f:terminationGracePeriodSeconds': {},
'f:volumes': {'.': {},
'k:{"name":"airflow-ml-model-training-volume"}': {'.': {},
'f:name': {},
'f:persistentVolumeClaim': {'.': {},
'f:claimName': {}}}}}},
'manager': 'OpenAPI-Generator',
'operation': 'Update',
'subresource': None,
'time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal())},
{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:annotations': {'f:cni.projectcalico.org/containerID': {},
'f:cni.projectcalico.org/podIP': {},
'f:cni.projectcalico.org/podIPs': {}}}},
'manager': 'Go-http-client',
'operation': 'Update',
'subresource': 'status',
'time': datetime.datetime(2025, 9, 4, 13, 42, 3, tzinfo=tzlocal())},
{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:status': {'f:conditions': {'k:{"type":"ContainersReady"}': {'.': {},
'f:lastProbeTime': {},
'f:lastTransitionTime': {},
'f:reason': {},
'f:status': {},
'f:type': {}},
'k:{"type":"Initialized"}': {'.': {},
'f:lastProbeTime': {},
'f:lastTransitionTime': {},
'f:status': {},
'f:type': {}},
'k:{"type":"Ready"}': {'.': {},
'f:lastProbeTime': {},
'f:lastTransitionTime': {},
'f:reason': {},
'f:status': {},
'f:type': {}}},
'f:containerStatuses': {},
'f:hostIP': {},
'f:phase': {},
'f:podIP': {},
'f:podIPs': {'.': {},
'k:{"ip":"10.67.30.60"}': {'.': {},
'f:ip': {}},
'k:{"ip":"2620:0:861:302:677:6c3a:553b:5fc"}': {'.': {},
'f:ip': {}}},
'f:startTime': {}}},
'manager': 'kubelet',
'operation': 'Update',
'subresource': 'status',
'time': datetime.datetime(2025, 9, 4, 13, 42, 27, tzinfo=tzlocal())}],
'name': 'train-model-u726z1o',
'namespace': 'airflow-dev',
'owner_references': None,
'resource_version': '745183545',
'self_link': None,
'uid': '6a3ae65b-7f38-4a48-b9ac-fb419e847320'},
'spec': {'active_deadline_seconds': None,
'affinity': {'node_affinity': None,
'pod_affinity': None,
'pod_anti_affinity': None},
'automount_service_account_token': None,
'containers': [{'args': ['ls -l /mnt/model-training && python3 '
'training/example/train/src/example/train_model.py '
'--input-path '
'/mnt/model-training/example_etl_output.parquet '
'--output-model-path '
'/mnt/model-training/model.pkl && ls -l '
'/mnt/model-training'],
'command': ['bash', '-c'],
'env': [{'name': 'REQUESTS_CA_BUNDLE',
'value': '/etc/ssl/certs/ca-certificates.crt',
'value_from': None},
{'name': 'AWS_REQUEST_CHECKSUM_CALCULATION',
'value': 'WHEN_REQUIRED',
'value_from': None},
{'name': 'AWS_RESPONSE_CHECKSUM_VALIDATION',
'value': 'WHEN_REQUIRED',
'value_from': None}],
'env_from': None,
'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476',
'image_pull_policy': 'IfNotPresent',
'lifecycle': None,
'liveness_probe': None,
'name': 'base',
'ports': None,
'readiness_probe': None,
'resize_policy': None,
'resources': {'claims': None,
'limits': {'cpu': '2', 'memory': '3Gi'},
'requests': {'cpu': '1',
'memory': '1500Mi'}},
'restart_policy': None,
'security_context': {'allow_privilege_escalation': False,
'app_armor_profile': None,
'capabilities': {'add': None,
'drop': ['ALL']},
'privileged': None,
'proc_mount': None,
'read_only_root_filesystem': None,
'run_as_group': None,
'run_as_non_root': True,
'run_as_user': None,
'se_linux_options': None,
'seccomp_profile': {'localhost_profile': None,
'type': 'RuntimeDefault'},
'windows_options': None},
'startup_probe': None,
'stdin': None,
'stdin_once': None,
'termination_message_path': '/dev/termination-log',
'termination_message_policy': 'File',
'tty': None,
'volume_devices': None,
'volume_mounts': [{'mount_path': '/mnt/model-training',
'mount_propagation': None,
'name': 'airflow-ml-model-training-volume',
'read_only': None,
'recursive_read_only': None,
'sub_path': None,
'sub_path_expr': None},
{'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount',
'mount_propagation': None,
'name': 'kube-api-access-t2jw5',
'read_only': True,
'recursive_read_only': None,
'sub_path': None,
'sub_path_expr': None}],
'working_dir': None}],
'dns_config': None,
'dns_policy': 'ClusterFirst',
'enable_service_links': True,
'ephemeral_containers': None,
'host_aliases': None,
'host_ipc': None,
'host_network': None,
'host_pid': None,
'host_users': None,
'hostname': None,
'image_pull_secrets': None,
'init_containers': None,
'node_name': 'dse-k8s-worker1017.eqiad.wmnet',
'node_selector': None,
'os': None,
'overhead': None,
'preemption_policy': 'PreemptLowerPriority',
'priority': -100,
'priority_class_name': 'low-priority-pod',
'readiness_gates': None,
'resource_claims': None,
'resources': None,
'restart_policy': 'Never',
'runtime_class_name': None,
'scheduler_name': 'default-scheduler',
'scheduling_gates': None,
'security_context': {'app_armor_profile': None,
'fs_group': None,
'fs_group_change_policy': None,
'run_as_group': None,
'run_as_non_root': None,
'run_as_user': None,
'se_linux_change_policy': None,
'se_linux_options': None,
'seccomp_profile': None,
'supplemental_groups': None,
'supplemental_groups_policy': None,
'sysctls': None,
'windows_options': None},
'service_account': 'default',
'service_account_name': 'default',
'set_hostname_as_fqdn': None,
'share_process_namespace': None,
'subdomain': None,
'termination_grace_period_seconds': 30,
'tolerations': [{'effect': 'NoExecute',
'key': 'node.kubernetes.io/not-ready',
'operator': 'Exists',
'toleration_seconds': 300,
'value': None},
{'effect': 'NoExecute',
'key': 'node.kubernetes.io/unreachable',
'operator': 'Exists',
'toleration_seconds': 300,
'value': None}],
'topology_spread_constraints': None,
'volumes': [{'aws_elastic_block_store': None,
'azure_disk': None,
'azure_file': None,
'cephfs': None,
'cinder': None,
'config_map': None,
'csi': None,
'downward_api': None,
'empty_dir': None,
'ephemeral': None,
'fc': None,
'flex_volume': None,
'flocker': None,
'gce_persistent_disk': None,
'git_repo': None,
'glusterfs': None,
'host_path': None,
'image': None,
'iscsi': None,
'name': 'airflow-ml-model-training-volume',
'nfs': None,
'persistent_volume_claim': {'claim_name': 'airflow-ml-model-training',
'read_only': None},
'photon_persistent_disk': None,
'portworx_volume': None,
'projected': None,
'quobyte': None,
'rbd': None,
'scale_io': None,
'secret': None,
'storageos': None,
'vsphere_volume': None},
{'aws_elastic_block_store': None,
'azure_disk': None,
'azure_file': None,
'cephfs': None,
'cinder': None,
'config_map': None,
'csi': None,
'downward_api': None,
'empty_dir': None,
'ephemeral': None,
'fc': None,
'flex_volume': None,
'flocker': None,
'gce_persistent_disk': None,
'git_repo': None,
'glusterfs': None,
'host_path': None,
'image': None,
'iscsi': None,
'name': 'kube-api-access-t2jw5',
'nfs': None,
'persistent_volume_claim': None,
'photon_persistent_disk': None,
'portworx_volume': None,
'projected': {'default_mode': 420,
'sources': [{'cluster_trust_bundle': None,
'config_map': None,
'downward_api': None,
'secret': None,
'service_account_token': {'audience': None,
'expiration_seconds': 3607,
'path': 'token'}},
{'cluster_trust_bundle': None,
'config_map': {'items': [{'key': 'ca.crt',
'mode': None,
'path': 'ca.crt'}],
'name': 'kube-root-ca.crt',
'optional': None},
'downward_api': None,
'secret': None,
'service_account_token': None},
{'cluster_trust_bundle': None,
'config_map': None,
'downward_api': {'items': [{'field_ref': {'api_version': 'v1',
'field_path': 'metadata.namespace'},
'mode': None,
'path': 'namespace',
'resource_field_ref': None}]},
'secret': None,
'service_account_token': None}]},
'quobyte': None,
'rbd': None,
'scale_io': None,
'secret': None,
'storageos': None,
'vsphere_volume': None}]},
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 42, 27, tzinfo=tzlocal()),
'message': None,
'reason': 'PodFailed',
'status': 'False',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 42, 27, tzinfo=tzlocal()),
'message': None,
'reason': 'PodFailed',
'status': 'False',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'allocated_resources': None,
'allocated_resources_status': None,
'container_id': 'containerd://cc4317edd3504105a422701df2822c10ba9b96c335e929b87af20bedfdb782b8',
'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476',
'image_id': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines@sha256:65f31f2bf9cf539521a2a67dec6352dcf59749eeaeec6918c41db6efa595250a',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'base',
'ready': False,
'resources': None,
'restart_count': 0,
'started': False,
'state': {'running': None,
'terminated': {'container_id': 'containerd://cc4317edd3504105a422701df2822c10ba9b96c335e929b87af20bedfdb782b8',
'exit_code': 1,
'finished_at': datetime.datetime(2025, 9, 4, 13, 42, 26, tzinfo=tzlocal()),
'message': None,
'reason': 'Error',
'signal': None,
'started_at': datetime.datetime(2025, 9, 4, 13, 42, 18, tzinfo=tzlocal())},
'waiting': None},
'user': None,
'volume_mounts': None}],
'ephemeral_container_statuses': None,
'host_i_ps': None,
'host_ip': '10.64.16.210',
'init_container_statuses': None,
'message': None,
'nominated_node_name': None,
'phase': 'Failed',
'pod_i_ps': [{'ip': '10.67.30.60'},
{'ip': '2620:0:861:302:677:6c3a:553b:5fc'}],
'pod_ip': '10.67.30.60',
'qos_class': 'Burstable',
'reason': None,
'resize': None,
'resource_claim_statuses': None,
'start_time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal())}}
[2025-09-04, 13:42:30 UTC] {taskinstance.py:1226} INFO - Marking task as UP_FOR_RETRY. dag_id=example_ml_training_pipeline, task_id=train_model, run_id=manual__2025-09-04T13:41:48.513453+00:00, execution_date=20250904T134148, start_date=20250904T134154, end_date=20250904T134230
[2025-09-04, 13:42:30 UTC] {taskinstance.py:341} ▶ Post task execution logs
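
The remote_pod dump above lists only cpu and memory under the container's resources, which is consistent with PyTorch reporting a ROCm device count of 0. A minimal sketch of how one might check a dumped container spec for a GPU request follows. The helper name requested_gpus and the with_gpu example are hypothetical and not part of the pipeline; amd.com/gpu is assumed to be the resource name advertised by the AMD GPU device plugin on this cluster.

```python
# Assumed extended-resource name for AMD GPUs (from the AMD device plugin).
AMD_GPU_RESOURCE = "amd.com/gpu"


def requested_gpus(container: dict, resource: str = AMD_GPU_RESOURCE) -> int:
    """Return how many GPUs a container spec asks for (0 if none).

    `container` is one entry of spec['containers'] from a remote_pod dump.
    Hypothetical helper for illustration only.
    """
    resources = container.get("resources") or {}
    limits = resources.get("limits") or {}
    requests = resources.get("requests") or {}
    # Look at limits first, then fall back to requests, then to "0".
    value = limits.get(resource, requests.get(resource, "0"))
    return int(value)


# Resources copied from the remote_pod dump above: no GPU entry.
cpu_only = {
    "name": "base",
    "resources": {
        "claims": None,
        "limits": {"cpu": "2", "memory": "3Gi"},
        "requests": {"cpu": "1", "memory": "1500Mi"},
    },
}

# Hypothetical variant with one AMD GPU added (illustrative values).
with_gpu = {
    "name": "base",
    "resources": {
        "limits": {"cpu": "2", "memory": "3Gi", AMD_GPU_RESOURCE: "1"},
        "requests": {"cpu": "1", "memory": "1500Mi", AMD_GPU_RESOURCE: "1"},
    },
}

print(requested_gpus(cpu_only))  # 0
print(requested_gpus(with_gpu))  # 1
```

A result of 0 for the dumped spec would suggest the scheduler had no reason to allocate a GPU device to the container, whichever node the pod landed on.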

Event Timeline

After setting container_resources to request 1 AMD GPU, the DAG still fails, this time because the pod does not start within the 120s launch timeout, as shown in the Airflow logs below:

example-ml-training-pipeline-train-model-3t3x2tl3
[2025-09-04, 14:53:04 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-09-04, 14:53:05 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 14:53:05 UTC] {pod.py:1276} INFO - Building pod train-model-1zbbame with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T144543.8750550000-05c084189', 'kubernetes_pod_operator': 'True', 'try_number': '2'}
[2025-09-04, 14:53:05 UTC] {pod.py:573} INFO - Found matching pod train-model-1zbbame with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T144543.8750550000-05c084189', 'task_id': 'train_model', 'try_number': '2'}
[2025-09-04, 14:53:05 UTC] {pod.py:574} INFO - `try_number` of task_instance: 2
[2025-09-04, 14:53:05 UTC] {pod.py:575} INFO - `try_number` of pod: 2
[2025-09-04, 14:53:05 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-1zbbame to dse-k8s-worker1001.eqiad.wmnet from None
[2025-09-04, 14:53:05 UTC] {pod_manager.py:410} ▼ Waiting until 120s to get the POD scheduled...
[2025-09-04, 14:53:05 UTC] {pod_manager.py:431} INFO - Waiting 120s to get the POD running...
[2025-09-04, 14:53:10 UTC] {pod_manager.py:390} INFO - The Pod has an Event: AttachVolume.Attach succeeded for volume "pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116"  from None
[2025-09-04, 14:53:15 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Pulling image "docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476" from spec.containers{base}
[2025-09-04, 14:55:06 UTC] {pod_manager.py:434} ▲▲▲ Log group end
[2025-09-04, 14:55:06 UTC] {pod.py:1122} INFO - Deleting pod: train-model-1zbbame
[2025-09-04, 14:55:06 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 680, in execute_sync
    self.await_pod_start(pod=self.pod)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 611, in await_pod_start
    loop.run_until_complete(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 435, in await_pod_start
    raise PodLaunchFailedException(
airflow.providers.cncf.kubernetes.utils.pod_manager.PodLaunchFailedException: Pod took too long to start. More than 120s. Check the pod events in kubernetes.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
    return self.execute_sync(context)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
    self.cleanup(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod train-model-1zbbame returned a failure.
remote_pod: {'api_version': None,
 'kind': None,
 'metadata': {'annotations': {'container.seccomp.security.alpha.kubernetes.io/base': 'runtime/default',
                              'kubernetes.io/limit-ranger': 'LimitRanger '
                                                            'plugin set: cpu, '
                                                            'memory request '
                                                            'for container '
                                                            'base; cpu, memory '
                                                            'limit for '
                                                            'container base'},
              'creation_timestamp': datetime.datetime(2025, 9, 4, 14, 53, 5, tzinfo=tzlocal()),
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': {'airflow_kpo_in_cluster': 'True',
                         'airflow_version': '2.10.5',
                         'app': 'airflow',
                         'component': 'task-pod',
                         'dag_id': 'example_ml_training_pipeline',
                         'kubernetes_pod_operator': 'True',
                         'release': 'dev-kevinbazira',
                         'routed_via': 'dev-kevinbazira',
                         'run_id': 'manual__2025-09-04T144543.8750550000-05c084189',
                         'task_id': 'train_model',
                         'try_number': '2'},
              'managed_fields': [{'api_version': 'v1',
                                  'fields_type': 'FieldsV1',
                                  'fields_v1': {'f:metadata': {'f:labels': {'.': {},
                                                                            'f:airflow_kpo_in_cluster': {},
                                                                            'f:airflow_version': {},
                                                                            'f:app': {},
                                                                            'f:component': {},
                                                                            'f:dag_id': {},
                                                                            'f:kubernetes_pod_operator': {},
                                                                            'f:release': {},
                                                                            'f:routed_via': {},
                                                                            'f:run_id': {},
                                                                            'f:task_id': {},
                                                                            'f:try_number': {}}},
                                                'f:spec': {'f:affinity': {},
                                                           'f:containers': {'k:{"name":"base"}': {'.': {},
                                                                                                  'f:args': {},
                                                                                                  'f:command': {},
                                                                                                  'f:env': {'.': {},
                                                                                                            'k:{"name":"AWS_REQUEST_CHECKSUM_CALCULATION"}': {'.': {},
                                                                                                                                                              'f:name': {},
                                                                                                                                                              'f:value': {}},
                                                                                                            'k:{"name":"AWS_RESPONSE_CHECKSUM_VALIDATION"}': {'.': {},
                                                                                                                                                              'f:name': {},
                                                                                                                                                              'f:value': {}},
                                                                                                            'k:{"name":"REQUESTS_CA_BUNDLE"}': {'.': {},
                                                                                                                                                'f:name': {},
                                                                                                                                                'f:value': {}}},
                                                                                                  'f:image': {},
                                                                                                  'f:imagePullPolicy': {},
                                                                                                  'f:name': {},
                                                                                                  'f:resources': {'.': {},
                                                                                                                  'f:limits': {'.': {},
                                                                                                                               'f:amd.com/gpu': {}},
                                                                                                                  'f:requests': {'.': {},
                                                                                                                                 'f:amd.com/gpu': {}}},
                                                                                                  'f:securityContext': {'.': {},
                                                                                                                        'f:allowPrivilegeEscalation': {},
                                                                                                                        'f:capabilities': {'.': {},
                                                                                                                                           'f:drop': {}},
                                                                                                                        'f:runAsNonRoot': {},
                                                                                                                        'f:seccompProfile': {'.': {},
                                                                                                                                             'f:type': {}}},
                                                                                                  'f:terminationMessagePath': {},
                                                                                                  'f:terminationMessagePolicy': {},
                                                                                                  'f:volumeMounts': {'.': {},
                                                                                                                     'k:{"mountPath":"/mnt/model-training"}': {'.': {},
                                                                                                                                                               'f:mountPath': {},
                                                                                                                                                               'f:name': {}}}}},
                                                           'f:dnsPolicy': {},
                                                           'f:enableServiceLinks': {},
                                                           'f:priorityClassName': {},
                                                           'f:restartPolicy': {},
                                                           'f:schedulerName': {},
                                                           'f:securityContext': {},
                                                           'f:terminationGracePeriodSeconds': {},
                                                           'f:volumes': {'.': {},
                                                                         'k:{"name":"airflow-ml-model-training-volume"}': {'.': {},
                                                                                                                           'f:name': {},
                                                                                                                           'f:persistentVolumeClaim': {'.': {},
                                                                                                                                                       'f:claimName': {}}}}}},
                                  'manager': 'OpenAPI-Generator',
                                  'operation': 'Update',
                                  'subresource': None,
                                  'time': datetime.datetime(2025, 9, 4, 14, 53, 5, tzinfo=tzlocal())}],
              'name': 'train-model-1zbbame',
              'namespace': 'airflow-dev',
              'owner_references': None,
              'resource_version': '745273500',
              'self_link': None,
              'uid': '86c42517-62cc-4dd4-aca9-1e4fd06872f9'},
 'spec': {'active_deadline_seconds': None,
          'affinity': {'node_affinity': None,
                       'pod_affinity': None,
                       'pod_anti_affinity': None},
          'automount_service_account_token': None,
          'containers': [{'args': ['ls -l /mnt/model-training && python3 '
                                   'training/example/train/src/example/train_model.py '
                                   '--input-path '
                                   '/mnt/model-training/example_etl_output.parquet '
                                   '--output-model-path '
                                   '/mnt/model-training/model.pkl && ls -l '
                                   '/mnt/model-training'],
                          'command': ['bash', '-c'],
                          'env': [{'name': 'REQUESTS_CA_BUNDLE',
                                   'value': '/etc/ssl/certs/ca-certificates.crt',
                                   'value_from': None},
                                  {'name': 'AWS_REQUEST_CHECKSUM_CALCULATION',
                                   'value': 'WHEN_REQUIRED',
                                   'value_from': None},
                                  {'name': 'AWS_RESPONSE_CHECKSUM_VALIDATION',
                                   'value': 'WHEN_REQUIRED',
                                   'value_from': None}],
                          'env_from': None,
                          'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476',
                          'image_pull_policy': 'IfNotPresent',
                          'lifecycle': None,
                          'liveness_probe': None,
                          'name': 'base',
                          'ports': None,
                          'readiness_probe': None,
                          'resize_policy': None,
                          'resources': {'claims': None,
                                        'limits': {'amd.com/gpu': '1',
                                                   'cpu': '100m',
                                                   'memory': '100Mi'},
                                        'requests': {'amd.com/gpu': '1',
                                                     'cpu': '100m',
                                                     'memory': '100Mi'}},
                          'restart_policy': None,
                          'security_context': {'allow_privilege_escalation': False,
                                               'app_armor_profile': None,
                                               'capabilities': {'add': None,
                                                                'drop': ['ALL']},
                                               'privileged': None,
                                               'proc_mount': None,
                                               'read_only_root_filesystem': None,
                                               'run_as_group': None,
                                               'run_as_non_root': True,
                                               'run_as_user': None,
                                               'se_linux_options': None,
                                               'seccomp_profile': {'localhost_profile': None,
                                                                   'type': 'RuntimeDefault'},
                                               'windows_options': None},
                          'startup_probe': None,
                          'stdin': None,
                          'stdin_once': None,
                          'termination_message_path': '/dev/termination-log',
                          'termination_message_policy': 'File',
                          'tty': None,
                          'volume_devices': None,
                          'volume_mounts': [{'mount_path': '/mnt/model-training',
                                             'mount_propagation': None,
                                             'name': 'airflow-ml-model-training-volume',
                                             'read_only': None,
                                             'recursive_read_only': None,
                                             'sub_path': None,
                                             'sub_path_expr': None},
                                            {'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount',
                                             'mount_propagation': None,
                                             'name': 'kube-api-access-clmn4',
                                             'read_only': True,
                                             'recursive_read_only': None,
                                             'sub_path': None,
                                             'sub_path_expr': None}],
                          'working_dir': None}],
          'dns_config': None,
          'dns_policy': 'ClusterFirst',
          'enable_service_links': True,
          'ephemeral_containers': None,
          'host_aliases': None,
          'host_ipc': None,
          'host_network': None,
          'host_pid': None,
          'host_users': None,
          'hostname': None,
          'image_pull_secrets': None,
          'init_containers': None,
          'node_name': 'dse-k8s-worker1001.eqiad.wmnet',
          'node_selector': None,
          'os': None,
          'overhead': None,
          'preemption_policy': 'PreemptLowerPriority',
          'priority': -100,
          'priority_class_name': 'low-priority-pod',
          'readiness_gates': None,
          'resource_claims': None,
          'resources': None,
          'restart_policy': 'Never',
          'runtime_class_name': None,
          'scheduler_name': 'default-scheduler',
          'scheduling_gates': None,
          'security_context': {'app_armor_profile': None,
                               'fs_group': None,
                               'fs_group_change_policy': None,
                               'run_as_group': None,
                               'run_as_non_root': None,
                               'run_as_user': None,
                               'se_linux_change_policy': None,
                               'se_linux_options': None,
                               'seccomp_profile': None,
                               'supplemental_groups': None,
                               'supplemental_groups_policy': None,
                               'sysctls': None,
                               'windows_options': None},
          'service_account': 'default',
          'service_account_name': 'default',
          'set_hostname_as_fqdn': None,
          'share_process_namespace': None,
          'subdomain': None,
          'termination_grace_period_seconds': 30,
          'tolerations': [{'effect': 'NoExecute',
                           'key': 'node.kubernetes.io/not-ready',
                           'operator': 'Exists',
                           'toleration_seconds': 300,
                           'value': None},
                          {'effect': 'NoExecute',
                           'key': 'node.kubernetes.io/unreachable',
                           'operator': 'Exists',
                           'toleration_seconds': 300,
                           'value': None}],
          'topology_spread_constraints': None,
          'volumes': [{'aws_elastic_block_store': None,
                       'azure_disk': None,
                       'azure_file': None,
                       'cephfs': None,
                       'cinder': None,
                       'config_map': None,
                       'csi': None,
                       'downward_api': None,
                       'empty_dir': None,
                       'ephemeral': None,
                       'fc': None,
                       'flex_volume': None,
                       'flocker': None,
                       'gce_persistent_disk': None,
                       'git_repo': None,
                       'glusterfs': None,
                       'host_path': None,
                       'image': None,
                       'iscsi': None,
                       'name': 'airflow-ml-model-training-volume',
                       'nfs': None,
                       'persistent_volume_claim': {'claim_name': 'airflow-ml-model-training',
                                                   'read_only': None},
                       'photon_persistent_disk': None,
                       'portworx_volume': None,
                       'projected': None,
                       'quobyte': None,
                       'rbd': None,
                       'scale_io': None,
                       'secret': None,
                       'storageos': None,
                       'vsphere_volume': None},
                      {'aws_elastic_block_store': None,
                       'azure_disk': None,
                       'azure_file': None,
                       'cephfs': None,
                       'cinder': None,
                       'config_map': None,
                       'csi': None,
                       'downward_api': None,
                       'empty_dir': None,
                       'ephemeral': None,
                       'fc': None,
                       'flex_volume': None,
                       'flocker': None,
                       'gce_persistent_disk': None,
                       'git_repo': None,
                       'glusterfs': None,
                       'host_path': None,
                       'image': None,
                       'iscsi': None,
                       'name': 'kube-api-access-clmn4',
                       'nfs': None,
                       'persistent_volume_claim': None,
                       'photon_persistent_disk': None,
                       'portworx_volume': None,
                       'projected': {'default_mode': 420,
                                     'sources': [{'cluster_trust_bundle': None,
                                                  'config_map': None,
                                                  'downward_api': None,
                                                  'secret': None,
                                                  'service_account_token': {'audience': None,
                                                                            'expiration_seconds': 3607,
                                                                            'path': 'token'}},
                                                 {'cluster_trust_bundle': None,
                                                  'config_map': {'items': [{'key': 'ca.crt',
                                                                            'mode': None,
                                                                            'path': 'ca.crt'}],
                                                                 'name': 'kube-root-ca.crt',
                                                                 'optional': None},
                                                  'downward_api': None,
                                                  'secret': None,
                                                  'service_account_token': None},
                                                 {'cluster_trust_bundle': None,
                                                  'config_map': None,
                                                  'downward_api': {'items': [{'field_ref': {'api_version': 'v1',
                                                                                            'field_path': 'metadata.namespace'},
                                                                              'mode': None,
                                                                              'path': 'namespace',
                                                                              'resource_field_ref': None}]},
                                                  'secret': None,
                                                  'service_account_token': None}]},
                       'quobyte': None,
                       'rbd': None,
                       'scale_io': None,
                       'secret': None,
                       'storageos': None,
                       'vsphere_volume': None}]},
 'status': {'conditions': [{'last_probe_time': None,
                            'last_transition_time': datetime.datetime(2025, 9, 4, 14, 53, 5, tzinfo=tzlocal()),
                            'message': None,
                            'reason': None,
                            'status': 'True',
                            'type': 'PodScheduled'}],
            'container_statuses': None,
            'ephemeral_container_statuses': None,
            'host_i_ps': None,
            'host_ip': None,
            'init_container_statuses': None,
            'message': None,
            'nominated_node_name': None,
            'phase': 'Pending',
            'pod_i_ps': None,
            'pod_ip': None,
            'qos_class': 'Guaranteed',
            'reason': None,
            'resize': None,
            'resource_claim_statuses': None,
            'start_time': None}}
[2025-09-04, 14:55:06 UTC] {taskinstance.py:1226} INFO - Marking task as UP_FOR_RETRY. dag_id=example_ml_training_pipeline, task_id=train_model, run_id=manual__2025-09-04T14:45:43.875055+00:00, execution_date=20250904T144543, start_date=20250904T145304, end_date=20250904T145506
[2025-09-04, 14:55:06 UTC] {taskinstance.py:341} ▶ Post task execution logs

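I started the DAG while watching pod status changes with kubectl get pods -w. It looks like the task pod is created and runs, but the pod that the task launches through WMFKubernetesPodOperator (train-model-ybumr3o below) is OOMKilled about 20 seconds after it is created: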

kevinbazira@deploy1003:~$ kube_env airflow-dev dse-k8s-eqiad
kevinbazira@deploy1003:~$ kubectl get pods -w
NAME                                                              READY   STATUS             RESTARTS        AGE
...
example-ml-training-pipeline-train-model-0mkmkk47                 0/1     Pending                 0               0s
example-ml-training-pipeline-train-model-0mkmkk47                 0/1     Pending                 0               0s
example-ml-training-pipeline-train-model-0mkmkk47                 0/1     ContainerCreating       0               1s
example-ml-training-pipeline-train-model-0mkmkk47                 0/1     ContainerCreating       0               13s
example-ml-training-pipeline-train-model-0mkmkk47                 1/1     Running                 0               13s
train-model-ybumr3o                                               0/1     Pending                 0               0s
train-model-ybumr3o                                               0/1     Pending                 0               0s
train-model-ybumr3o                                               0/1     ContainerCreating       0               0s
train-model-ybumr3o                                               0/1     ContainerCreating       0               10s
train-model-ybumr3o                                               1/1     Running                 0               11s
train-model-ybumr3o                                               0/1     OOMKilled               0               20s
train-model-ybumr3o                                               0/1     OOMKilled               0               21s
train-model-ybumr3o                                               0/1     OOMKilled               0               22s
train-model-ybumr3o                                               0/1     OOMKilled               0               23s
train-model-ybumr3o                                               0/1     Terminating             0               23s
train-model-ybumr3o                                               0/1     Terminating             0               23s
...
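
To pick the OOMKilled pod out of a longer watch stream, the output can be filtered on the STATUS column. This is a minimal sketch, assuming the five-column layout shown above (NAME, READY, STATUS, RESTARTS, AGE); the helper name find_oomkilled is hypothetical and the sample lines are copied from the output above.

```python
from typing import List


def find_oomkilled(watch_output: str) -> List[str]:
    """Return the names of pods whose STATUS column reads OOMKilled (hypothetical helper)."""
    names: List[str] = []
    for line in watch_output.splitlines():
        fields = line.split()
        # Assumed column order: NAME READY STATUS RESTARTS AGE
        if len(fields) >= 5 and fields[2] == "OOMKilled" and fields[0] not in names:
            names.append(fields[0])
    return names


# Sample lines copied from the kubectl output in this paste.
sample = """\
example-ml-training-pipeline-train-model-0mkmkk47  1/1  Running    0  13s
train-model-ybumr3o                                1/1  Running    0  11s
train-model-ybumr3o                                0/1  OOMKilled  0  20s
train-model-ybumr3o                                0/1  OOMKilled  0  21s
"""

oom_pods = find_oomkilled(sample)
print(oom_pods)
```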

After setting container_resources for CPU, memory, and GPU, the DAG run succeeded and the training pod could see the GPU (CUDA device count: 1, AMD Radeon Pro WX 9100), as shown below:

example-ml-training-pipeline-train-model-fnlyouii
[2025-09-04, 16:50:17 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-09-04, 16:50:18 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 16:50:18 UTC] {pod.py:1276} INFO - Building pod train-model-bc4ar60 with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T165010.5003430000-03df633e4', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2025-09-04, 16:50:18 UTC] {pod.py:573} INFO - Found matching pod train-model-bc4ar60 with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T165010.5003430000-03df633e4', 'task_id': 'train_model', 'try_number': '1'}
[2025-09-04, 16:50:18 UTC] {pod.py:574} INFO - `try_number` of task_instance: 1
[2025-09-04, 16:50:18 UTC] {pod.py:575} INFO - `try_number` of pod: 1
[2025-09-04, 16:50:18 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-bc4ar60 to dse-k8s-worker1001.eqiad.wmnet from None
[2025-09-04, 16:50:18 UTC] {pod_manager.py:410} ▼ Waiting until 120s to get the POD scheduled...
[2025-09-04, 16:50:18 UTC] {pod_manager.py:431} INFO - Waiting 120s to get the POD running...
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: AttachVolume.Attach succeeded for volume "pvc-8a6a2920-8d7e-4616-8ab6-a6a70b26d116"  from None
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Container image "docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476" already present on machine from spec.containers{base}
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Created container base from spec.containers{base}
[2025-09-04, 16:50:23 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Started container base from spec.containers{base}
[2025-09-04, 16:50:28 UTC] {pod_manager.py:418} ▲▲▲ Log group end
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser  4096 Sep  4 09:37 example_etl_output.parquet
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrws--- 2 root    runuser 16384 Aug 28 12:59 lost+found
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser    85 Sep  4 12:25 model.pkl
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser     0 Sep  4 09:37 test_write.txt
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 3 runuser runuser  4096 Aug 28 13:00 training
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] CUDA is available.
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] CUDA device count: 1
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] CUDA device name: AMD Radeon Pro WX 9100
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] Loaded DataFrame with shape: (200, 4)
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] Dummy model artifact written to /mnt/model-training/model.pkl
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser  4096 Sep  4 09:37 example_etl_output.parquet
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] drwxrws--- 2 root    runuser 16384 Aug 28 12:59 lost+found
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser    85 Sep  4 16:50 model.pkl
[2025-09-04, 16:50:33 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser     0 Sep  4 09:37 test_write.txt
[2025-09-04, 16:50:33 UTC] {pod_manager.py:555} INFO - [base] drwxrwsr-x 3 runuser runuser  4096 Aug 28 13:00 training
[2025-09-04, 16:50:33 UTC] {pod.py:1122} INFO - Deleting pod: train-model-bc4ar60
[2025-09-04, 16:50:33 UTC] {taskinstance.py:341} ▶ Post task execution logs
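
The container_resources values that made the run succeed are not included in this paste. As a rough sketch only, the mapping could be built as below: the helper name build_resources and the cpu and memory amounts are placeholders, not the values actually used. Only the amd.com/gpu count of 1 and the requests-equal-limits shape come from the pod spec dumped earlier (which reports qos_class Guaranteed).

```python
def build_resources(cpu: str, memory: str, gpus: int) -> dict:
    """Build a requests/limits mapping shaped like a container `resources` block.

    Hypothetical helper: with the kubernetes Python client, the same two
    mappings would be passed to V1ResourceRequirements(requests=..., limits=...)
    and handed to the operator as container_resources.
    """
    amounts = {"cpu": cpu, "memory": memory, "amd.com/gpu": str(gpus)}
    # Extended resources such as amd.com/gpu need requests == limits when both
    # are set. Keeping cpu and memory identical as well gives the Guaranteed
    # QoS class reported in the pod spec.
    return {"requests": dict(amounts), "limits": dict(amounts)}


# Placeholder cpu/memory amounts (assumptions); the failing pod spec in this
# paste had 100m CPU and 100Mi memory.
resources = build_resources(cpu="1", memory="2Gi", gpus=1)
print(resources)
```

Raising the memory limit is what addresses the OOMKilled status seen earlier; the right amount depends on what the training script actually loads.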

Huge thanks to @brouberol, @klausman, and @elukey for helping with this <3