Pod started by WMFKubernetesPodOperator has no access to AMD GPUs

Authored By
kevinbazira
Sep 4 2025, 1:49 PM
Size
37 KB

example-ml-training-pipeline-train-model-h0heesi0
▶ Log message source details
[2025-09-04, 13:41:54 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2025-09-04, 13:41:54 UTC] {crypto.py:82} WARNING - empty cryptography key - values will not be stored encrypted.
[2025-09-04, 13:41:54 UTC] {pod.py:1276} INFO - Building pod train-model-u726z1o with labels: {'dag_id': 'example_ml_training_pipeline', 'task_id': 'train_model', 'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2025-09-04, 13:41:54 UTC] {pod.py:573} INFO - Found matching pod train-model-u726z1o with labels {'airflow_kpo_in_cluster': 'True', 'airflow_version': '2.10.5', 'app': 'airflow', 'component': 'task-pod', 'dag_id': 'example_ml_training_pipeline', 'kubernetes_pod_operator': 'True', 'release': 'dev-kevinbazira', 'routed_via': 'dev-kevinbazira', 'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737', 'task_id': 'train_model', 'try_number': '1'}
[2025-09-04, 13:41:54 UTC] {pod.py:574} INFO - `try_number` of task_instance: 1
[2025-09-04, 13:41:54 UTC] {pod.py:575} INFO - `try_number` of pod: 1
[2025-09-04, 13:41:55 UTC] {pod_manager.py:390} INFO - The Pod has an Event: Successfully assigned airflow-dev/train-model-u726z1o to dse-k8s-worker1017.eqiad.wmnet from None
[2025-09-04, 13:41:55 UTC] {pod_manager.py:410} ▶ Waiting until 120s to get the POD scheduled...
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] total 28
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 2 runuser runuser 4096 Sep 4 09:37 example_etl_output.parquet
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] drwxrws--- 2 root runuser 16384 Aug 28 12:59 lost+found
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 85 Sep 4 12:25 model.pkl
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] -rw-rw-r-- 1 runuser runuser 0 Sep 4 09:37 test_write.txt
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] drwxrwsr-x 3 runuser runuser 4096 Aug 28 13:00 training
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] Traceback (most recent call last):
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 40, in <module>
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ROCm (AMD GPU) is available.
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ROCm device count: 0
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] main()
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 20, in main
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] print(f"ROCm device name: {torch.cuda.get_device_name(0)}")
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] return get_device_properties(device).name
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] _lazy_init() # will define _get_device_properties
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
[2025-09-04, 13:42:25 UTC] {pod_manager.py:536} INFO - [base] torch._C._cuda_init()
[2025-09-04, 13:42:26 UTC] {pod_manager.py:555} INFO - [base] RuntimeError: No HIP GPUs are available
[2025-09-04, 13:42:26 UTC] {pod_manager.py:582} WARNING - Pod train-model-u726z1o log read interrupted but container base still running. Logs generated in the last one second might get duplicated.
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] Traceback (most recent call last):
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 40, in <module>
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ROCm (AMD GPU) is available.
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ROCm device count: 0
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] main()
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/srv/example/training/example/train/src/example/train_model.py", line 20, in main
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] print(f"ROCm device name: {torch.cuda.get_device_name(0)}")
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 493, in get_device_name
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] return get_device_properties(device).name
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 523, in get_device_properties
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] _lazy_init() # will define _get_device_properties
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] ^^^^^^^^^^^^
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] File "/opt/lib/python/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
[2025-09-04, 13:42:27 UTC] {pod_manager.py:536} INFO - [base] torch._C._cuda_init()
[2025-09-04, 13:42:27 UTC] {pod_manager.py:555} INFO - [base] RuntimeError: No HIP GPUs are available
[2025-09-04, 13:42:28 UTC] {pod_manager.py:714} INFO - Pod train-model-u726z1o has phase Running
[2025-09-04, 13:42:30 UTC] {pod.py:1122} INFO - Deleting pod: train-model-u726z1o
[2025-09-04, 13:42:30 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
    return self.execute_sync(context)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
    self.cleanup(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod train-model-u726z1o returned a failure.
remote_pod: {'api_version': 'v1',
'kind': 'Pod',
'metadata': {'annotations': {'cni.projectcalico.org/containerID': '511bb84917686605961322ad194937cd6d0807ecd75c88b651aa03bbabc5be92',
'cni.projectcalico.org/podIP': '',
'cni.projectcalico.org/podIPs': '',
'container.seccomp.security.alpha.kubernetes.io/base': 'runtime/default'},
'creation_timestamp': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal()),
'deletion_grace_period_seconds': None,
'deletion_timestamp': None,
'finalizers': None,
'generate_name': None,
'generation': None,
'labels': {'airflow_kpo_in_cluster': 'True',
'airflow_version': '2.10.5',
'app': 'airflow',
'component': 'task-pod',
'dag_id': 'example_ml_training_pipeline',
'kubernetes_pod_operator': 'True',
'release': 'dev-kevinbazira',
'routed_via': 'dev-kevinbazira',
'run_id': 'manual__2025-09-04T134148.5134530000-ac5d4c737',
'task_id': 'train_model',
'try_number': '1'},
'managed_fields': [{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:labels': {'.': {},
'f:airflow_kpo_in_cluster': {},
'f:airflow_version': {},
'f:app': {},
'f:component': {},
'f:dag_id': {},
'f:kubernetes_pod_operator': {},
'f:release': {},
'f:routed_via': {},
'f:run_id': {},
'f:task_id': {},
'f:try_number': {}}},
'f:spec': {'f:affinity': {},
'f:containers': {'k:{"name":"base"}': {'.': {},
'f:args': {},
'f:command': {},
'f:env': {'.': {},
'k:{"name":"AWS_REQUEST_CHECKSUM_CALCULATION"}': {'.': {},
'f:name': {},
'f:value': {}},
'k:{"name":"AWS_RESPONSE_CHECKSUM_VALIDATION"}': {'.': {},
'f:name': {},
'f:value': {}},
'k:{"name":"REQUESTS_CA_BUNDLE"}': {'.': {},
'f:name': {},
'f:value': {}}},
'f:image': {},
'f:imagePullPolicy': {},
'f:name': {},
'f:resources': {'.': {},
'f:limits': {'.': {},
'f:cpu': {},
'f:memory': {}},
'f:requests': {'.': {},
'f:cpu': {},
'f:memory': {}}},
'f:securityContext': {'.': {},
'f:allowPrivilegeEscalation': {},
'f:capabilities': {'.': {},
'f:drop': {}},
'f:runAsNonRoot': {},
'f:seccompProfile': {'.': {},
'f:type': {}}},
'f:terminationMessagePath': {},
'f:terminationMessagePolicy': {},
'f:volumeMounts': {'.': {},
'k:{"mountPath":"/mnt/model-training"}': {'.': {},
'f:mountPath': {},
'f:name': {}}}}},
'f:dnsPolicy': {},
'f:enableServiceLinks': {},
'f:priorityClassName': {},
'f:restartPolicy': {},
'f:schedulerName': {},
'f:securityContext': {},
'f:terminationGracePeriodSeconds': {},
'f:volumes': {'.': {},
'k:{"name":"airflow-ml-model-training-volume"}': {'.': {},
'f:name': {},
'f:persistentVolumeClaim': {'.': {},
'f:claimName': {}}}}}},
'manager': 'OpenAPI-Generator',
'operation': 'Update',
'subresource': None,
'time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal())},
{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:metadata': {'f:annotations': {'f:cni.projectcalico.org/containerID': {},
'f:cni.projectcalico.org/podIP': {},
'f:cni.projectcalico.org/podIPs': {}}}},
'manager': 'Go-http-client',
'operation': 'Update',
'subresource': 'status',
'time': datetime.datetime(2025, 9, 4, 13, 42, 3, tzinfo=tzlocal())},
{'api_version': 'v1',
'fields_type': 'FieldsV1',
'fields_v1': {'f:status': {'f:conditions': {'k:{"type":"ContainersReady"}': {'.': {},
'f:lastProbeTime': {},
'f:lastTransitionTime': {},
'f:reason': {},
'f:status': {},
'f:type': {}},
'k:{"type":"Initialized"}': {'.': {},
'f:lastProbeTime': {},
'f:lastTransitionTime': {},
'f:status': {},
'f:type': {}},
'k:{"type":"Ready"}': {'.': {},
'f:lastProbeTime': {},
'f:lastTransitionTime': {},
'f:reason': {},
'f:status': {},
'f:type': {}}},
'f:containerStatuses': {},
'f:hostIP': {},
'f:phase': {},
'f:podIP': {},
'f:podIPs': {'.': {},
'k:{"ip":"10.67.30.60"}': {'.': {},
'f:ip': {}},
'k:{"ip":"2620:0:861:302:677:6c3a:553b:5fc"}': {'.': {},
'f:ip': {}}},
'f:startTime': {}}},
'manager': 'kubelet',
'operation': 'Update',
'subresource': 'status',
'time': datetime.datetime(2025, 9, 4, 13, 42, 27, tzinfo=tzlocal())}],
'name': 'train-model-u726z1o',
'namespace': 'airflow-dev',
'owner_references': None,
'resource_version': '745183545',
'self_link': None,
'uid': '6a3ae65b-7f38-4a48-b9ac-fb419e847320'},
'spec': {'active_deadline_seconds': None,
'affinity': {'node_affinity': None,
'pod_affinity': None,
'pod_anti_affinity': None},
'automount_service_account_token': None,
'containers': [{'args': ['ls -l /mnt/model-training && python3 '
'training/example/train/src/example/train_model.py '
'--input-path '
'/mnt/model-training/example_etl_output.parquet '
'--output-model-path '
'/mnt/model-training/model.pkl && ls -l '
'/mnt/model-training'],
'command': ['bash', '-c'],
'env': [{'name': 'REQUESTS_CA_BUNDLE',
'value': '/etc/ssl/certs/ca-certificates.crt',
'value_from': None},
{'name': 'AWS_REQUEST_CHECKSUM_CALCULATION',
'value': 'WHEN_REQUIRED',
'value_from': None},
{'name': 'AWS_RESPONSE_CHECKSUM_VALIDATION',
'value': 'WHEN_REQUIRED',
'value_from': None}],
'env_from': None,
'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476',
'image_pull_policy': 'IfNotPresent',
'lifecycle': None,
'liveness_probe': None,
'name': 'base',
'ports': None,
'readiness_probe': None,
'resize_policy': None,
'resources': {'claims': None,
'limits': {'cpu': '2', 'memory': '3Gi'},
'requests': {'cpu': '1',
'memory': '1500Mi'}},
'restart_policy': None,
'security_context': {'allow_privilege_escalation': False,
'app_armor_profile': None,
'capabilities': {'add': None,
'drop': ['ALL']},
'privileged': None,
'proc_mount': None,
'read_only_root_filesystem': None,
'run_as_group': None,
'run_as_non_root': True,
'run_as_user': None,
'se_linux_options': None,
'seccomp_profile': {'localhost_profile': None,
'type': 'RuntimeDefault'},
'windows_options': None},
'startup_probe': None,
'stdin': None,
'stdin_once': None,
'termination_message_path': '/dev/termination-log',
'termination_message_policy': 'File',
'tty': None,
'volume_devices': None,
'volume_mounts': [{'mount_path': '/mnt/model-training',
'mount_propagation': None,
'name': 'airflow-ml-model-training-volume',
'read_only': None,
'recursive_read_only': None,
'sub_path': None,
'sub_path_expr': None},
{'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount',
'mount_propagation': None,
'name': 'kube-api-access-t2jw5',
'read_only': True,
'recursive_read_only': None,
'sub_path': None,
'sub_path_expr': None}],
'working_dir': None}],
'dns_config': None,
'dns_policy': 'ClusterFirst',
'enable_service_links': True,
'ephemeral_containers': None,
'host_aliases': None,
'host_ipc': None,
'host_network': None,
'host_pid': None,
'host_users': None,
'hostname': None,
'image_pull_secrets': None,
'init_containers': None,
'node_name': 'dse-k8s-worker1017.eqiad.wmnet',
'node_selector': None,
'os': None,
'overhead': None,
'preemption_policy': 'PreemptLowerPriority',
'priority': -100,
'priority_class_name': 'low-priority-pod',
'readiness_gates': None,
'resource_claims': None,
'resources': None,
'restart_policy': 'Never',
'runtime_class_name': None,
'scheduler_name': 'default-scheduler',
'scheduling_gates': None,
'security_context': {'app_armor_profile': None,
'fs_group': None,
'fs_group_change_policy': None,
'run_as_group': None,
'run_as_non_root': None,
'run_as_user': None,
'se_linux_change_policy': None,
'se_linux_options': None,
'seccomp_profile': None,
'supplemental_groups': None,
'supplemental_groups_policy': None,
'sysctls': None,
'windows_options': None},
'service_account': 'default',
'service_account_name': 'default',
'set_hostname_as_fqdn': None,
'share_process_namespace': None,
'subdomain': None,
'termination_grace_period_seconds': 30,
'tolerations': [{'effect': 'NoExecute',
'key': 'node.kubernetes.io/not-ready',
'operator': 'Exists',
'toleration_seconds': 300,
'value': None},
{'effect': 'NoExecute',
'key': 'node.kubernetes.io/unreachable',
'operator': 'Exists',
'toleration_seconds': 300,
'value': None}],
'topology_spread_constraints': None,
'volumes': [{'aws_elastic_block_store': None,
'azure_disk': None,
'azure_file': None,
'cephfs': None,
'cinder': None,
'config_map': None,
'csi': None,
'downward_api': None,
'empty_dir': None,
'ephemeral': None,
'fc': None,
'flex_volume': None,
'flocker': None,
'gce_persistent_disk': None,
'git_repo': None,
'glusterfs': None,
'host_path': None,
'image': None,
'iscsi': None,
'name': 'airflow-ml-model-training-volume',
'nfs': None,
'persistent_volume_claim': {'claim_name': 'airflow-ml-model-training',
'read_only': None},
'photon_persistent_disk': None,
'portworx_volume': None,
'projected': None,
'quobyte': None,
'rbd': None,
'scale_io': None,
'secret': None,
'storageos': None,
'vsphere_volume': None},
{'aws_elastic_block_store': None,
'azure_disk': None,
'azure_file': None,
'cephfs': None,
'cinder': None,
'config_map': None,
'csi': None,
'downward_api': None,
'empty_dir': None,
'ephemeral': None,
'fc': None,
'flex_volume': None,
'flocker': None,
'gce_persistent_disk': None,
'git_repo': None,
'glusterfs': None,
'host_path': None,
'image': None,
'iscsi': None,
'name': 'kube-api-access-t2jw5',
'nfs': None,
'persistent_volume_claim': None,
'photon_persistent_disk': None,
'portworx_volume': None,
'projected': {'default_mode': 420,
'sources': [{'cluster_trust_bundle': None,
'config_map': None,
'downward_api': None,
'secret': None,
'service_account_token': {'audience': None,
'expiration_seconds': 3607,
'path': 'token'}},
{'cluster_trust_bundle': None,
'config_map': {'items': [{'key': 'ca.crt',
'mode': None,
'path': 'ca.crt'}],
'name': 'kube-root-ca.crt',
'optional': None},
'downward_api': None,
'secret': None,
'service_account_token': None},
{'cluster_trust_bundle': None,
'config_map': None,
'downward_api': {'items': [{'field_ref': {'api_version': 'v1',
'field_path': 'metadata.namespace'},
'mode': None,
'path': 'namespace',
'resource_field_ref': None}]},
'secret': None,
'service_account_token': None}]},
'quobyte': None,
'rbd': None,
'scale_io': None,
'secret': None,
'storageos': None,
'vsphere_volume': None}]},
'status': {'conditions': [{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'Initialized'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 42, 27, tzinfo=tzlocal()),
'message': None,
'reason': 'PodFailed',
'status': 'False',
'type': 'Ready'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 42, 27, tzinfo=tzlocal()),
'message': None,
'reason': 'PodFailed',
'status': 'False',
'type': 'ContainersReady'},
{'last_probe_time': None,
'last_transition_time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal()),
'message': None,
'reason': None,
'status': 'True',
'type': 'PodScheduled'}],
'container_statuses': [{'allocated_resources': None,
'allocated_resources_status': None,
'container_id': 'containerd://cc4317edd3504105a422701df2822c10ba9b96c335e929b87af20bedfdb782b8',
'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-605476',
'image_id': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines@sha256:65f31f2bf9cf539521a2a67dec6352dcf59749eeaeec6918c41db6efa595250a',
'last_state': {'running': None,
'terminated': None,
'waiting': None},
'name': 'base',
'ready': False,
'resources': None,
'restart_count': 0,
'started': False,
'state': {'running': None,
'terminated': {'container_id': 'containerd://cc4317edd3504105a422701df2822c10ba9b96c335e929b87af20bedfdb782b8',
'exit_code': 1,
'finished_at': datetime.datetime(2025, 9, 4, 13, 42, 26, tzinfo=tzlocal()),
'message': None,
'reason': 'Error',
'signal': None,
'started_at': datetime.datetime(2025, 9, 4, 13, 42, 18, tzinfo=tzlocal())},
'waiting': None},
'user': None,
'volume_mounts': None}],
'ephemeral_container_statuses': None,
'host_i_ps': None,
'host_ip': '10.64.16.210',
'init_container_statuses': None,
'message': None,
'nominated_node_name': None,
'phase': 'Failed',
'pod_i_ps': [{'ip': '10.67.30.60'},
{'ip': '2620:0:861:302:677:6c3a:553b:5fc'}],
'pod_ip': '10.67.30.60',
'qos_class': 'Burstable',
'reason': None,
'resize': None,
'resource_claim_statuses': None,
'start_time': datetime.datetime(2025, 9, 4, 13, 41, 54, tzinfo=tzlocal())}}
[2025-09-04, 13:42:30 UTC] {taskinstance.py:1226} INFO - Marking task as UP_FOR_RETRY. dag_id=example_ml_training_pipeline, task_id=train_model, run_id=manual__2025-09-04T13:41:48.513453+00:00, execution_date=20250904T134148, start_date=20250904T134154, end_date=20250904T134230
[2025-09-04, 13:42:30 UTC] {taskinstance.py:341} ▶ Post task execution logs
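
In the remote_pod dump above, the `base` container's `resources.limits` lists only `cpu` and `memory`. That would be consistent with the container being started without any GPU device attached, although the log alone does not prove this is the cause of `No HIP GPUs are available`. The following is a minimal sketch of how such a dump could be checked for a GPU request. The helper `containers_missing_gpu` is hypothetical, and the resource name `amd.com/gpu` is an assumption (the name commonly advertised by the AMD Kubernetes device plugin); confirm what the target cluster actually exposes.

```python
from typing import Any, List, Mapping

# Assumption: GPUs are advertised under this extended resource name.
# Verify against the node's allocatable resources before relying on it.
GPU_RESOURCE = "amd.com/gpu"


def containers_missing_gpu(
    pod: Mapping[str, Any], resource: str = GPU_RESOURCE
) -> List[str]:
    """Return the names of containers whose limits do not include `resource`."""
    missing = []
    containers = (pod.get("spec") or {}).get("containers") or []
    for container in containers:
        # Any of these keys may be None in a serialized pod object.
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        if resource not in limits:
            missing.append(container.get("name", "<unnamed>"))
    return missing


# Trimmed copy of the container entry from the remote_pod dump above.
pod_from_log = {
    "spec": {
        "containers": [
            {
                "name": "base",
                "resources": {
                    "claims": None,
                    "limits": {"cpu": "2", "memory": "3Gi"},
                    "requests": {"cpu": "1", "memory": "1500Mi"},
                },
            }
        ]
    }
}

print(containers_missing_gpu(pod_from_log))
```

For the pod in this paste the helper reports the `base` container, since neither its limits nor its requests mention a GPU resource. A pod whose limits included the GPU resource would yield an empty list.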
