Orchestrate end-to-end tone-check pipeline using the TriggerDagRunOperator
Closed, ResolvedPublic

Description

Combine the three tone-check DAGs (data-generation, data-copying, and model-training) into a single, end-to-end orchestrated pipeline using the TriggerDagRunOperator.

In this task we will configure:

  1. tone_check_data_generation_dag to trigger the tone_check_copy_hdfs_to_pvc_dag upon completion, passing a timestamped HDFS data path via its configuration.
  2. tone_check_copy_hdfs_to_pvc_dag to read the passed HDFS path, create a corresponding versioned path on the PVC, perform the copy, and then trigger the tone_check_retrain_dag.
  3. tone_check_retrain_dag to read the passed PVC path, which contains the training data, and also use it as the output path for the trained model.
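The hand-off described in the steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the merge requests: the conf key `hdfs_path`, the helper names, the task id, and the use of the built-in `ts_nodash` template variable to build the timestamped path are all assumptions. The DAG id is taken from the task description and may differ from the deployed one.

```python
import posixpath

# Hypothetical names: the conf key and PVC mount point are illustrative assumptions.
HDFS_PATH_KEY = "hdfs_path"
PVC_MOUNT = "/mnt/model-training"
# `ts_nodash` is a built-in Airflow template variable, e.g. 20251003T102031.
HDFS_PATH_TEMPLATE = "/tmp/ml/tone_check/{{ ts_nodash }}"


def pvc_path_for(hdfs_path: str, pvc_mount: str = PVC_MOUNT) -> str:
    """Map a timestamped HDFS directory to a versioned directory on the PVC."""
    version = posixpath.basename(hdfs_path.rstrip("/"))
    if not version:
        raise ValueError(f"cannot derive a version from {hdfs_path!r}")
    return posixpath.join(pvc_mount, version)


def make_trigger_copy_task(dag):
    """Build the task that starts the copy DAG and hands it the HDFS path."""
    # Imported lazily so the helper above stays usable without Airflow installed.
    from airflow.operators.trigger_dagrun import TriggerDagRunOperator

    return TriggerDagRunOperator(
        task_id="trigger_copy_hdfs_to_pvc",  # hypothetical task id
        trigger_dag_id="tone_check_copy_hdfs_to_pvc_dag",
        conf={HDFS_PATH_KEY: HDFS_PATH_TEMPLATE},  # rendered by Jinja at run time
        dag=dag,
    )
```

In the triggered DAG, the value would then be available as `{{ dag_run.conf['hdfs_path'] }}` in templated fields, and the same pattern could be repeated to pass the PVC path on to the retrain DAG.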

Details

Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
ml: Enable deferrable execution of training operator to handle resource contention | repos/data-engineering/airflow-dags!1731 | kevinbazira | fix_tone_check_copy_hdfs_to_pvc_dag | main
ml: test TriggerDagRunOperator with stripped down tone-check DAGs | repos/data-engineering/airflow-dags!1726 | kevinbazira | fix_tone_check_copy_hdfs_to_pvc_dag | main
ml: orchestrate DAGs with TriggerDagRunOperator | repos/data-engineering/airflow-dags!1725 | kevinbazira | use_prod_hdfs_path | main
Event Timeline

After updating the three DAGs to run as a single orchestrated pipeline using the TriggerDagRunOperator, I triggered the pipeline and the tone_check_data_generation_dag completed successfully:

$ hdfs dfs -ls /tmp/ml/tone_check/20251003T102031
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 4 items
drwxr-x---   - analytics-ml analytics-privatedata-users          0 2025-10-03 14:25 /tmp/ml/tone_check/20251003T102031/full_model_ready_data.parquet
drwxr-x---   - analytics-ml analytics-privatedata-users          0 2025-10-03 16:29 /tmp/ml/tone_check/20251003T102031/test_data.parquet
drwxr-x---   - analytics-ml analytics-privatedata-users          0 2025-10-03 16:28 /tmp/ml/tone_check/20251003T102031/train_data.parquet
drwxr-x---   - analytics-ml analytics-privatedata-users          0 2025-10-03 16:29 /tmp/ml/tone_check/20251003T102031/validation_data.parque

The data generation DAG triggered the copy DAG but the tone_check_copy_hdfs_to_pvc_dag failed with the error below:

[2025-10-03, 16:58:33 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 680, in execute_sync
    self.await_pod_start(pod=self.pod)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 619, in await_pod_start
    loop.run_until_complete(events_task)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 387, in watch_pod_events
    events = self.read_pod_events(pod)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/tenacity/__init__.py", line 338, in wrapped_f
    return copy(f, *args, **kw)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/tenacity/__init__.py", line 477, in __call__
    do = self.iter(retry_state=retry_state)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/tenacity/__init__.py", line 378, in iter
    result = action(retry_state)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/tenacity/__init__.py", line 420, in exc_check
    raise retry_exc.reraise()
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/tenacity/__init__.py", line 187, in reraise
    raise self.last_attempt.result()
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/tenacity/__init__.py", line 480, in __call__
    result = fn(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 814, in read_pod_events
    return self._client.list_namespaced_event(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15461, in list_namespaced_event
    return self.list_namespaced_event_with_http_info(namespace, **kwargs) # noqa: E501
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/api/core_v1_api.py", line 15580, in list_namespaced_event_with_http_info
    return self.api_client.call_api(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/rest.py", line 244, in GET
    return self.request("GET", url,
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/kubernetes/client/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '8f2e38fe-e84c-4464-87b7-ee511b7962cf', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'c61211d6-8e62-4684-8f2d-3137b5d5c5dd', 'X-Kubernetes-Pf-Prioritylevel-Uid': '0b1de1a9-d62d-4e1d-8b6d-75ef91c83d5f', 'Date': 'Fri, 03 Oct 2025 16:58:23 GMT', 'Content-Length': '294'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"events is forbidden: User \"system:serviceaccount:airflow-ml:airflow\" cannot list resource \"events\" in API group \"\" in the namespace \"airflow-ml\"","reason":"Forbidden","details":{"kind":"events"},"code":403}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 640, in execute
    return self.execute_sync(context)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 721, in execute_sync
    self.cleanup(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 1053, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod copy-hdfs-to-pvc-n1hn33t returned a failure.
remote_pod: {'api_version': None,
 'kind': None,
 'metadata': {'annotations': {'container.seccomp.security.alpha.kubernetes.io/base': 'runtime/default'},
 'creation_timestamp': datetime.datetime(2025, 10, 3, 16, 58, 20, tzinfo=tzlocal()),
 'deletion_grace_period_seconds': None,
 'deletion_timestamp': None,
 'finalizers': None,
 'generate_name': None,
 'generation': None,
 'labels': {'airflow_kpo_in_cluster': 'True',
 'airflow_version': '2.10.5',
 'app': 'airflow',
 'component': 'task-pod',
 'dag_id': 'tone_check_copy_hdfs_to_pvc_pipeline',
 'kubernetes_pod_operator': 'True',
 'release': 'production',
 'routed_via': 'production',
 'run_id': 'manual__2025-10-03T163006.0701010000-85a29e92c',
 'task_id': 'copy_hdfs_to_pvc',
 'try_number': '6'},
 'managed_fields': [{'api_version': 'v1',
 'fields_type': 'FieldsV1',
 'fields_v1': {'f:metadata': {'f:labels': {'.': {},
 'f:airflow_kpo_in_cluster': {},
 'f:airflow_version': {},
 'f:app': {},
 'f:component': {},
 'f:dag_id': {},
 'f:kubernetes_pod_operator': {},
 'f:release': {},
 'f:routed_via': {},
 'f:run_id': {},
 'f:task_id': {},
 'f:try_number': {}}},
 'f:spec': {'f:affinity': {},
 'f:containers': {'k:{"name":"base"}': {'.': {},
 'f:args': {},
 'f:command': {},
 'f:env': {'.': {},
 'k:{"name":"AWS_REQUEST_CHECKSUM_CALCULATION"}': {'.': {},
 'f:name': {},
 'f:value': {}},
 'k:{"name":"AWS_RESPONSE_CHECKSUM_VALIDATION"}': {'.': {},
 'f:name': {},
 'f:value': {}},
 'k:{"name":"HADOOP_CONF_DIR"}': {'.': {},
 'f:name': {},
 'f:value': {}},
 'k:{"name":"HIVE_CONF_DIR"}': {'.': {},
 'f:name': {},
 'f:value': {}},
 'k:{"name":"KRB5CCNAME"}': {'.': {},
 'f:name': {},
 'f:value': {}},
 'k:{"name":"KRB5_CONFIG"}': {'.': {},
 'f:name': {},
 'f:value': {}},
 'k:{"name":"REQUESTS_CA_BUNDLE"}': {'.': {},
 'f:name': {},
 'f:value': {}}},
 'f:image': {},
 'f:imagePullPolicy': {},
 'f:name': {},
 'f:resources': {'.': {},
 'f:limits': {'.': {},
 'f:cpu': {},
 'f:memory': {}},
 'f:requests': {'.': {},
 'f:cpu': {},
 'f:memory': {}}},
 'f:securityContext': {'.': {},
 'f:allowPrivilegeEscalation': {},
 'f:capabilities': {'.': {},
 'f:drop': {}},
 'f:runAsNonRoot': {},
 'f:seccompProfile': {'.': {},
 'f:type': {}}},
 'f:terminationMessagePath': {},
 'f:terminationMessagePolicy': {},
 'f:volumeMounts': {'.': {},
 'k:{"mountPath":"/etc/hadoop/conf"}': {'.': {},
 'f:mountPath': {},
 'f:name': {}},
 'k:{"mountPath":"/etc/krb5.conf"}': {'.': {},
 'f:mountPath': {},
 'f:name': {},
 'f:subPath': {}},
 'k:{"mountPath":"/mnt/model-training"}': {'.': {},
 'f:mountPath': {},
 'f:name': {}},
 'k:{"mountPath":"/tmp/airflow_krb5_ccache"}': {'.': {},
 'f:mountPath': {},
 'f:name': {}}}}},
 'f:dnsPolicy': {},
 'f:enableServiceLinks': {},
 'f:priorityClassName': {},
 'f:restartPolicy': {},
 'f:schedulerName': {},
 'f:securityContext': {'.': {},
 'f:fsGroup': {}},
 'f:terminationGracePeriodSeconds': {},
 'f:volumes': {'.': {},
 'k:{"name":"airflow-hadoop-configuration"}': {'.': {},
 'f:configMap': {'.': {},
 'f:defaultMode': {},
 'f:name': {}},
 'f:name': {}},
 'k:{"name":"airflow-kerberos-client-config"}': {'.': {},
 'f:configMap': {'.': {},
 'f:defaultMode': {},
 'f:name': {}},
 'f:name': {}},
 'k:{"name":"airflow-kerberos-token"}': {'.': {},
 'f:name': {},
 'f:persistentVolumeClaim': {'.': {},
 'f:claimName': {}}},
 'k:{"name":"airflow-ml-model-training-volume"}': {'.': {},
 'f:name': {},
 'f:persistentVolumeClaim': {'.': {},
 'f:claimName': {}}}}}},
 'manager': 'OpenAPI-Generator',
 'operation': 'Update',
 'subresource': None,
 'time': datetime.datetime(2025, 10, 3, 16, 58, 20, tzinfo=tzlocal())}],
 'name': 'copy-hdfs-to-pvc-n1hn33t',
 'namespace': 'airflow-ml',
 'owner_references': None,
 'resource_version': '803348143',
 'self_link': None,
 'uid': '66835c27-1fe9-4626-b9df-b90177b1849d'},
 'spec': {'active_deadline_seconds': None,
 'affinity': {'node_affinity': None,
 'pod_affinity': None,
 'pod_anti_affinity': None},
 'automount_service_account_token': None,
 'containers': [{'args': ['\n'
 "echo 'Listing contents of PVC mount: "
 "/mnt/model-training'\n"
 'ls -l /mnt/model-training\n'
 "echo 'Listing contents of HDFS input path: "
 "/tmp/ml/tone_check/20251003T102031'\n"
 'hdfs dfs -ls '
 '/tmp/ml/tone_check/20251003T102031\n'
 "echo 'Creating local output directory if "
 "missing: /mnt/model-training/'\n"
 'mkdir -p /mnt/model-training/\n'
 "echo 'Copying from HDFS to PVC: "
 '/tmp/ml/tone_check/20251003T102031 -> '
 "/mnt/model-training/'\n"
 'hdfs dfs -get '
 '/tmp/ml/tone_check/20251003T102031 '
 '/mnt/model-training/\n'
 "echo 'Listing contents of local output "
 "after copy: /mnt/model-training/'\n"
 'ls -l /mnt/model-training/'],
 'command': ['bash', '-c'],
 'env': [{'name': 'REQUESTS_CA_BUNDLE',
 'value': '/etc/ssl/certs/ca-certificates.crt',
 'value_from': None},
 {'name': 'AWS_REQUEST_CHECKSUM_CALCULATION',
 'value': 'WHEN_REQUIRED',
 'value_from': None},
 {'name': 'AWS_RESPONSE_CHECKSUM_VALIDATION',
 'value': 'WHEN_REQUIRED',
 'value_from': None},
 {'name': 'HADOOP_CONF_DIR',
 'value': '/etc/hadoop/conf',
 'value_from': None},
 {'name': 'HIVE_CONF_DIR',
 'value': '/etc/hadoop/conf',
 'value_from': None},
 {'name': 'KRB5CCNAME',
 'value': '/tmp/airflow_krb5_ccache/krb5cc',
 'value_from': None},
 {'name': 'KRB5_CONFIG',
 'value': '/etc/krb5.conf',
 'value_from': None}],
 'env_from': None,
 'image': 'docker-registry.discovery.wmnet/repos/machine-learning/ml-pipelines:job-604192',
 'image_pull_policy': 'IfNotPresent',
 'lifecycle': None,
 'liveness_probe': None,
 'name': 'base',
 'ports': None,
 'readiness_probe': None,
 'resize_policy': None,
 'resources': {'claims': None,
 'limits': {'cpu': '2', 'memory': '3Gi'},
 'requests': {'cpu': '1',
 'memory': '1500Mi'}},
 'restart_policy': None,
 'security_context': {'allow_privilege_escalation': False,
 'app_armor_profile': None,
 'capabilities': {'add': None,
 'drop': ['ALL']},
 'privileged': None,
 'proc_mount': None,
 'read_only_root_filesystem': None,
 'run_as_group': None,
 'run_as_non_root': True,
 'run_as_user': None,
 'se_linux_options': None,
 'seccomp_profile': {'localhost_profile': None,
 'type': 'RuntimeDefault'},
 'windows_options': None},
 'startup_probe': None,
 'stdin': None,
 'stdin_once': None,
 'termination_message_path': '/dev/termination-log',
 'termination_message_policy': 'File',
 'tty': None,
 'volume_devices': None,
 'volume_mounts': [{'mount_path': '/etc/hadoop/conf',
 'mount_propagation': None,
 'name': 'airflow-hadoop-configuration',
 'read_only': None,
 'recursive_read_only': None,
 'sub_path': None,
 'sub_path_expr': None},
 {'mount_path': '/etc/krb5.conf',
 'mount_propagation': None,
 'name': 'airflow-kerberos-client-config',
 'read_only': None,
 'recursive_read_only': None,
 'sub_path': 'krb5.conf',
 'sub_path_expr': None},
 {'mount_path': '/tmp/airflow_krb5_ccache',
 'mount_propagation': None,
 'name': 'airflow-kerberos-token',
 'read_only': None,
 'recursive_read_only': None,
 'sub_path': None,
 'sub_path_expr': None},
 {'mount_path': '/mnt/model-training',
 'mount_propagation': None,
 'name': 'airflow-ml-model-training-volume',
 'read_only': None,
 'recursive_read_only': None,
 'sub_path': None,
 'sub_path_expr': None},
 {'mount_path': '/var/run/secrets/kubernetes.io/serviceaccount',
 'mount_propagation': None,
 'name': 'kube-api-access-gw959',
 'read_only': True,
 'recursive_read_only': None,
 'sub_path': None,
 'sub_path_expr': None}],
 'working_dir': None}],
 'dns_config': None,
 'dns_policy': 'ClusterFirst',
 'enable_service_links': True,
 'ephemeral_containers': None,
 'host_aliases': None,
 'host_ipc': None,
 'host_network': None,
 'host_pid': None,
 'host_users': None,
 'hostname': None,
 'image_pull_secrets': None,
 'init_containers': None,
 'node_name': 'dse-k8s-worker1012.eqiad.wmnet',
 'node_selector': None,
 'os': None,
 'overhead': None,
 'preemption_policy': 'PreemptLowerPriority',
 'priority': -100,
 'priority_class_name': 'low-priority-pod',
 'readiness_gates': None,
 'resource_claims': None,
 'resources': None,
 'restart_policy': 'Never',
 'runtime_class_name': None,
 'scheduler_name': 'default-scheduler',
 'scheduling_gates': None,
 'security_context': {'app_armor_profile': None,
 'fs_group': 900,
 'fs_group_change_policy': None,
 'run_as_group': None,
 'run_as_non_root': None,
 'run_as_user': None,
 'se_linux_change_policy': None,
 'se_linux_options': None,
 'seccomp_profile': None,
 'supplemental_groups': None,
 'supplemental_groups_policy': None,
 'sysctls': None,
 'windows_options': None},
 'service_account': 'default',
 'service_account_name': 'default',
 'set_hostname_as_fqdn': None,
 'share_process_namespace': None,
 'subdomain': None,
 'termination_grace_period_seconds': 30,
 'tolerations': [{'effect': 'NoExecute',
 'key': 'node.kubernetes.io/not-ready',
 'operator': 'Exists',
 'toleration_seconds': 300,
 'value': None},
 {'effect': 'NoExecute',
 'key': 'node.kubernetes.io/unreachable',
 'operator': 'Exists',
 'toleration_seconds': 300,
 'value': None}],
 'topology_spread_constraints': None,
 'volumes': [{'aws_elastic_block_store': None,
 'azure_disk': None,
 'azure_file': None,
 'cephfs': None,
 'cinder': None,
 'config_map': {'default_mode': 420,
 'items': None,
 'name': 'airflow-hadoop-configuration',
 'optional': None},
 'csi': None,
 'downward_api': None,
 'empty_dir': None,
 'ephemeral': None,
 'fc': None,
 'flex_volume': None,
 'flocker': None,
 'gce_persistent_disk': None,
 'git_repo': None,
 'glusterfs': None,
 'host_path': None,
 'image': None,
 'iscsi': None,
 'name': 'airflow-hadoop-configuration',
 'nfs': None,
 'persistent_volume_claim': None,
 'photon_persistent_disk': None,
 'portworx_volume': None,
 'projected': None,
 'quobyte': None,
 'rbd': None,
 'scale_io': None,
 'secret': None,
 'storageos': None,
 'vsphere_volume': None},
 {'aws_elastic_block_store': None,
 'azure_disk': None,
 'azure_file': None,
 'cephfs': None,
 'cinder': None,
 'config_map': {'default_mode': 420,
 'items': None,
 'name': 'airflow-kerberos-client-config',
 'optional': None},
 'csi': None,
 'downward_api': None,
 'empty_dir': None,
 'ephemeral': None,
 'fc': None,
 'flex_volume': None,
 'flocker': None,
 'gce_persistent_disk': None,
 'git_repo': None,
 'glusterfs': None,
 'host_path': None,
 'image': None,
 'iscsi': None,
 'name': 'airflow-kerberos-client-config',
 'nfs': None,
 'persistent_volume_claim': None,
 'photon_persistent_disk': None,
 'portworx_volume': None,
 'projected': None,
 'quobyte': None,
 'rbd': None,
 'scale_io': None,
 'secret': None,
 'storageos': None,
 'vsphere_volume': None},
 {'aws_elastic_block_store': None,
 'azure_disk': None,
 'azure_file': None,
 'cephfs': None,
 'cinder': None,
 'config_map': None,
 'csi': None,
 'downward_api': None,
 'empty_dir': None,
 'ephemeral': None,
 'fc': None,
 'flex_volume': None,
 'flocker': None,
 'gce_persistent_disk': None,
 'git_repo': None,
 'glusterfs': None,
 'host_path': None,
 'image': None,
 'iscsi': None,
 'name': 'airflow-kerberos-token',
 'nfs': None,
 'persistent_volume_claim': {'claim_name': 'airflow-kerberos-token-pvc',
 'read_only': None},
 'photon_persistent_disk': None,
 'portworx_volume': None,
 'projected': None,
 'quobyte': None,
 'rbd': None,
 'scale_io': None,
 'secret': None,
 'storageos': None,
 'vsphere_volume': None},
 {'aws_elastic_block_store': None,
 'azure_disk': None,
 'azure_file': None,
 'cephfs': None,
 'cinder': None,
 'config_map': None,
 'csi': None,
 'downward_api': None,
 'empty_dir': None,
 'ephemeral': None,
 'fc': None,
 'flex_volume': None,
 'flocker': None,
 'gce_persistent_disk': None,
 'git_repo': None,
 'glusterfs': None,
 'host_path': None,
 'image': None,
 'iscsi': None,
 'name': 'airflow-ml-model-training-volume',
 'nfs': None,
 'persistent_volume_claim': {'claim_name': 'airflow-ml-model-training',
 'read_only': None},
 'photon_persistent_disk': None,
 'portworx_volume': None,
 'projected': None,
 'quobyte': None,
 'rbd': None,
 'scale_io': None,
 'secret': None,
 'storageos': None,
 'vsphere_volume': None},
 {'aws_elastic_block_store': None,
 'azure_disk': None,
 'azure_file': None,
 'cephfs': None,
 'cinder': None,
 'config_map': None,
 'csi': None,
 'downward_api': None,
 'empty_dir': None,
 'ephemeral': None,
 'fc': None,
 'flex_volume': None,
 'flocker': None,
 'gce_persistent_disk': None,
 'git_repo': None,
 'glusterfs': None,
 'host_path': None,
 'image': None,
 'iscsi': None,
 'name': 'kube-api-access-gw959',
 'nfs': None,
 'persistent_volume_claim': None,
 'photon_persistent_disk': None,
 'portworx_volume': None,
 'projected': {'default_mode': 420,
 'sources': [{'cluster_trust_bundle': None,
 'config_map': None,
 'downward_api': None,
 'secret': None,
 'service_account_token': {'audience': None,
 'expiration_seconds': 3607,
 'path': 'token'}},
 {'cluster_trust_bundle': None,
 'config_map': {'items': [{'key': 'ca.crt',
 'mode': None,
 'path': 'ca.crt'}],
 'name': 'kube-root-ca.crt',
 'optional': None},
 'downward_api': None,
 'secret': None,
 'service_account_token': None},
 {'cluster_trust_bundle': None,
 'config_map': None,
 'downward_api': {'items': [{'field_ref': {'api_version': 'v1',
 'field_path': 'metadata.namespace'},
 'mode': None,
 'path': 'namespace',
 'resource_field_ref': None}]},
 'secret': None,
 'service_account_token': None}]},
 'quobyte': None,
 'rbd': None,
 'scale_io': None,
 'secret': None,
 'storageos': None,
 'vsphere_volume': None}]},
 'status': {'conditions': [{'last_probe_time': None,
 'last_transition_time': datetime.datetime(2025, 10, 3, 16, 58, 20, tzinfo=tzlocal()),
 'message': None,
 'reason': None,
 'status': 'True',
 'type': 'PodScheduled'}],
 'container_statuses': None,
 'ephemeral_container_statuses': None,
 'host_i_ps': None,
 'host_ip': None,
 'init_container_statuses': None,
 'message': None,
 'nominated_node_name': None,
 'phase': 'Pending',
 'pod_i_ps': None,
 'pod_ip': None,
 'qos_class': 'Burstable',
 'reason': None,
 'resize': None,
 'resource_claim_statuses': None,
 'start_time': None}}

I am going to investigate why this failed in prod even though it worked in staging.

After stripping down the tone-check pipeline, it still ran successfully in airflow-devenv but failed in airflow-ml as shown in the logs below.

kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '8f2e38fe-e84c-4464-87b7-ee511b7962cf', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'X-Kubernetes-Pf-Flowschema-Uid': 'c61211d6-8e62-4684-8f2d-3137b5d5c5dd', 'X-Kubernetes-Pf-Prioritylevel-Uid': '0b1de1a9-d62d-4e1d-8b6d-75ef91c83d5f', 'Date': 'Fri, 03 Oct 2025 16:58:23 GMT', 'Content-Length': '294'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"events is forbidden: User \"system:serviceaccount:airflow-ml:airflow\" cannot list resource \"events\" in API group \"\" in the namespace \"airflow-ml\"","reason":"Forbidden","details":{"kind":"events"},"code":403}

The error seems to be caused by the service account lacking required permissions to list events in the airflow-ml namespace.
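To make the diagnosis concrete, the 403 response body can be parsed to pull out exactly which user, verb, resource, and namespace were denied. The following is a small sketch using only the standard library; the function name and the regular expression are illustrative assumptions based on the message format seen in this one log, not a Kubernetes client API.

```python
import json
import re

# Response body copied from the ApiException in the log.
BODY = (
    r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    r'"message":"events is forbidden: User \"system:serviceaccount:airflow-ml:airflow\" '
    r'cannot list resource \"events\" in API group \"\" in the namespace \"airflow-ml\"",'
    r'"reason":"Forbidden","details":{"kind":"events"},"code":403}'
)

# Pattern inferred from the message above; other messages may be worded differently.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)" in the namespace "(?P<namespace>[^"]+)"'
)


def describe_forbidden(body: str) -> dict:
    """Extract who was denied what from a Kubernetes 'Forbidden' Status body."""
    status = json.loads(body)
    match = FORBIDDEN_RE.search(status.get("message", ""))
    if match is None:
        raise ValueError("not a recognised 'forbidden' message")
    return {"code": status.get("code"), **match.groupdict()}


print(describe_forbidden(BODY))
```

The extracted verb, resource, and namespace (list, events, airflow-ml) are what the service account's RBAC rules would need to allow. The empty API group denotes the core group.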

I have deployed the new version of the airflow chart to the airflow-ml instance.

This has fixed the issue, I believe.

btullis@deploy2002:~$ kube-env airflow-ml-deploy dse-k8s-eqiad
btullis@deploy2002:~$ kubectl auth can-i list events
yes

Yes, the tone_check_copy_hdfs_to_pvc_dag that uses the WMFKubernetesPodOperator in airflow-ml has now succeeded. Thanks a lot for your help!

Now the tone_check_retrain_dag is failing with the error below:

'message': '0/21 nodes are available: 1 node(s) '
            "didn't match Pod's node "
            'affinity/selector, 18 Insufficient '
            'amd.com/gpu, 2 node(s) had taint '
            '{node-role.kubernetes.io/control-plane: '
            "}, that the pod didn't tolerate.",
'reason': 'Unschedulable',

Full logs can be found here. This DAG seems to be failing because it is requesting a GPU, but all the GPUs are currently allocated to other jobs.
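The scheduler message can be broken down per reason to confirm that GPU availability is the dominant cause. The sketch below parses the message quoted above; the function name is hypothetical, and the parsing relies on the format of this particular message.

```python
import re

# Scheduler message reassembled from the log fragment above.
MESSAGE = (
    "0/21 nodes are available: 1 node(s) didn't match Pod's node "
    "affinity/selector, 18 Insufficient amd.com/gpu, 2 node(s) had taint "
    "{node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate."
)


def parse_unschedulable(message: str):
    """Return (available, total, {reason: node_count}) for a scheduler message."""
    header = re.match(r"(\d+)/(\d+) nodes are available: (.*)", message)
    if header is None:
        raise ValueError("unexpected scheduler message format")
    available, total, rest = int(header[1]), int(header[2]), header[3]
    reasons = {}
    # Split only on commas that start a new "<count> <reason>" entry, so the
    # comma inside the taint description does not break that entry apart.
    for part in re.split(r", (?=\d+ )", rest):
        count, reason = part.split(" ", 1)
        reasons[reason.rstrip(".")] = int(count)
    return available, total, reasons


available, total, reasons = parse_unschedulable(MESSAGE)
for reason, count in sorted(reasons.items(), key=lambda item: -item[1]):
    print(f"{count:>3}/{total}  {reason}")
```

In this message the per-reason counts add up to the 21 nodes, with 18 of them rejected for insufficient amd.com/gpu, which is consistent with the GPUs being allocated to other jobs.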

kevinbazira updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1731

ml: Enable deferrable execution of training operator to handle resource contention

kevinbazira merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1731

ml: Enable deferrable execution of training operator to handle resource contention

After enabling deferrable execution of the training operator to handle GPU resource contention, the following warning appears in both the airflow-devenv and airflow-ml web UI:

The triggerer does not appear to be running.

Triggers will not run, and any deferred operator will remain deferred until it times out and fails.

It looks like the triggerer process is not running:

$ airflow-devenv show dev-kevinbazira
NAME                                                             READY   STATUS    RESTARTS        AGE
airflow-dev-kevinbazira-envoy-7fd57d66b5-w9pr6                   1/1     Running   0               6m44s
airflow-dev-kevinbazira-gitsync-775d85c76d-5hsp8                 1/1     Running   0               6m44s
airflow-dev-kevinbazira-hadoop-shell-6b69d4568d-pdc2n            1/1     Running   0               6m44s
airflow-dev-kevinbazira-kerberos-79c595b79d-84w6x                1/1     Running   0               6m44s
airflow-dev-kevinbazira-scheduler-bd9c948c4-9hr5m                1/1     Running   2 (6m15s ago)   6m44s
airflow-dev-kevinbazira-statsd-b5f9b786f-6s7lh                   1/1     Running   0               6m44s
airflow-dev-kevinbazira-task-shell-6574b9b567-z26xh              1/1     Running   0               6m44s
airflow-dev-kevinbazira-webserver-6888756fb6-9r2jp               2/2     Running   1 (5m47s ago)   6m44s
retrain-tone-check-6u3hj5c                                       0/1     Pending   0               2m37s
tone-check-data-generation-pipeline-generate-training-xk34fcsf   1/1     Running   0               3m22s
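The same check can be done programmatically by listing which Airflow components have a running pod. This is only a sketch: it assumes deployment pods are named `<release>-<component>-<replicaset-hash>-<pod-hash>`, as in the listing above, and that a triggerer pod would follow the same scheme with the component name `triggerer`, which the listing cannot confirm.

```python
# A shortened copy of the pod listing above.
POD_LISTING = """\
NAME                                                  READY   STATUS    RESTARTS        AGE
airflow-dev-kevinbazira-hadoop-shell-6b69d4568d-pdc2n 1/1     Running   0               6m44s
airflow-dev-kevinbazira-scheduler-bd9c948c4-9hr5m     1/1     Running   2 (6m15s ago)   6m44s
airflow-dev-kevinbazira-webserver-6888756fb6-9r2jp    2/2     Running   1 (5m47s ago)   6m44s
retrain-tone-check-6u3hj5c                            0/1     Pending   0               2m37s
"""


def running_components(listing: str, release: str) -> set:
    """Return the component names of the release that have a Running pod."""
    prefix = release + "-"
    components = set()
    for line in listing.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, status = fields[0], fields[2]
        if name.startswith(prefix) and status == "Running":
            # Drop the release prefix and the two trailing hash segments.
            components.add(name[len(prefix):].rsplit("-", 2)[0])
    return components


components = running_components(POD_LISTING, "airflow-dev-kevinbazira")
print(sorted(components))
print("triggerer running:", "triggerer" in components)
```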

@BTullis, in this Slack thread, you mentioned that you had tried the triggerer process. Could you please help us activate it in both airflow-devenv and airflow-ml?

Since using the TriggerDagRunOperator for cross-DAG orchestration requires one to always check that all DAGs are unpaused, the ML team decided to pursue T407212: Merge tone-check pipeline DAGs into a single DAG for simplified orchestration.