
Create an analytics service user for the ML team
Closed, ResolvedPublic

Description

While doing some debugging, we noticed an oddity with the ML team's Airflow artifacts:

$ pwd
/mnt/hdfs/wmf/cache/artifacts/airflow
xcollazo@stat1011:/mnt/hdfs/wmf/cache/artifacts/airflow$ ls -lsha
total 48K
4.0K drwxrwxr-x  12 analytics              analytics-privatedata-users 4.0K Jul 16 08:38 .
4.0K drwxrwxr-x   3 analytics              analytics-privatedata-users 4.0K Feb 10  2022 ..
4.0K drwxrwx--- 113 analytics              analytics-privatedata-users 4.0K Jul 29 14:29 analytics
4.0K drwxrwx---  21 analytics-product      analytics-privatedata-users 4.0K Jul  8 18:27 analytics_product
4.0K drwxr-x---  28 analytics              analytics-privatedata-users 4.0K Jul 18 18:31 analytics_test
4.0K drwxrwx---   2 analytics              analytics-privatedata-users 4.0K Feb 28 17:17 main
4.0K drwxrwx---  15 kevinbazira            analytics-privatedata-users 4.0K Jul 30 17:32 ml                      <<<<<<<<<<<<<<<<<<<<
4.0K drwxr-x---   3 ozge                   analytics-privatedata-users 4.0K Jul 16 08:38 mlozge                  <<<<<<<<<<<<<<<<<<<<
4.0K drwxrwx---  80 analytics-platform-eng analytics-privatedata-users 4.0K Jul  9 10:53 platform_eng
4.0K drwxrwx---  12 analytics-research     analytics-privatedata-users 4.0K Jul 16 08:34 research
4.0K drwxrwx---  58 analytics-search       analytics-privatedata-users 4.0K Jul 16 20:18 search
4.0K drwxrwx---   4 analytics-wmde         analytics-privatedata-users 4.0K Dec 19  2023 wmde

Normally these folders should be owned by a system user, in this case perhaps analytics-ml.

@kevinbazira explains:

We created /wmf/cache/artifacts/airflow/ml since we had to manually upload artifacts before this was automated with blunderbuss.

Slack thread for this convo.

Although this was a good temporary solution, we should give the ML team a proper service user to own the Airflow assets, as well as future assets such as ML-owned Hive and Iceberg tables.

In this task we should:

  • Create an analytics-ml service user for the ML team
  • Make sure all existing members of the ML team are part of a group like analytics-ml-users so that they can control their analytics assets.
  • Remove the presumably temporary HDFS folder /wmf/cache/artifacts/airflow/mlozge.
  • Change ownership of /wmf/cache/artifacts/airflow/ml to the new analytics-ml service user.
  • Make sure existing and future Airflow jobs run as the new analytics-ml service user.

Event Timeline

brouberol changed the task status from Open to In Progress.Aug 1 2025, 2:37 PM
brouberol claimed this task.

Regarding this item:

Change ownership of /wmf/cache/artifacts/airflow/ml to the new analytics-ml service user.

I think that there is cross-over with the conversation here about blunderbuss having its own system user and group.

the topic at hand is to somehow reduce the privileges that blunderbuss has today on HDFS, because we noted a privilege escalation to analytics. We want blunderbuss to have just enough privileges to accomplish its current use case of landing files into the Airflow caches.

I am not sure whether that conversation has resulted in a ticket (or tickets) yet, but ultimately we probably want the files in /wmf/cache/artifacts/airflow to be owned by the blunderbuss system user, but readable by the airflow system users.

I have now created: T401103: Ensure that blunderbuss uses the minimum HDFS file system permissions required to deal with the file ownership issue of Airflow artifacts that are deployed by blunderbuss.

Change #1175890 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] data: define ML-related user and group

https://gerrit.wikimedia.org/r/1175890

Change #1175890 merged by Brouberol:

[operations/puppet@production] data: define ML-related user and group

https://gerrit.wikimedia.org/r/1175890

brouberol@krb1002:~$ sudo kadmin.local addprinc -randkey analytics-ml/airflow-ml.discovery.wmnet@WIKIMEDIA
brouberol@krb1002:~$ sudo kadmin.local ktadd -norandkey -k airlfow-ml.keytab     analytics-ml/airflow-ml.discovery.wmnet     airflow/airflow-ml.discovery.wmnet@WIKIMEDIA     HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA
Entry for principal analytics-ml/airflow-ml.discovery.wmnet with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:airlfow-ml.keytab.
Entry for principal airflow/airflow-ml.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:airlfow-ml.keytab.
Entry for principal HTTP/airflow-ml.discovery.wmnet@WIKIMEDIA with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:airlfow-ml.keytab.

I'm going to now redeploy airflow-ml with that keytab.

brouberol@an-launcher1002:~$ sudo kerberos-run-command hdfs hdfs dfs -chown -R analytics-ml:analytics-ml-users  /wmf/cache/artifacts/airflow/ml/
brouberol@an-launcher1002:~$ sudo kerberos-run-command hdfs hdfs dfs -ls /wmf/cache/artifacts/airflow/ml
Found 17 items
-rw-r-----   3 analytics-ml analytics-ml-users 1049095109 2025-07-28 09:50 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.1.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1048901241 2025-08-01 10:49 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.10.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1049126909 2025-07-28 13:01 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.2.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1049135031 2025-07-29 13:12 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.3.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1049172687 2025-07-30 15:21 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.4.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1049141625 2025-07-30 15:56 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.5.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1049140425 2025-07-30 16:49 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.6.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1049122207 2025-07-30 17:32 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.7.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1048904428 2025-07-31 09:13 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.8.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1048936854 2025-07-31 10:24 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-testv0.0.9.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1048381631 2025-07-21 13:12 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-v0.0.1.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1048955463 2025-08-01 08:40 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1-v0.0.2.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1048446656 2025-07-18 12:12 /wmf/cache/artifacts/airflow/ml/add_a_link-0.1.1.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users 1022785897 2025-07-11 00:55 /wmf/cache/artifacts/airflow/ml/add_a_link_etl-0.1.1.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users  402926424 2025-06-17 22:34 /wmf/cache/artifacts/airflow/ml/example-0.1.0.conda.tgz
-rw-r-----   3 analytics-ml analytics-ml-users   24188575 2025-06-15 10:11 /wmf/cache/artifacts/airflow/ml/hdfs-tools-0.0.6-shaded.jar
-rw-r-----   3 analytics-ml analytics-ml-users      21080 2025-06-15 10:11 /wmf/cache/artifacts/airflow/ml/wmf-sparksqlclidriver-1.0.0.jar

Change #1175908 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] Create the analytics-ml-users group stat hosts

https://gerrit.wikimedia.org/r/1175908

Change #1175910 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] airflow-ml: update principal to reflect recent change to analytics-ml

https://gerrit.wikimedia.org/r/1175910

airflow@airflow-kerberos-6766cc659b-ns96p:/opt/airflow$ klist
Ticket cache: FILE:/tmp/airflow_krb5_ccache/krb5cc
Default principal: analytics-ml/airflow-ml.discovery.wmnet@WIKIMEDIA

Valid starting     Expires            Service principal
08/05/25 14:43:56  08/07/25 14:43:56  krbtgt/WIKIMEDIA@WIKIMEDIA
	renew until 08/08/25 02:43:54

Change #1175910 merged by Brouberol:

[operations/deployment-charts@master] airflow-ml: update principal to reflect recent change to analytics-ml

https://gerrit.wikimedia.org/r/1175910

Change #1175908 merged by Brouberol:

[operations/puppet@production] Create the analytics-ml-users group stat hosts

https://gerrit.wikimedia.org/r/1175908

brouberol@an-launcher1002:~$ sudo kerberos-run-command hdfs hdfs dfs -rm -r /wmf/cache/artifacts/airflow/mlozge
25/08/05 14:46:55 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/cache/artifacts/airflow/mlozge' to trash at: hdfs://analytics-hadoop/user/hdfs/.Trash/Current/wmf/cache/artifacts/airflow/mlozge
brouberol@an-launcher1002:~$ sudo kerberos-run-command hdfs hdfs dfs -ls /wmf/cache/artifacts/airflow/
Found 9 items
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-08-04 21:32 /wmf/cache/artifacts/airflow/analytics
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-08-04 21:32 /wmf/cache/artifacts/airflow/analytics_product
drwxr-x---   - analytics    analytics-privatedata-users          0 2025-08-04 21:33 /wmf/cache/artifacts/airflow/analytics_test
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-02-28 17:17 /wmf/cache/artifacts/airflow/main
drwxrwx---   - analytics-ml analytics-ml-users                   0 2025-08-01 10:49 /wmf/cache/artifacts/airflow/ml
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-08-04 21:34 /wmf/cache/artifacts/airflow/platform_eng
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-08-04 21:35 /wmf/cache/artifacts/airflow/research
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-08-04 21:35 /wmf/cache/artifacts/airflow/search
drwxrwx---   - analytics    analytics-privatedata-users          0 2025-08-04 21:35 /wmf/cache/artifacts/airflow/wmde
brouberol@stat1011:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Warning: The current total number of facts: 2645 exceeds the number of facts limit: 2048
Info: Caching catalog for stat1011.eqiad.wmnet
...
Notice: /Stage[main]/Admin/Admin::Hashgroup[analytics-ml-users]/Admin::Group[analytics-ml-users]/Group[analytics-ml-users]/ensure: created
Notice: /Stage[main]/Admin/Admin::Groupmembers[analytics-ml-users]/Exec[analytics-ml-users_ensure_members]/returns: executed successfully
...
brouberol@stat1011:~$ sudo -i -u kevinbazira

You do not have a valid Kerberos ticket in the credential cache, remember to kinit.
kevinbazira@stat1011:~$ groups
wikidev render analytics-privatedata-users analytics-ml-users ml-team-admins deploy-ml-service

@brouberol thank you for the recent updates to the HDFS permissions and ownership for /wmf/cache/artifacts/airflow/ml. I noticed that after these changes a dag run using the airflow-devenv fails as shown in: https://phabricator.wikimedia.org/P81347

thank you @brouberol for working on this.
I get a similar error for this run:

skein.exceptions.DriverError: Failed to submit application, exception:
Permission denied: user=analytics-ml, access=WRITE, inode="/user":hdfs:hadoop:drwxrwxr-x

hello,

I'm not sure if this is related but I'm trying the same pipeline from airflow dev and I'm getting the following error:

[2025-08-18, 09:18:08 UTC] {skein.py:98} INFO - Constructing skein Client with kwargs: {}
[2025-08-18, 09:18:08 UTC] {logging_mixin.py:190} WARNING - /tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/skein/exceptions.py:56 UserWarning: Skein global security credentials not found, writing now to '/home/airflow/.skein'.
[2025-08-18, 09:18:11 UTC] {taskinstance.py:3313} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 768, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 734, in _execute_callable
    return ExecutionCallableRunner(
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/utils/operator_helpers.py", line 252, in run
    return self.func(*args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/models/baseoperator.py", line 424, in wrapper
    return func(self, *args, **kwargs)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/airflow/providers/apache/spark/operators/spark_submit.py", line 174, in execute
    self._hook.submit(self.application)
  File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/hooks/spark.py", line 494, in submit
    return self._skein_hook.submit()
  File "/opt/airflow/dags/airflow_dags/wmf_airflow_common/hooks/skein.py", line 224, in submit
    self._application_id = self._client.submit(self._application_spec)
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/skein/core.py", line 510, in submit
    resp = self._call('submit', spec.to_protobuf())
  File "/tmp/pyenv/versions/3.10.15/lib/python3.10/site-packages/skein/core.py", line 290, in _call
    raise self._server_error(exc.details())
skein.exceptions.DriverError: Kerberos ticket not found, please kinit and restart

hello,

I'm not sure if this is related but I'm trying the same pipeline from airflow dev and I'm getting the following error:

We can ignore this one, which is related to airflow dev. After a fresh restart, I get the same error as Kevin's.

Change #1180123 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] admin/data: add the analytics-ml system user to the analytics-privatedata users

https://gerrit.wikimedia.org/r/1180123

As part of https://phabricator.wikimedia.org/T401103, we are going to make /wmf/cache/artifacts/airflow readable by anyone, as these are artifacts that are going to be populated by blunderbuss. I've manually set the following permissions on /wmf/cache/artifacts/airflow/ml

drwxr-x---   - analytics-ml analytics-privatedata-users          0 2025-08-14 09:24 hdfs:///wmf/cache/artifacts/airflow/ml

I've also created /tmp/ml with the same set of permissions.
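For reference, a hedged sketch of the kind of commands involved in the ownership and permission change described above, run as the hdfs superuser via kerberos-run-command. These are not the exact invocations used; the paths come from the listing above, and the 750 mode matches the drwxr-x--- shown.

```shell
# Sketch only: set the ownership/permissions shown above on the ml artifact cache.
sudo kerberos-run-command hdfs hdfs dfs -chown -R analytics-ml:analytics-privatedata-users /wmf/cache/artifacts/airflow/ml
sudo kerberos-run-command hdfs hdfs dfs -chmod 750 /wmf/cache/artifacts/airflow/ml

# Create /tmp/ml with the same ownership and mode.
sudo kerberos-run-command hdfs hdfs dfs -mkdir -p /tmp/ml
sudo kerberos-run-command hdfs hdfs dfs -chown analytics-ml:analytics-privatedata-users /tmp/ml
sudo kerberos-run-command hdfs hdfs dfs -chmod 750 /tmp/ml
```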

Change #1180123 merged by Brouberol:

[operations/puppet@production] admin/data: add the analytics-ml system user to the analytics-privatedata users

https://gerrit.wikimedia.org/r/1180123

@brouberol thank you for adding the analytics-ml system user to the analytics-privatedata-users group. Since manual uploads to /wmf/cache/artifacts/airflow/ml are no longer possible and artifacts are now added to this cache directory by blunderbuss when MRs are merged to the main branch, I used my own personal cache directory to test the new HDFS permissions.

With this setup, the DAG ran successfully without any permission issues: it was able to read artifacts from /user/kevinbazira/artifacts/, access a table from the data lake, and store a parquet file in /user/kevinbazira/example_etl_data_generation_pipeline.

@brouberol thank you
I think we are almost there.
I think the analytics-ml user is missing some more permissions:

skein.exceptions.DriverError: Failed to submit application, exception:
org.apache.hadoop.security.AccessControlException: User analytics-ml does not have permission to submit application_1754906949114_253169 to queue production

https://airflow-ml.wikimedia.org/dags/add_a_link_pipeline/grid?dag_run_id=manual__2025-08-20T06%3A43%3A38.249112%2B00%3A00&tab=mapped_tasks&task_id=generate_anchor_dictionary&map_index=1

Change #1180509 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] yarn: allow the analytics-ml user to send jobs in the production queue

https://gerrit.wikimedia.org/r/1180509

Change #1180509 merged by Brouberol:

[operations/puppet@production] yarn: allow the analytics-ml user to send jobs in the production queue

https://gerrit.wikimedia.org/r/1180509

@OKarakaya-WMF I merged a patch adding u:analytics-ml:production to /etc/hadoop/conf/capacity-scheduler.xml:

--- /etc/hadoop/conf/capacity-scheduler.xml	2025-07-10 14:38:43.915836903 +0000
+++ /tmp/puppet-file20250820-3921031-dco65z	2025-08-20 09:27:06.275646424 +0000
@@ -13,7 +13,7 @@
   </property>
   <property>
     <name>yarn.scheduler.capacity.queue-mappings</name>
-    <value>u:druid:production,u:analytics:production,u:analytics-platform-eng:production,u:analytics-research:production,u:analytics-search:production,u:analytics-product:production,u:analytics-wmde:production,g:analytics-privatedata-users:default</value>
+    <value>u:druid:production,u:analytics:production,u:analytics-platform-eng:production,u:analytics-research:production,u:analytics-search:production,u:analytics-product:production,u:analytics-wmde:production,u:analytics-ml:production,g:analytics-privatedata-users:default</value>
   </property>
   <property>
     <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>

and then ran

brouberol@an-master1003:~$ sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues
brouberol@an-master1003:~$

on the current Hadoop master.

Feel free to retry.

hi @brouberol

Thank you for the patch 🤗

I've just re-tried the pipeline and I get the following error. Should we also refresh something on ml airflow side?

https://airflow-ml.wikimedia.org/dags/add_a_link_pipeline/grid?dag_run_id=manual__2025-08-20T10%3A33%3A29.093095%2B00%3A00&tab=mapped_tasks&task_id=generate_anchor_dictionary&map_index=1

Caused by: org.apache.hadoop.security.AccessControlException: User analytics-ml does not have permission to submit application_1754906949114_259290 to queue production
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:454)
	... 13 more

<sigh> my fault. Sorry, I'm not super versed in Hadoop, so I have to reverse engineer many things. I've identified a missing ACL in the configuration. I know how to reproduce the issue, so I'll notify you when things work, instead of asking you to test things for me.

Change #1180528 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] yarn: grant analytics-ml the acls required to send jobs in the production queue

https://gerrit.wikimedia.org/r/1180528

Change #1180528 merged by Brouberol:

[operations/puppet@production] yarn: grant analytics-ml the acls required to send jobs in the production queue

https://gerrit.wikimedia.org/r/1180528

no worries at all. I'm available whenever I can help :)

25/08/20 11:58:21 INFO impl.YarnClientImpl: Submitted application application_1754906949114_260505

Aha!

Hi @brouberol,

I think we have a new small issue.

I can't see the logs for the pipeline anymore.
Could you please help check whether the ML team should be in a specific group?

ozge@stat1010:~$ yarn logs -appOwner analytics-ml -applicationId application_1754906949114_261287
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Permission denied: user=ozge, access=EXECUTE, inode="/var/log/hadoop-yarn/apps/analytics-ml":analytics-ml:mapred:drwxrwx---

Funnily enough, I was just working on https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1616

Following these instructions works, as the pod runs with the appropriate analytics-ml kerberos primary:

brouberol@deploy1003:~$ kube_env airflow-ml-deploy dse-k8s-eqiad
brouberol@deploy1003:~$ kubectl exec -it $(kubectl get pod -l app=airflow,component=hadoop-shell --no-headers -o custom-columns=":metadata.name") -- bash
airflow@airflow-hadoop-shell-556598697d-fvkfg:/opt/airflow$ yarn logs -appOwner analytics-ml -applicationId application_1754906949114_261287 | head
25/08/20 12:53:35 INFO ZlibFactory: Successfully loaded & initialized native-zlib library
25/08/20 12:53:36 INFO CodecPool: Got brand-new decompressor [.deflate]
Container: container_e139_1754906949114_261287_01_000001 on an-worker1157.eqiad.wmnet_8041_1755693624520
LogAggregationType: AGGREGATED
========================================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Wed Aug 20 12:40:24 +0000 2025
LogLength:184
LogContents:
2025-08-20 12:38:58,367 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.

End of LogType:container-localizer-syslog

Looking at these logs, I'm seeing

...
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:200)
	... 33 more
Caused by: java.lang.RuntimeException: Duplicate map key Jimmy Snuka was found, please check the input data. If you want to remove the duplicated keys, you can set spark.sql.mapKeyDedupPolicy to LAST_WIN so that the key inserted at last takes precedence.

thank you @brouberol ,

Can we get read permissions for the ml team members?

I can do that. Just to validate, what command did you run and from what host?

sure,
host: stat1010
command: yarn logs -appOwner analytics-ml -applicationId application_1754906949114_261287

Change #1180559 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] stat: deploy an analytics-ml keytab on each host

https://gerrit.wikimedia.org/r/1180559

hi @brouberol ,

to use conda artifacts in Airflow, we manually copy artifacts from GitLab to hdfs:///wmf/cache/artifacts/airflow/ml

However, we don't have permissions to write to this location anymore. Do you have any suggestions how to access artifacts in airflow?

Write permissions to /wmf/cache/artifacts/airflow/ml would be sufficient, but I'm also open to other ideas

our config: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/ml/config/artifacts.yaml#L33

Thank you!

ohh I guess we will have this permission after this patch is merged:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180559

If so, no problem and no additional actions needed.


Now that we have a proper service user, hdfs:///wmf/cache/artifacts/airflow/ml becomes a production asset. I suggest we do not manually add non-production assets to that folder unless it is an emergency.

If your intention is to test your artifacts before deploying to production, we provide the DagProperties mechanism for this. Here is the definition, and here is an example usage with an artifact. Instead of copying your test artifact to the above folder, you would copy it to a folder you own, and then chmod the file so that it is world readable and executable (see https://gitlab.wikimedia.org/-/snippets/71 for more info). You can then point to that artifact by overriding the DagProperty via Airflow's Variables tab.
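The test-artifact workflow above can be sketched as follows. This is an illustration, not a prescribed procedure: the artifact name my_project-0.1.0.conda.tgz and the /user/$USER/artifacts path are hypothetical placeholders.

```shell
# Sketch: stage a test artifact in a directory you own instead of the
# production cache (artifact name and path are illustrative).
hdfs dfs -mkdir -p /user/$USER/artifacts
hdfs dfs -put my_project-0.1.0.conda.tgz /user/$USER/artifacts/

# Make it world readable/executable so YARN containers can fetch it
# (see https://gitlab.wikimedia.org/-/snippets/71).
hdfs dfs -chmod 755 /user/$USER/artifacts/my_project-0.1.0.conda.tgz

# Then override the corresponding DagProperty in Airflow's Variables tab to
# point at hdfs:///user/$USER/artifacts/my_project-0.1.0.conda.tgz
```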

Change #1180559 merged by Brouberol:

[operations/puppet@production] stat: deploy an analytics-ml keytab on each host

https://gerrit.wikimedia.org/r/1180559

Change #1182095 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/puppet@production] stat: group chown the analytics-ml to analytics-ml-users

https://gerrit.wikimedia.org/r/1182095

hey @brouberol ,

I'm getting following errors. Could it be related to the patches above?

ozge@stat1010:~$ kinit
Password for analytics-ml/stat1010.eqiad.wmnet@WIKIMEDIA:
kinit: Password incorrect while getting initial credentials
hdfs dfs -ls /user/ozge/addalink
Permission denied: user=analytics-ml, access=EXECUTE, inode="/user/ozge":ozge:ozge:drwxr-x---
hdfs dfs -ls /tmp/ozge/addalink
ls: Permission denied: user=analytics-ml, access=READ_EXECUTE, inode="/tmp/ozge/addalink":ozge:hdfs:drwxr-x---

Yep, sorry, that's on me. While the stat: group chown the analytics-ml to analytics-ml-users patch was still pending, I manually group-chowned the keytab directory to analytics-ml-users to check that it would work, and ran kerberos-run-command analytics-ml hdfs dfs -ls as your user on stat1010. That populated a Kerberos cache file for that specific keytab, whose permissions were then reverted by Puppet. I deleted the cache file for now, until https://gerrit.wikimedia.org/r/c/operations/puppet/+/1182095 can get properly merged.

Try again?

cool, no problem! it's back to normal 😍

Change #1182095 merged by Brouberol:

[operations/puppet@production] stat: group chown the analytics-ml to analytics-ml-users

https://gerrit.wikimedia.org/r/1182095

The patch has been merged, so running kerberos-run-command analytics-ml hdfs dfs should now work and allow y'all to impersonate analytics-ml in HDFS.

Hey @brouberol

I don't have access to yarn logs. Is it expected?

ozge@stat1010:~$ yarn logs -appOwner analytics-ml -applicationId application_1754906949114_424446
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Permission denied: user=ozge, access=EXECUTE, inode="/var/log/hadoop-yarn/apps/analytics-ml":analytics-ml:mapred:drwxrwx---

logs produced by this run:

https://airflow-ml.wikimedia.org/dags/add_a_link_pipeline/grid?tab=graph&dag_run_id=manual__2025-08-27T12%3A58%3A19.312216%2B00%3A00&task_id=generate_anchor_dictionary

Have you tried running kerberos-run-command analytics-ml yarn logs -appOwner analytics-ml -applicationId application_1754906949114_424446 to impersonate the analytics-ml user?

No problem!

I also would like to emphasize that the canonical way to get yarn logs when your airflow instance runs in Kubernetes is detailed here https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_yarn_CLI

The pod running in your airflow namespace already ships with an analytics-ml Kerberos primary. This is also documented in Skein job output. Ex: airflow-ml.wikimedia.org / add_a_link_pipeline / manual__2025-08-21T13:43:12.452185+00:00 / generate_anchor_dictionary

[2025-08-21, 14:03:46 UTC] {skein.py:296} INFO - SkeinHook Airflow SparkSkeinSubmitHook skein launcher add_a_link_pipeline__generate_anchor_dictionary__20250821 application_1754906949114_286213 - YARN application log collection is disabled. To view logs for the YARN App Master, run the following command:
	See also https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Airflow/Kubernetes#Use_the_yarn_CLI
	yarn logs -appOwner analytics-ml -applicationId application_1754906949114_286213

hey @brouberol ,

I'm getting following errors. Could it be related to the patches above?

ozge@stat1010:~$ kinit
Password for analytics-ml/stat1010.eqiad.wmnet@WIKIMEDIA:
kinit: Password incorrect while getting initial credentials
hdfs dfs -ls /user/ozge/addalink
Permission denied: user=analytics-ml, access=EXECUTE, inode="/user/ozge":ozge:ozge:drwxr-x---
hdfs dfs -ls /tmp/ozge/addalink
ls: Permission denied: user=analytics-ml, access=READ_EXECUTE, inode="/tmp/ozge/addalink":ozge:hdfs:drwxr-x---

Good morning @brouberol ,

I'm having the same issue. Do you have an idea what could cause it?

When you run kerberos-run-command analytics-ml yarn logs -appOwner analytics-ml -applicationId application_1754906949114_424446, it populates your personal kerberos cache file:

ozge@stat1010:~$ klist
Ticket cache: FILE:/tmp/krb5cc_48958
Default principal: analytics-ml/stat1010.eqiad.wmnet@WIKIMEDIA

Valid starting       Expires              Service principal
08/28/2025 07:56:19  08/30/2025 07:56:19  krbtgt/WIKIMEDIA@WIKIMEDIA
	renew until 09/04/2025 07:56:19

So either you run kerberos-related commands from stat1010 as yourself and run the analytics-ml-related commands as detailed in https://phabricator.wikimedia.org/T400902#11124365, or you specify the KRB5CCNAME env var when running kerberos-run-command analytics-ml, to make sure the associated Kerberos cache does not end up at the same location as your personal one.

Ex:

ozge@stat1010:~$ KRB5CCNAME=/tmp/krb5cc_analytics-ml kerberos-run-command analytics-ml yarn logs -appOwner analytics-ml -applicationId application_1754906949114_424446
<snip>
ozge@stat1010:~$ klist
klist: No credentials cache found (filename: /tmp/krb5cc_48958)
ozge@stat1010:~$ klist -f /tmp/krb5cc_analytics-ml
Ticket cache: FILE:/tmp/krb5cc_analytics-ml
Default principal: analytics-ml/stat1010.eqiad.wmnet@WIKIMEDIA

Valid starting       Expires              Service principal
08/28/2025 07:58:54  08/30/2025 07:58:54  krbtgt/WIKIMEDIA@WIKIMEDIA
	renew until 09/04/2025 07:58:54, Flags: FPRIA

Thank you @brouberol ,

I think we can close this task.