While debugging, we found an oddity in the ownership of the ML team's Airflow artifacts:
$ pwd
/mnt/hdfs/wmf/cache/artifacts/airflow
xcollazo@stat1011:/mnt/hdfs/wmf/cache/artifacts/airflow$ ls -lsha
total 48K
4.0K drwxrwxr-x  12 analytics              analytics-privatedata-users 4.0K Jul 16 08:38 .
4.0K drwxrwxr-x   3 analytics              analytics-privatedata-users 4.0K Feb 10  2022 ..
4.0K drwxrwx--- 113 analytics              analytics-privatedata-users 4.0K Jul 29 14:29 analytics
4.0K drwxrwx---  21 analytics-product      analytics-privatedata-users 4.0K Jul  8 18:27 analytics_product
4.0K drwxr-x---  28 analytics              analytics-privatedata-users 4.0K Jul 18 18:31 analytics_test
4.0K drwxrwx---   2 analytics              analytics-privatedata-users 4.0K Feb 28 17:17 main
4.0K drwxrwx---  15 kevinbazira            analytics-privatedata-users 4.0K Jul 30 17:32 ml <<<<<<<<<<<<<<<<<<<<
4.0K drwxr-x---   3 ozge                   analytics-privatedata-users 4.0K Jul 16 08:38 mlozge <<<<<<<<<<<<<<<<<<<<
4.0K drwxrwx---  80 analytics-platform-eng analytics-privatedata-users 4.0K Jul  9 10:53 platform_eng
4.0K drwxrwx---  12 analytics-research     analytics-privatedata-users 4.0K Jul 16 08:34 research
4.0K drwxrwx---  58 analytics-search       analytics-privatedata-users 4.0K Jul 16 20:18 search
4.0K drwxrwx---   4 analytics-wmde         analytics-privatedata-users 4.0K Dec 19  2023 wmde
Normally these folders should be owned by a system user, in this case perhaps analytics-ml.
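To spot this kind of ownership drift without reading the listing by eye, a small filter along these lines could help. This is only a sketch: the function name `flag_personal_owners` is hypothetical, and it assumes that service accounts are named `analytics` or start with `analytics-`, which matches the listing but is not a documented rule. It expects plain `ls -l` output (owner in the third column), not `ls -ls`.

```shell
# Hypothetical helper: reads `ls -l`-style lines on stdin and prints
# "<owner> <name>" for every directory whose owner does not look like a
# service account.
# Assumption: service accounts are "analytics" or start with "analytics-".
flag_personal_owners() {
  awk '
    $1 ~ /^d/ && $NF != "." && $NF != ".." &&
    $3 != "analytics" && $3 !~ /^analytics-/ { print $3, $NF }
  '
}
```

Possible usage: `ls -l /mnt/hdfs/wmf/cache/artifacts/airflow | flag_personal_owners`.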
@kevinbazira explains:
We created /wmf/cache/artifacts/airflow/ml since we had to manually upload artifacts before this was automated with blunderbuss.
Although this was a good temporary solution, we should give the ML team a proper service user to own the Airflow assets, as well as future assets like ML-owned Hive and Iceberg tables.
In this task we should:
- Create an analytics-ml service user for the ML team.
- Make sure all existing members of the ML team are part of a group like analytics-ml-users so that they can control their analytics assets.
- Remove the presumably temporary HDFS folder /wmf/cache/artifacts/airflow/mlozge.
- Change ownership of /wmf/cache/artifacts/airflow/ml to the new analytics-ml service user.
- Make sure existing and future Airflow jobs run as the new analytics-ml service user.
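The two HDFS steps in the list above (removing the mlozge folder and changing ownership of the ml folder) might look roughly like the sketch below. It assumes the analytics-ml user already exists and that the caller has HDFS superuser rights, which `chown` normally requires. The `run` wrapper, the `DRY_RUN` variable, and the function name `migrate_ml_artifacts` are hypothetical. Only the owner is changed; the group is left as it is, since the task does not say whether it should change. Creating the user and group and switching the Airflow jobs over are not covered here.

```shell
# Sketch only: covers the two HDFS steps from the task list.
# DRY_RUN defaults to 1, so the commands are printed rather than executed.
ARTIFACTS_DIR=/wmf/cache/artifacts/airflow

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

migrate_ml_artifacts() {
  # Remove the presumably temporary folder.
  run hdfs dfs -rm -r "$ARTIFACTS_DIR/mlozge"
  # Hand the ML folder to the service user; the group stays unchanged.
  run hdfs dfs -chown -R analytics-ml "$ARTIFACTS_DIR/ml"
}
```

Calling `migrate_ml_artifacts` as-is only prints the planned commands; setting `DRY_RUN=0` would execute them, so the printed plan is worth reviewing first.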