
Airflow-triggered Spark-jobs produce hdfs-files belonging to the wrong hdfs-user-group
Closed, ResolvedPublic

Description

Here is an hdfs listing of files in a partition created by Oozie:

aqu@stat1004:~$ hdfs dfs -ls -h /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14 | head                                                                                       
Found 513 items
-rw-r-----   3 analytics analytics-privatedata-users          0 2022-02-25 01:58 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/_PARTITIONED                                
-rwxr-x---   3 analytics analytics-privatedata-users    263.0 M 2022-02-25 01:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00000-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    266.1 M 2022-02-25 01:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00001-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    265.6 M 2022-02-25 01:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00002-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    263.9 M 2022-02-25 01:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00003-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    264.6 M 2022-02-25 01:54 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00004-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    265.2 M 2022-02-25 01:54 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00005-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    264.6 M 2022-02-25 01:54 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00006-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rwxr-x---   3 analytics analytics-privatedata-users    264.3 M 2022-02-25 01:54 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00007-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000

Here is the same listing for a partition created by an Airflow-triggered Spark job:

aqu@stat1004:~$ hdfs dfs -ls -h /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21 | head                                                                                       
Found 512 items
-rw-r-----   3 analytics hdfs    269.6 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00000-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    269.4 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00001-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    268.9 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00002-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    269.4 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00003-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    268.8 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00004-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    270.5 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00005-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    269.4 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00006-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    269.4 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00007-173e32e5-81fe-4374-83ec-380bcb12d107.c000               
-rw-r-----   3 analytics hdfs    269.8 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00008-173e32e5-81fe-4374-83ec-380bcb12d107.c000

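You can see that the HDFS group is now hdfs instead of analytics-privatedata-users, and that the file mode has changed from -rwxr-x--- to -rw-r-----. I think the consequence is that the wmf-internal-users lose access to those files, since the mode grants no permissions to anyone outside the owning group.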
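To make the comparison above repeatable, here is a minimal sketch that scans the text output of `hdfs dfs -ls` and reports entries whose group differs from the expected one. The function name `find_wrong_group` and the sample constant are hypothetical, not part of any existing tool. It assumes paths contain no spaces and that the listing uses the column layout shown in the outputs above.

```python
import re
from typing import List, Tuple

# Matches the permission column of an `hdfs dfs -ls` entry, e.g. "-rw-r-----".
# A trailing "+" (ACL marker) is tolerated.
PERM_RE = re.compile(r"^[-dl][-rwxsStT]{9}\+?$")


def find_wrong_group(listing: str, expected_group: str) -> List[Tuple[str, str]]:
    """Return (path, group) pairs for entries not owned by expected_group.

    Assumption: columns are permissions, replication, owner, group, size,
    date, time, path, and the path contains no whitespace.
    """
    mismatches = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip headers such as "Found 512 items" and blank lines.
        if len(fields) < 8 or not PERM_RE.match(fields[0]):
            continue
        group, path = fields[3], fields[-1]
        if group != expected_group:
            mismatches.append((path, group))
    return mismatches


# Two entries copied from the listings in this task, one per scheduler.
SAMPLE_LISTING = """Found 2 items
-rwxr-x---   3 analytics analytics-privatedata-users    263.0 M 2022-02-25 01:55 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-14/part-00000-4b8ace4f-d908-4272-8dfe-7bda5ee03198.c000
-rw-r-----   3 analytics hdfs    269.6 M 2022-03-07 11:00 /wmf/data/wmf/wikidata/entity/snapshot=2022-02-21/part-00000-173e32e5-81fe-4374-83ec-380bcb12d107.c000
"""

for path, group in find_wrong_group(SAMPLE_LISTING, "analytics-privatedata-users"):
    print(f"{group}\t{path}")
```

The listing text could come from piping the `hdfs dfs -ls` commands above into the script. Paths it reports could be repaired one-off with the existing `hdfs dfs -chgrp` command, although that would not address why the Airflow-triggered job writes with the hdfs group in the first place.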

Slack original thread: https://wikimedia.slack.com/archives/C02291Z9YQY/p1646402078889739