/wmf/data/raw should be readable by analytics-privatedata-users
Closed, Resolved, Public

Description

We recently changed the default umask in HDFS, so new files are no longer world-readable. While importing data from Kafka, Camus writes into working directories in /wmf/camus. Since files inherit the group of their parent directories, these directories need to be chgrp'ed to analytics-privatedata-users; otherwise, once the data is finalized by moving it into /wmf/data/raw, it will only be readable by analytics.
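To make the umask effect concrete, here is a minimal sketch of the arithmetic. The task does not state the old or new umask values, so 022 and 027 below are assumptions chosen for illustration, and `effective_mode` is a hypothetical helper name. New files start from mode 666 and directories from 777, with the umask bits cleared.

```shell
#!/usr/bin/env bash
# Effective mode = base mode with the umask bits cleared.
# Arguments are octal digit strings. The umask values used below (022, 027)
# are assumed examples, not values taken from this task.
effective_mode() {
  local base="$1" mask="$2"
  printf '%03o\n' $(( 0$base & ~0$mask ))
}

effective_mode 666 022   # prints 644: files readable by everyone
effective_mode 666 027   # prints 640: files readable by owner and group only
effective_mode 777 027   # prints 750: directories closed to "other"
```

Under a umask like 027, access for anyone other than the owner depends entirely on the group, which is why the group of the Camus working directories matters.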

Event Timeline

For the record:
I had in mind that this data not being available to the analytics-privatedata-users group was done on purpose, as users should access the refined version of the table.
I understand, however, that this has limitations for one-offs such as the one we did for ATS-kafka.

Hm, I think we should encourage folks to use refined data, but the raw stuff should still be readable. It doesn't have any more privacy implications, and it will be useful to allow others to do data quality analysis too.

Ottomata triaged this task as High priority.
Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.
hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus

Could also do the files in /wmf/data/raw, but the directories there are all correct afaict. Will just wait until the poorly chgrped files are deleted (unless someone needs access sooner).

Mentioned in SAL (#wikimedia-analytics) [2021-03-18T19:02:43Z] <ottomata> hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus - T275396

Checked some recently imported raw data and it is readable by analytics-privatedata-users.
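For anyone wanting to repeat that check, a rough sketch follows. The helper names `hdfs_group` and `has_group` are hypothetical, and this is not necessarily how the check above was done. It relies on the `%g` format of `hdfs dfs -stat`, which prints the owning group. It only confirms group ownership, so the group read bit still needs a look, for example with `hdfs dfs -ls -d`.

```shell
#!/usr/bin/env bash
# Print the group that owns an HDFS path (uses the %g format of `hdfs dfs -stat`).
hdfs_group() {
  hdfs dfs -stat '%g' "$1"
}

# Succeed only if the path is group-owned by the expected group.
has_group() {
  local path="$1" expected="$2"
  [ "$(hdfs_group "$path")" = "$expected" ]
}

# Example (needs an HDFS client and access to the path):
#   has_group /wmf/camus analytics-privatedata-users && echo "group ok"
#   hdfs dfs -ls -d /wmf/camus    # also shows the mode bits
```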