We recently changed the default umask in HDFS, so newly written files are no longer world readable. While importing data from Kafka, Camus writes into working directories in /wmf/camus. Since files inherit the group ownership of their parent dirs, these dirs need to be chgrp'ed to analytics-privatedata-users; otherwise, once the data is finalized by moving it into /wmf/raw, it will only be readable by analytics.
Description
Event Timeline
For the record:
I had in mind that this data not being available to the analytics-privatedata-users group was on purpose, as users should access the refined version of the table.
I understand, however, that this has limitations for one-offs such as the one we did for ATS-kafka.
Hm, I think we should encourage folks to use refined data, but the raw stuff should still be readable. It doesn't have any more privacy implications, and it will be useful to allow others to do data quality analysis too.
hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus
Could also do the files in /wmf/data/raw, but the directories there are all correct afaict. Will just wait until the poorly chgrped files are deleted (unless someone needs access sooner).
Mentioned in SAL (#wikimedia-analytics) [2021-03-18T19:02:43Z] <ottomata> hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus - T275396
Checked some recently imported raw data and it is readable by analytics-privatedata-users.
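For anyone wanting to repeat this kind of check, here is a minimal sketch. It assumes the usual eight-column `hdfs dfs -ls` listing (permissions, replication, owner, group, size, date, time, path); the helper name `wrong_group_paths` is hypothetical and not part of any existing tooling. It reads a listing on stdin and prints the paths whose group is not the expected one.

```shell
# Print paths whose group differs from the expected group.
# Reads `hdfs dfs -ls`-style lines on stdin, assumed to look like:
#   perms replication owner group size date time path
# Lines with fewer than 8 fields (e.g. "Found N items") are skipped.
# Caveat: paths containing whitespace are not handled by this sketch.
wrong_group_paths() {
    expected_group="$1"
    awk -v g="$expected_group" 'NF >= 8 && $4 != g { print $8 }'
}

# Usage on a host with HDFS access (not run here):
#   hdfs dfs -ls -R /wmf/camus | wrong_group_paths analytics-privatedata-users
```

No output would mean every listed entry already has the expected group. Note this only checks the group column, not whether the permission bits actually grant group read.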