
Decide how to make datasets owned by analytics-search-users also readable by analytics-privatedata-users
Closed, Resolved | Public | 2 Estimated Story Points

Description

Some of the data generated and owned by analytics-search-users might be of some value to others, esp. if we're starting to advertise them via datahub (T374118).

Search datasets containing no PII should definitely be world-readable, but the ones containing PII should probably be readable only by analytics-search-users and analytics-privatedata-users.

Possible ideas:

Event Timeline

I think everyone in analytics-search-users is also in analytics-privatedata-users (otherwise they would have no Hadoop access).

Can you just chgrp these all to analytics-privatedata-users?

The reason we originally started using different users/groups was to silo permissions.
I think it's still good to keep write and read abilities separate: only search-user can write to these datasets, but we want all private-data-users to be able to read them.

For future files created by spark:

  • We set spark.hadoop.fs.permissions.umask-mode: 0022 for explicitly public data. We can set it to 0027 everywhere else, and Spark should then create files with the expected permissions.
  • Per the HDFS documentation, the group of created files and directories is taken from the parent directory's group. Testing with the pyspark shell, this appears to be the case.
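
To make the umask values above concrete, here is a small sketch of the arithmetic. It assumes the umask is applied to a base mode of 777 for directories and 666 for files, which matches the 750/640 and 755/644 modes used elsewhere in this task. The helper name apply_umask is made up for illustration.

```shell
# Compute the mode left after masking a base mode with a umask.
# Assumption: base mode is 777 for directories and 666 for files.
# apply_umask is a hypothetical helper, not part of any Hadoop tooling.
apply_umask() {
    base=$1
    mask=$2
    # Prefixing with 0 makes the shell read both values as octal.
    printf '%03o\n' "$(( 0$base & ~0$mask ))"
}

apply_umask 777 0027   # directories created with umask 0027
apply_umask 666 0027   # files created with umask 0027
apply_umask 777 0022   # directories created with umask 0022
apply_umask 666 0022   # files created with umask 0022
```

If that assumption holds, 0027 gives group read access and no access for others, while 0022 leaves everything world-readable.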

For files already in hdfs:

  • In a quick test it appears that hdfs dfs -chmod -R will apply the same permissions to files and directories, but we want 750 on dirs and 640 on files. So we can't use the native recursive chmod directly.
  • hdfs dfs -find does not support -type d or -type f.
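
One possible way around the missing -type filter, without going through the /mnt/hdfs mount, would be to split a recursive listing by the first character of the permission string. This is only a sketch and was not tested against the cluster: it assumes the listing has the usual eight whitespace-separated columns with the path last, so paths containing whitespace would break it. The function name and the sample listing are made up for illustration.

```shell
# Read `hdfs dfs -ls -R`-style lines on stdin and print the paths of one kind.
# $1 is "d" for directories or "f" for regular files.
# Assumption: the path is the 8th whitespace-separated field.
paths_of_type() {
    case $1 in
        d) awk '$1 ~ /^d/ { print $8 }' ;;
        f) awk '$1 ~ /^-/ { print $8 }' ;;
        *) echo "usage: paths_of_type d|f" >&2; return 2 ;;
    esac
}

# Hypothetical sample listing, for illustration only.
sample_listing='drwxr-x---   - analytics-search analytics-privatedata-users          0 2024-01-01 00:00 /wmf/data/discovery/example
-rw-r-----   3 analytics-search analytics-privatedata-users       1024 2024-01-01 00:00 /wmf/data/discovery/example/part-00000'

printf '%s\n' "$sample_listing" | paths_of_type d
printf '%s\n' "$sample_listing" | paths_of_type f
```

In practice the input would come from a recursive hdfs dfs -ls, and the output would be fed to xargs hdfs dfs -chmod with 750 for the directory list and 640 for the file list.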

The following commands use an alias to become analytics-search:

  • alias krc='sudo -u analytics-search kerberos-run-command analytics-search'

Fixing ownership:

/mnt/hdfs/$ krc hdfs dfs -chown -R analytics-search:analytics-privatedata-users /wmf/data/discovery

Fixing current directory permissions:
Updated permissions: drwxr-x---

/mnt/hdfs/$ find wmf/data/discovery/ \
    -type d -print0 \
    | xargs -0 printf -- '/%s\0' \
    | krc xargs -0 hdfs dfs -chmod 750
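
For anyone wondering what the middle of that pipeline does: find runs against the local mount and prints relative paths, and the printf stage prefixes each with / so they become absolute HDFS paths, keeping the NUL separators so odd filenames survive. The sketch below reproduces this on a throwaway local tree (the directory and file names are made up) and ends in tr instead of hdfs dfs -chmod so the argument list can be inspected.

```shell
# Build a throwaway tree standing in for the /mnt/hdfs mount (hypothetical names).
demo_root=$(mktemp -d)
mkdir -p "$demo_root/wmf/data/discovery/sub dir"
touch "$demo_root/wmf/data/discovery/sub dir/part-00000"

# Same find | xargs printf stages as above; tr turns the NUL separators
# into newlines so the resulting paths can be read.
abs_dirs=$(cd "$demo_root" && find wmf/data/discovery/ -type d -print0 \
    | xargs -0 printf -- '/%s\0' \
    | tr '\0' '\n' | LC_ALL=C sort)

printf '%s\n' "$abs_dirs"
rm -rf "$demo_root"
```

Only directories are listed, and the path with a space in it comes through as a single entry.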

Fixing current file permissions. In theory we could probably skip this and let the directories handle access, but it seems best to make it all nice and tidy:
Updated permissions: -rw-r-----

/mnt/hdfs/$ find wmf/data/discovery/ \
    -type f -print0 \
    | xargs -0 printf -- '/%s\0' \
    | krc xargs -0 hdfs dfs -chmod 640

Then I had to go back and make the directories/files we explicitly mark public (via umask of 0022) back to public, along with their parent directories up to /wmf/data/discovery:

  • /wmf/data/discovery/cirrus/index
  • /wmf/data/discovery/cirrus/index_without_content
  • /wmf/data/discovery/query_service
  • /wmf/data/discovery/wikidata
  • /wmf/data/discovery/wdqs
  • /wmf/data/discovery/wcqs

Via the commands:
Updated permissions: drwxr-xr-x

/mnt/hdfs$ find wmf/data/discovery/{cirrus/index,cirrus/index_without_content,query_service,wikidata,wdqs,wcqs} \
    -type d -print0 \
    | xargs -0 printf -- '/%s\0' \
    | krc xargs -0 hdfs dfs -chmod 755

Updated permissions: -rw-r--r--

/mnt/hdfs$ find wmf/data/discovery/{cirrus/index,cirrus/index_without_content,query_service,wikidata,wdqs,wcqs} \
    -type f -print0 \
    | xargs -0 printf -- '/%s\0' \
    | krc xargs -0 hdfs dfs -chmod 644
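
The commands above cover the public directories and everything beneath them, but not the parent directories mentioned earlier. A sketch of how those parents could be enumerated follows. It assumes /wmf/data/discovery itself is included, since a public path is only reachable if every ancestor allows traversal. The function name ancestors_up_to is made up for illustration, and this is not necessarily how it was actually done.

```shell
# Print a path's ancestor directories, from its immediate parent
# up to and including root.
# Assumptions: path lies underneath root, and both are absolute paths
# without a trailing slash. ancestors_up_to is a hypothetical helper.
ancestors_up_to() {
    path=$1
    root=$2
    while [ "$path" != "$root" ] && [ "$path" != "/" ]; do
        path=$(dirname "$path")
        printf '%s\n' "$path"
    done
}

ancestors_up_to /wmf/data/discovery/cirrus/index /wmf/data/discovery
```

The combined output for all six public paths could be deduplicated with sort -u and passed to a non-recursive hdfs dfs -chmod 755, which would leave sibling directories at 750.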
dr0ptp4kt triaged this task as Medium priority.
dr0ptp4kt set the point value for this task to 2.
dr0ptp4kt moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.