There are currently multiple POSIX groups in data.yaml related to Analytics:
- analytics-users
- statistics-users
- statistics-admins
- analytics-wmde-users
- statistics-privatedata-users
- researchers
- analytics-privatedata-users
https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups
The groups are meant to do two things:
- Allow access to some data (on HDFS or not).
- Allow access to some stat/notebook hosts.
In T243934 we'd like to use a single puppet role/configuration for the stat100x nodes, and eventually deprecate the notebook100x ones (folding their functionality into the stat100x hosts). The idea is the following:
- Reduce the number of POSIX groups to: analytics-users, analytics-wmde-users and analytics-privatedata-users.
- Set the above three groups as admin groups for the stat100x roles.
- Move users from other groups into one of the above, depending on the use case.
About the third point, a simple way to do it would be to fold all users of statistics-users, statistics-privatedata-users and researchers into analytics and leave the ones in analytics-privatedata-users untouched.
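The folding step could be sketched roughly as below. This is only an illustration: it assumes the groups are available as a mapping of group name to a dict with a `members` list (loosely modelled on data.yaml), and the function name `fold_groups`, the sample users and the gids are all hypothetical. The target group is a parameter, and the sample call picks analytics-users purely as an example.

```python
def fold_groups(groups, sources, target):
    """Return a copy of `groups` with members of `sources` merged into `target`.

    The source groups are dropped from the result. Member order is preserved
    and duplicates are skipped. The input mapping is not modified.
    """
    merged = list(groups[target]["members"])
    for name in sources:
        for user in groups[name]["members"]:
            if user not in merged:
                merged.append(user)
    # Shallow-copy every surviving group so the caller's data stays untouched.
    result = {
        name: dict(spec) for name, spec in groups.items() if name not in sources
    }
    result[target]["members"] = merged
    return result


# Hypothetical sample data: these are not real users or gids.
groups = {
    "analytics-users": {"gid": 1001, "members": ["alice"]},
    "statistics-users": {"gid": 1002, "members": ["bob", "alice"]},
    "researchers": {"gid": 1003, "members": ["carol"]},
    "analytics-privatedata-users": {"gid": 1004, "members": ["dave"]},
}
folded = fold_groups(groups, ["statistics-users", "researchers"], "analytics-users")
print(folded["analytics-users"]["members"])
```

Keeping the operation as a pure function makes it easy to diff the before/after membership and review it before touching the real file.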
To better protect PII data on shared client environments, we'd also like to deploy a script, run by a systemd timer every X minutes, that sets the home directories of analytics-privatedata-users members to mode 750 and ownership $user:analytics-privatedata-users. This would add some protection for PII data downloaded from Hadoop to the local host and left without proper file permissions (something that is difficult for us to check, given the number of users and files created every day), since only members of analytics-privatedata-users would be able to read each other's home directories. Locking all homes down to 700 would also be possible, but some issues might arise:
- people sometimes need to exchange files etc., and might copy PII data to /tmp or similar as a workaround, with a high risk of forgetting to delete the files from there.
- people in analytics-privatedata-users can sudo to the analytics-privatedata user and use its Kerberos keytab for long-running jobs; if they run those jobs from their home directory, the 700 mode might be a problem (access denied).
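A minimal sketch of what the timer-driven script might do is shown below. It is not the actual script: the function names are hypothetical, it only touches the top-level home directory (not its contents), and it assumes group members can be listed via the local group database. Users who have the group only as their primary group do not appear in `gr_mem` and would be missed by this approach.

```python
import grp
import os
import pwd
import stat


def enforce_home_permissions(home, uid, gid, mode=0o750):
    """Set owner, group and mode on one directory if they differ.

    Returns True when something was changed, False otherwise.
    """
    st = os.stat(home)
    changed = False
    if (st.st_uid, st.st_gid) != (uid, gid):
        os.chown(home, uid, gid)  # needs root when the owner or group changes
        changed = True
    if stat.S_IMODE(st.st_mode) != mode:
        os.chmod(home, mode)
        changed = True
    return changed


def enforce_for_group(group_name):
    """Apply enforce_home_permissions to the home of every listed group member.

    Returns the list of home directories that had to be fixed.
    """
    group = grp.getgrnam(group_name)
    fixed = []
    for user in group.gr_mem:
        try:
            entry = pwd.getpwnam(user)
        except KeyError:
            continue  # listed in the group but without a passwd entry
        if not os.path.isdir(entry.pw_dir):
            continue  # home not created on this host
        if enforce_home_permissions(entry.pw_dir, entry.pw_uid, group.gr_gid):
            fixed.append(entry.pw_dir)
    return fixed
```

Run as root from the timer, the entry point would be something like `enforce_for_group("analytics-privatedata-users")`. Because each directory is only changed when it deviates, repeated runs are cheap and the returned list can be logged to see how often permissions drift.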
Details of each POSIX group
- analytics-users
Access to stat1004/notebook100x without any other meaningful permission. Most of the users in there (very few) are already in researchers or analytics-privatedata-users.
- statistics-users
Access to stat1006, and possibly to the Eventlogging backup files (containing PII data).
- statistics-admins
Old group that should not be needed anymore, so I propose to drop it.
- analytics-wmde-users
Related to a specific use case for WMDE. I'd keep it for the moment and maybe fold it into another one in the future.
- statistics-privatedata-users
Access to stat1006 and stat1007, together with read permissions on statistics::mysql_credentials (allowing queries against the Analytics mysql dbstore hosts, which hold a replica of the wiki dbs).
- researchers
Access to stat1006 and notebook100x, together with read permissions on statistics::mysql_credentials (allowing queries against the Analytics mysql dbstore hosts, which hold a replica of the wiki dbs).
- analytics-privatedata-users
Access to most of the stat/notebook hosts, plus read access to PII datasets on Hadoop. Note: Kerberos credentials are now needed to use Hadoop, so being a member of the group is no longer sufficient on its own.
Datasets containing PII data on various hosts
- statistics::mysql_credentials - the define is deployed for various groups but all of them are using the research credentials to access mysql dbstore wiki replicas.
- Eventlogging log archives (stat100[6,7], /srv/log/eventlogging). They are currently readable by any user able to ssh to the hosts that hold them; we'd probably need to restrict read access to analytics-privatedata-users before the refactor.
- MW api logs (stat1007, /srv/log/mw-log/archive). Same thing as above.
- Hadoop PII datasets (readable by hosts with Hadoop packages installed, and by users with Kerberos principals).
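Before restricting the log archives, it may help to see which files are currently readable by everyone on the host. The sketch below is one possible way to audit that; the function name is hypothetical and it only looks at the "other" read bit, not at ACLs or at the permissions of parent directories.

```python
import os
import stat


def find_world_readable(root):
    """Yield paths under `root` whose mode grants read access to 'other'."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue  # the mode of a symlink itself is not meaningful
            if mode & stat.S_IROTH:
                yield path
```

For example, `list(find_world_readable("/srv/log/eventlogging"))` would list the entries that a later chgrp/chmod pass would need to cover.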
We could introduce the following convention for analytics-privatedata-users:
- belonging to the group would grant read access to the Eventlogging/MW/etc. datasets containing PII data.
- also having a Kerberos principal would allow querying more PII datasets on Hadoop, but it would not be mandatory for a user. We already have a way to track principals in data.yaml.
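The convention can be written down as a small decision function, which might make it easier to reason about edge cases. Everything here is illustrative: the `User` type and the source labels are hypothetical names, and only the two datasets named explicitly above are modelled for plain group membership.

```python
from dataclasses import dataclass

PRIVATE_GROUP = "analytics-privatedata-users"


@dataclass(frozen=True)
class User:
    name: str
    groups: frozenset
    has_kerberos_principal: bool = False


def readable_pii_sources(user):
    """Return the PII data sources a user could read under the convention."""
    if PRIVATE_GROUP not in user.groups:
        return set()
    # Group membership alone covers the datasets stored on the stat hosts.
    sources = {"eventlogging-archives", "mw-api-logs"}
    # Hadoop additionally requires a Kerberos principal.
    if user.has_kerberos_principal:
        sources.add("hadoop-pii-datasets")
    return sources
```

Note that under this reading a Kerberos principal without group membership grants nothing, which matches the note that both are needed for Hadoop.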
@Ottomata please amend if I wrote anything not correct :)