Refactor Analytics POSIX groups in puppet to improve maintainability
Closed, Resolved | Public | 8 Estimated Story Points

Description

There are currently multiple POSIX groups in data.yaml related to Analytics:

  • analytics-users
  • statistics-users
  • statistics-admins
  • analytics-wmde-users
  • statistics-privatedata-users
  • researchers
  • analytics-privatedata-users

https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups

The groups are meant to do two things:

  1. Allow access to some data (on HDFS or not).
  2. Allow access to some stat/notebook hosts.

In T243934 we'd like to use a single puppet role/configuration for the stat100x nodes, and eventually deprecate the notebook100x ones (folding their functionality into the stat100x hosts). The idea is the following:

  1. Reduce the number of POSIX groups to: analytics-users, analytics-wmde-users and analytics-privatedata-users
  2. Set the above three groups as admin groups for the stat100x roles.
  3. Move users from other groups into one of the above, depending on the use case.

About 3), a simple way to do it would be to fold all users of statistics-users, statistics-privatedata-users and researchers into analytics and leave the ones in analytics-privatedata-users untouched.
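As a rough illustration of step 3, the folding could be sketched as a pure function over group memberships. This is only a sketch: the `FOLD_INTO` mapping, the function name, and the plain dict layout are assumptions made for the example (the real data lives in data.yaml), and it assumes that "analytics" above means analytics-users.

```python
from typing import Dict, Set

# Hypothetical mapping of deprecated group -> target group.
# Assumption: "analytics" in the description means analytics-users.
FOLD_INTO = {
    "statistics-users": "analytics-users",
    "statistics-privatedata-users": "analytics-users",
    "researchers": "analytics-users",
}


def fold_groups(
    groups: Dict[str, Set[str]], fold_into: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Return a new membership map with the deprecated groups folded away."""
    # Keep every group that is not being deprecated (copying the member sets).
    result = {
        name: set(members)
        for name, members in groups.items()
        if name not in fold_into
    }
    # Add the members of each deprecated group to its target group.
    for old, new in fold_into.items():
        result.setdefault(new, set()).update(groups.get(old, set()))
    return result
```

Groups that are not listed in the mapping (for example analytics-privatedata-users) come out unchanged, which matches the "leave untouched" part of the proposal.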

In order to better protect PII data on shared client environments, we'd also like to deploy a script, run by a systemd timer every X minutes, that chmods/chowns the home directories of analytics-privatedata-users members to 750/$user:analytics-privatedata-users. This would add a little more protection for PII data downloaded from Hadoop to localhost without proper file permissions (difficult for us to check given the number of users/files created every day), since only members of analytics-privatedata-users would be able to read the files in each other's home directories. Locking all homes to 700 would also be possible, but some issues might arise:

  • people sometimes need to exchange files etc., and might copy PII data to /tmp or similar as a workaround, with a high risk of forgetting to delete the files from there.
  • people in analytics-privatedata-users can sudo as the analytics-privatedata user to use its Kerberos keytab for long-running jobs; if they do so from their home directory, it might be a problem (access denied by the 700).
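A minimal sketch of what such a periodic script could look like, in Python. The function name, the `/home` root, and the `set_owner` switch are assumptions made for the example and are not taken from the puppet repository; the list of users would have to come from the group membership.

```python
import os
import shutil
import stat
from pathlib import Path
from typing import Iterable, List

GROUP = "analytics-privatedata-users"
HOME_MODE = 0o750  # rwx for the owner, r-x for the group, nothing for others


def enforce_home_permissions(
    users: Iterable[str],
    home_root: Path = Path("/home"),
    group: str = GROUP,
    mode: int = HOME_MODE,
    set_owner: bool = True,
) -> List[Path]:
    """Set each user's home directory to `mode` and `$user:group`.

    Only the top-level home directory is touched, not its contents.
    Returns the homes whose mode had to be changed.
    """
    fixed = []
    for user in users:
        home = home_root / user
        # Skip missing homes and symlinks (a symlink could point anywhere).
        if home.is_symlink() or not home.is_dir():
            continue
        if stat.S_IMODE(home.stat().st_mode) != mode:
            os.chmod(home, mode)
            fixed.append(home)
        if set_owner:
            # Needs root, and both the user and the group must exist locally.
            shutil.chown(home, user=user, group=group)
    return fixed
```

The function is idempotent, so running it from a timer every few minutes only changes homes that have drifted from the expected mode.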

Details of each POSIX group
  • analytics-users

Access to stat1004/notebook100x without any other meaningful permission. Most of the users in there (very few) are already in researchers or analytics-privatedata-users.

  • statistics-users

Access to stat1006, and possibly to the Eventlogging backup files (containing PII data).

  • statistics-admins

Old group that should not be needed anymore, so I propose to drop it.

  • analytics-wmde-users

Related to a specific use case for WMDE, I'd keep it for the moment and maybe fold in to another one in the future.

  • statistics-privatedata-users

Access to stat1006 and stat1007, together with read permissions on statistics::mysql_credentials (allowing users to query the Analytics mysql dbstore hosts, which hold a replica of the wiki dbs).

  • researchers

Access to stat1006 and notebook100x, together with read permissions on statistics::mysql_credentials (allowing users to query the Analytics mysql dbstore hosts, which hold a replica of the wiki dbs).

  • analytics-privatedata-users

Access to most of the stat/notebook hosts, plus readability of PII datasets on Hadoop. Note: Kerberos credentials are now needed to use Hadoop, being a member of the group is not the only requirement anymore.

Datasets containing PII data on various hosts
  • statistics::mysql_credentials - the define is deployed for various groups but all of them are using the research credentials to access mysql dbstore wiki replicas.
  • Eventlogging log archives (stat100[6,7], /srv/log/eventlogging). They are currently readable by users able to ssh to the hosts in which they are, probably we'd need to restrict readability to analytics-privatedata-users before the refactor.
  • MW api logs (stat1007, /srv/log/mw-log/archive). Same thing as above.
  • Hadoop PII datasets (readable by hosts with Hadoop packages installed, and by users with Kerberos principals).

We could introduce the following convention for analytics-privatedata-users:

  • belonging to the group would allow reading the Eventlogging/MW/etc. datasets containing PII data.
  • also having a Kerberos principal would allow querying more PII datasets on Hadoop, but it would not be mandatory for a user. We already have a way to track principals in data.yaml.
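The convention can be summarized as a small decision function. This is only an illustration of the two rules above: the function name and the dataset labels are hypothetical, not real identifiers.

```python
from typing import Set


def readable_pii_datasets(
    in_privatedata_group: bool, has_kerberos_principal: bool
) -> Set[str]:
    """Return the (hypothetically labelled) PII datasets a user could read."""
    datasets: Set[str] = set()
    if in_privatedata_group:
        # Group membership alone covers the file-based datasets on the hosts.
        datasets.update({"eventlogging-archives", "mw-api-logs"})
        if has_kerberos_principal:
            # Hadoop PII data needs the group AND a Kerberos principal.
            datasets.add("hadoop-pii")
    return datasets
```

A Kerberos principal without group membership yields nothing here, since the group stays a requirement for PII data on Hadoop.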

@Ottomata please amend if I wrote anything not correct :)

Event Timeline

Looks good to me, what does the "Set the above two groups as admin groups for the stat100x roles." refer to? There are three groups listed above.

Fixing it, I meant to write 3 :)

Ah, sorry for commenting on an outdated version!

Maybe this is the answer to my question: there is very little point to production access without private data, so we just aren't using those groups. 😁

In that case, why not go all the way to a single group for analytics users which includes private data access? Perhaps in the future, we'll start to genuinely segregate data based on sensitivity (I see it's being discussed in T245833), and of course if we do that, we can introduce a new tier. But currently, it doesn't seem like there's much point.

There are other use cases for people using the stat boxes, that often don't involve private data at all. One example could be to work on GPUs with tensorflow and public data, to train a model. Maybe in the future we could think about private only, but for the moment it seems that we'd cut off some important use cases :)

Hmm, I see! In practice, I imagine this only applies to users who already have production access for another reason. If a person asked for production access solely for something like GPU access, I think they would be denied even if they were trusted!

I don't follow your line of thought, why should we deny access to hosts if there is a valid use case? The GPU one is only the first one that came to mind, but there are others. Ideally we should encourage people to use public datasets, and WMCS might not be the best place for every use case (resource constraints, GPUs, etc..).

Maybe you're thinking about, say, a software engineer who has cluster access for deployments but not for private data access and wants to train models using public data on the Analytics clients? That seems fairly reasonable, although also a bit niche: not all WMF software engineers even have production access in the first place, although based on some spot checks it does look like most do.

Nope, I am thinking of somebody who wants to use the stat boxes for Analytics purposes, but maybe not only for private data. Again, working with public data is an example. Can you share your concerns? I am not getting what the problem is :)

==== Details of each POSIX group ====
[...]
analytics-wmde-users: Related to a specific use case for WMDE, I'd keep it for the moment and maybe fold in to another one in the future.

It sounds like the use case here is shared responsibility for WMDE's analytics jobs; if so, that seems quite related to T230743. Maybe we can generalize it now? It could be a bit different because we were thinking about a single user and this is a group, but perhaps a "team" user like that needs its own group.

I think that we are going a bit out of scope - this task is only to refactor what is already there, not to add more groups. We already have a similar group for the search team, which goes with a dedicated user (analytics-search) and related keytab credentials. The users in the related group can sudo as this user, more or less how users in analytics-privatedata-users can sudo as the user analytics-privatedata and get access to its kerberos credential. I am ok with creating a similar thing for Product Analytics, but I'd like to avoid using a single identity for all teams (that are not analytics). Let's discuss your use case in the related task :)

Reduce the number of POSIX groups to: analytics, analytics-wmde-users and analytics-privatedata

Do you mean analytics-users and analytics-privatedata-users?

+1 to everything if so!

yes correct sorry! Fixing :)

Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

There are other use cases for people using the stat boxes, that often don't involve private data at all. One example could be to work on GPUs with tensorflow and public data, to train a model. Maybe in the future we could think about private only, but for the moment it seems that we'd cut off some important use cases :)
[...]
I don't follow your line of thought, why should we deny access to hosts if there is a valid use case? The GPU one is only the first one that came to mind, but there are others. Ideally we should encourage people to use public datasets, and WMCS might not be the best place for every use case (resource constraints, GPUs, etc..).
[...]
Nope, I am thinking of somebody who wants to use the stat boxes for Analytics purposes, but maybe not only for private data. Again, working with public data is an example. Can you share your concerns? I am not getting what the problem is :)

I don't have any serious concerns, so you don't have to pay me too much attention; I joined the discussion to raise a different point and now I'm mostly continuing out of curiosity 😊

That said, my point is that giving people production cluster access is a big inherent security risk. We have to do it, obviously, but the guiding principle (as I understand it) is that we limit it to those who have a clear, ongoing need for it. Giving someone access to the production cluster just so they can use a GPU or a JupyterHub instance seems like it fails that test; if they need the hardware or software, but not the private data, there are plenty of other ways they can get it without the added risk to the site and the private data.

I think that we are going a bit out of scope - this task is only to refactor what is already there, not to add more groups. We already have a similar group for the search team, which goes with a dedicated user (analytics-search) and related keytab credentials. The users in the related group can sudo as this user, more or less how users in analytics-privatedata-users can sudo as the user analytics-privatedata and get access to its kerberos credential. I am ok with creating a similar thing for Product Analytics, but I'd like to avoid using a single identity for all teams (that are not analytics). Let's discuss your use case in the related task :)

Makes sense! I wasn't sure how exactly it would work, but I thought there might be some opportunity for standardization. Thanks for explaining why not!

I don't have any serious concerns, so you don't have to pay me too much attention; I joined the discussion to raise a different point and now I'm mostly continuing out of curiosity 😊

And you are more than welcome, please keep doing so :) My point was only to provide context about suggestions etc.., otherwise it is difficult to judge if they are blockers/hard-requirements!

I tried to add a script that periodically enforces proper home dir permissions, but I had to revert it since the admin module is of course already doing this job. I will try to find a way to add this functionality to the module itself, even if it is not straightforward.

On another note, we are close to moving stat1006 to role::statistics::explorer. The main blockers are the following groups:

  • statistics-privatedata-users
  • statistics-admins
  • researchers

Those are not deployed on explorer, so I'd like to see if we can deprecate/merge them before proceeding.

Change 576845 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: deprecate two old analytics posix groups

https://gerrit.wikimedia.org/r/576845

Change 576845 merged by Elukey:
[operations/puppet@production] admin: deprecate two old analytics posix groups

https://gerrit.wikimedia.org/r/576845

I checked all the statistics-users members and all except two are already in other groups (analytics-privatedata-users, statistics-privatedata-users, researchers): brion, mhurd. We could reach out to them and ask if they want to move to another group or simply be removed from it.
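This kind of check boils down to a set difference over group memberships. A small sketch follows; the function name and the dict layout are assumptions for the example, not an existing tool.

```python
from typing import Dict, Iterable, List, Set


def members_only_in(
    group: str, groups: Dict[str, Set[str]], other_groups: Iterable[str]
) -> List[str]:
    """Return the members of `group` that are in none of `other_groups`."""
    # Collect everybody who is already covered by one of the other groups.
    covered: Set[str] = set()
    for other in other_groups:
        covered |= groups.get(other, set())
    # Whoever is left would lose access if `group` were simply dropped.
    return sorted(groups.get(group, set()) - covered)
```

Called with statistics-users as `group` and analytics-privatedata-users, statistics-privatedata-users and researchers as `other_groups`, it would list the users who need to be contacted before the group is removed.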

Change 578488 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: deprecate statistics-users

https://gerrit.wikimedia.org/r/578488

Change 578500 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: deprecate statistics-privatedata-users

https://gerrit.wikimedia.org/r/578500

Change 578488 merged by Elukey:
[operations/puppet@production] admin: deprecate statistics-users

https://gerrit.wikimedia.org/r/578488

Change 578500 merged by Elukey:
[operations/puppet@production] admin: deprecate statistics-privatedata-users

https://gerrit.wikimedia.org/r/578500

Next steps:

  • decide what to do with the researchers posix group (fold it in analytics-privatedata-users, etc..)

Change 579228 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: simplify and document some analytics posix groups

https://gerrit.wikimedia.org/r/579228

Change 579228 merged by Elukey:
[operations/puppet@production] admin: simplify and document some analytics posix groups

https://gerrit.wikimedia.org/r/579228

elukey set the point value for this task to 8. (Mar 17 2020, 9:10 AM)
elukey added a project: Analytics-Kanban.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.

Change 581983 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: refactor analytics-related groups and add documentation

https://gerrit.wikimedia.org/r/581983

Change 581983 abandoned by Elukey:
admin: refactor analytics-related groups and add documentation

Reason:
breaking the changes in multiple pieces

https://gerrit.wikimedia.org/r/581983

Change 582064 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: add more documentation to analytics posix groups

https://gerrit.wikimedia.org/r/582064

Change 582066 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: deprecate piwik-roots

https://gerrit.wikimedia.org/r/582066

Change 582068 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: flag notebook-roots as deprecated

https://gerrit.wikimedia.org/r/582068

Change 582070 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: deprecate the eventlogging-roots group

https://gerrit.wikimedia.org/r/582070

Change 582071 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: use the *analytics_admins_members placeholder when possible

https://gerrit.wikimedia.org/r/582071

Change 582072 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] admin: deprecate aqs-users

https://gerrit.wikimedia.org/r/582072

Change 582440 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::sites::analytics: remove reference to statistics-web-users

https://gerrit.wikimedia.org/r/582440

Change 582440 merged by Elukey:
[operations/puppet@production] statistics::sites::analytics: remove reference to statistics-web-users

https://gerrit.wikimedia.org/r/582440

Change 582064 merged by Elukey:
[operations/puppet@production] admin: add more documentation to analytics posix groups

https://gerrit.wikimedia.org/r/582064

Change 582066 merged by Elukey:
[operations/puppet@production] admin: deprecate piwik-roots

https://gerrit.wikimedia.org/r/582066

Change 582068 merged by Elukey:
[operations/puppet@production] admin: flag notebook-roots as deprecated

https://gerrit.wikimedia.org/r/582068

Change 582070 merged by Elukey:
[operations/puppet@production] admin: refactor eventlogging-related groups

https://gerrit.wikimedia.org/r/582070

Change 582071 merged by Elukey:
[operations/puppet@production] admin: use the *analytics_admins_members placeholder when possible

https://gerrit.wikimedia.org/r/582071

Change 582072 merged by Elukey:
[operations/puppet@production] admin: deprecate aqs-users

https://gerrit.wikimedia.org/r/582072