
Unify puppet roles for stat and notebook hosts
Closed, ResolvedPublic

Description

We currently have a wide range of puppet roles for Analytics clients:

  • stat1004 (role::statistics::explorer) - generic Hadoop client node, terabytes of disk space for users
  • stat1005 (role::statistics::explorer::gpu) - generic Hadoop client node, terabytes of disk space for users, GPU (card, drivers, tools, etc.)
  • stat1006 (role::statistics::cruncher) - generic data crunching node, no Hadoop client config deployed, terabytes of disk space for users, access to Eventlogging data, runs report updater jobs via systemd timers
  • stat1007 (role::statistics::private) - generic Hadoop client node, terabytes of disk space for users, Report updater jobs running via systemd timers, geoip backup systemd timer.
  • notebook100[3,4] (role::swap) - Jupyter Notebook hosts, low space on disk for users, originally meant to be an alternative way to access Hadoop/HDFS without storing any data on the host itself.

After the introduction of Kerberos, the differences between stat100[4,5,6,7] are small, so we could think about refactoring all roles into one. Open questions:

  • where do we put Report Updater jobs, since other users need to access them? Should we deploy them only on some hosts, configured via puppet or hiera?
  • where do we run analytics only timers/jobs?
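
To make the hiera option in the first question concrete, here is a minimal sketch of a single shared role with per-host feature flags. It is a Python model for illustration only, not puppet code: the flag names, the profile labels, and the `profiles_for` helper are hypothetical and only loosely mirror the host descriptions above.

```python
# Illustrative model only: one shared role plus per-host flags,
# mimicking what per-host hiera overrides could express.
# All flag names and profile labels are hypothetical.

# Features every host gets unless a host explicitly disables them.
BASE_PROFILES = ["hadoop_client", "user_disk_space"]

# Per-host overrides, analogous to host-level hiera keys.
HOST_FLAGS = {
    "stat1004": {},
    "stat1005": {"gpu": True},
    "stat1006": {
        "hadoop_client": False,
        "eventlogging_data": True,
        "report_updater": True,
    },
    "stat1007": {"report_updater": True, "geoip_backup": True},
}


def profiles_for(host):
    """Return the list of features a host would get under the shared role."""
    flags = HOST_FLAGS.get(host, {})
    # Base features are on by default and can be switched off per host.
    profiles = [p for p in BASE_PROFILES if flags.get(p, True)]
    # Extra features are off by default and must be switched on per host.
    profiles += sorted(
        name
        for name, enabled in flags.items()
        if enabled and name not in BASE_PROFILES
    )
    return profiles
```

With this model, `profiles_for("stat1005")` returns `["hadoop_client", "user_disk_space", "gpu"]`, and `profiles_for("stat1006")` returns `["user_disk_space", "eventlogging_data", "report_updater"]`. The role stays identical everywhere, and only the flag data differs per host.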

Moreover, more people are using notebooks, and they have asked for more space on the hosts so they can also use them for local computations (not only as Hadoop clients). We could think about unifying the stat-related roles with role::swap, so every stat box would also have a Jupyter server on it, and drop support for notebook hosts.

Details

Related patches (subjects were not captured):

Repo               Branch      Lines +/-
operations/puppet  production  +19 -15
operations/puppet  production  +3 -1
operations/puppet  production  +2 -1
operations/puppet  production  +0 -4
operations/puppet  production  +11 -56
operations/puppet  production  +16 -9
operations/puppet  production  +9 -11
operations/puppet  production  +8 -10
operations/puppet  production  +0 -9
operations/puppet  production  +2 -6
operations/puppet  production  +40 -0
operations/puppet  production  +2 -0
operations/puppet  production  +5 -0
operations/puppet  production  +8 -0
operations/puppet  production  +123 -67
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +0 -28
operations/puppet  production  +26 -21
operations/puppet  production  +7 -0
operations/puppet  production  +42 -41
operations/puppet  production  +2 -0
operations/puppet  production  +3 -0
operations/puppet  production  +6 -0
operations/puppet  production  +20 -0
operations/puppet  production  +31 -2
operations/puppet  production  +20 -15
operations/puppet  production  +16 -83
operations/puppet  production  +0 -3

Event Timeline


High level idea about a way to simplify https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups:

  1. Since we have Kerberos authentication, I'd propose to just decide on a set of groups to deploy to all stat boxes.
  2. research and statistics-privatedata-users seem to overlap a lot in terms of what they grant. Should we just deprecate research and add the missing members to statistics-privatedata-users?
  3. analytics-users no longer seems valid, since it has very few users. I'd propose following up with them and either moving them to privatedata or removing them from the group, to finally deprecate analytics-users.
  4. statistics-users may be kept, even if only a few of its users are not in any privatedata group.

So to recap the proposal, we'd just keep three groups:

  • statistics-users - access to all stat boxes, no privatedata of any sort
  • statistics-privatedata-users - access to all stat boxes, mysql wiki shards plus some privatedata logs
  • analytics-privatedata-users - access to Hadoop + privatedata + all stat boxes
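
The membership changes implied by this three-group proposal can be sketched as plain set operations. This is only a sketch under stated assumptions: the usernames are placeholders, the `consolidate` helper is hypothetical, and treating "needs follow-up" as "not already in a kept group" is an assumption, not something the proposal states.

```python
# Hypothetical membership data; every username is a placeholder.
old_groups = {
    "research": {"alice", "bob"},
    "statistics-privatedata-users": {"bob", "carol"},
    "analytics-users": {"carol", "dave"},
    "statistics-users": {"erin"},
    "analytics-privatedata-users": {"frank"},
}


def consolidate(groups):
    """Apply the three-group proposal to a mapping of group -> members."""
    new_groups = {
        "statistics-users": set(groups["statistics-users"]),
        # research is deprecated; its members join statistics-privatedata-users.
        "statistics-privatedata-users": (
            groups["statistics-privatedata-users"] | groups["research"]
        ),
        "analytics-privatedata-users": set(groups["analytics-privatedata-users"]),
    }
    # Assumption: analytics-users members who are not already in a kept
    # group need a manual decision (move to a privatedata group or remove).
    already_placed = set().union(*new_groups.values())
    needs_follow_up = groups["analytics-users"] - already_placed
    return new_groups, needs_follow_up


new_groups, needs_follow_up = consolidate(old_groups)
```

With the placeholder data above, `new_groups["statistics-privatedata-users"]` becomes `{"alice", "bob", "carol"}` and `needs_follow_up` is `{"dave"}`, which is the list of people to contact before analytics-users can be dropped.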

Corner case: a user without privatedata permissions could read PII data that a user with privatedata privileges has downloaded to their home directory, if that directory lacks proper permissions. In theory, we could try to enforce rules for home dir permissions.
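
As a rough illustration of what enforcing home dir permissions could mean, the sketch below checks whether a directory is open to users outside its owner and group, and strips those bits if so. It is an assumption-laden example: the helper names are hypothetical, it only handles the "other" permission bits, and it does not cover users who share a Unix group with the directory owner.

```python
import os
import stat


def others_can_read(path):
    """Return True if users outside the owner and group can list or enter path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IROTH | stat.S_IXOTH))


def restrict_home(path):
    """Clear all 'other' permission bits, leaving owner and group bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~stat.S_IRWXO)
```

For example, a home directory with mode 0755 would be reported by `others_can_read` and end up as 0750 after `restrict_home`. Whether group bits should also be cleared depends on how the access groups map to Unix groups on the hosts, which this sketch does not assume.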

should we just deprecate research and add the missing members to statistics-privatedata-users
analytics-users seems not valid anymore, very few users, I'd just propose to follow up with them and either move them to privatedata or remove them from the group, to finally deprecate analytics-users.
statistics-users may be kept, even if few users of it are not in any privatedata group.

I'm fine with both of these ideas, but here's another. Should we just merge statistics-privatedata-users, research, and statistics-users into analytics-users? The statistics-privatedata-users stuff was about access to files stored locally on stat1007. Some of those still exist: eventlogging logs. I'm not sure they are actually needed. Even if they are, we can chown them with analytics-privatedata-users and just use that access group, like we do in HDFS, to restrict access.

Then we'd just have:

  • analytics-users - all stat boxes + mysql analytics dbs
  • analytics-privatedata-users - all stat boxes + Hadoop (via kerberos) + privatedata

Even better yes, I thought that some use cases were still to be supported (I always misremember stat-related stuff). Two groups will be way better!

Change 574032 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add a new Analytics role to an-launcher1001

https://gerrit.wikimedia.org/r/574032

Change 574032 merged by Elukey:
[operations/puppet@production] Add a new Analytics role to an-launcher1001

https://gerrit.wikimedia.org/r/574032

Change 574038 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add Hadoop common hiera configuration

https://gerrit.wikimedia.org/r/574038

Change 574038 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add Hadoop common hiera configuration

https://gerrit.wikimedia.org/r/574038

Change 574042 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos and base profiles

https://gerrit.wikimedia.org/r/574042

Change 574042 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos and base profiles

https://gerrit.wikimedia.org/r/574042

Change 574289 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add Analytics Refinery scap repo

https://gerrit.wikimedia.org/r/574289

Change 574289 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add Analytics Refinery scap repo

https://gerrit.wikimedia.org/r/574289

Change 574379 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add git proxy config for Analytics vlan

https://gerrit.wikimedia.org/r/574379

Change 574379 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add git proxy config for Analytics vlan

https://gerrit.wikimedia.org/r/574379

Change 574385 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add hdfs RU jobs

https://gerrit.wikimedia.org/r/574385

Change 574385 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add hdfs RU jobs

https://gerrit.wikimedia.org/r/574385

Change 574722 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move all Report Updater Jobs to an-launcher1001

https://gerrit.wikimedia.org/r/574722

Change 574780 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos settings for hive

https://gerrit.wikimedia.org/r/574780

Change 574780 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos settings for hive

https://gerrit.wikimedia.org/r/574780

Change 574786 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] reportupdate::job: use kerberos when needed

https://gerrit.wikimedia.org/r/574786

Change 574786 merged by Elukey:
[operations/puppet@production] reportupdate::job: use kerberos when needed

https://gerrit.wikimedia.org/r/574786

Change 574795 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::private: remove rsync to /mnt/hdfs

https://gerrit.wikimedia.org/r/574795

Change 574795 merged by Elukey:
[operations/puppet@production] role::statistics::private: remove rsync to /mnt/hdfs

https://gerrit.wikimedia.org/r/574795

Change 574843 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-launcher1001 to the list of statistics servers

https://gerrit.wikimedia.org/r/574843

Change 574843 merged by Elukey:
[operations/puppet@production] Add an-launcher1001 to the list of statistics servers

https://gerrit.wikimedia.org/r/574843

Change 575048 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-launcher1001 to profile::dumps::distribution

https://gerrit.wikimedia.org/r/575048

Change 575048 merged by Elukey:
[operations/puppet@production] Add an-launcher1001 to profile::dumps::distribution

https://gerrit.wikimedia.org/r/575048

Change 574722 merged by Elukey:
[operations/puppet@production] Move all Report Updater Jobs to an-launcher1001

https://gerrit.wikimedia.org/r/574722

Change 575470 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_mediawiki_dumps timers from stat1007 to an-launcher1001

https://gerrit.wikimedia.org/r/575470

Change 575470 merged by Elukey:
[operations/puppet@production] Move import_mediawiki_dumps timers from stat1007 to an-launcher1001

https://gerrit.wikimedia.org/r/575470

Change 575476 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_wikidata_entities_dumps timers to an-launcher1001

https://gerrit.wikimedia.org/r/575476

Change 575476 merged by Elukey:
[operations/puppet@production] Move import_wikidata_entities_dumps timers to an-launcher1001

https://gerrit.wikimedia.org/r/575476

Change 575488 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add statistics xml dataset mounts

https://gerrit.wikimedia.org/r/575488

Change 575488 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add statistics xml dataset mounts

https://gerrit.wikimedia.org/r/575488

elukey changed the task status from Open to Stalled. Mar 1 2020, 4:05 PM

The next step now is to complete T246578, setting this task to stalled in the meantime.

Then we'd just have:

  • analytics-users - all stat boxes + mysql analytics dbs
  • analytics-privatedata-users - all stat boxes + Hadoop (via kerberos) + privatedata

I'm very excited about the simplifications you're planning here!

However, this naming convention would imply that analytics-users doesn't grant access to sensitive data, even though it would include access to the internal wiki replicas, which include:

  • editor IP addresses in the cu_changes table
  • editor email addresses in the users table
  • revision-deleted usernames and comments in the archive and revision tables

So to recap the proposal, we'd just keep three groups:

  • statistics-users - access to all stat boxes, no privatedata of any sort

This isn't misleading, so that's good! If you don't have access to internal Hadoop or MariaDB or whatever private logs there are, you really don't have any private data access. However, if we don't want to give a user any private data, why give them production access in the first place? Cloud Services provides plenty of tools for analysis that doesn't rely on private data.

  1. analytics-users seems not valid anymore, very few users, I'd just propose to follow up with them and either move them to privatedata or remove them from the group, to finally deprecate analytics-users.
  2. statistics-users may be kept, even if few users of it are not in any privatedata group.

Maybe this is the answer to my question: there is very little point to production access without private data, so we just aren't using those groups. 😁

In that case, why not go all the way to a single group for analytics users which includes private data access? Perhaps in the future, we'll start to genuinely segregate data based on sensitivity (I see it's being discussed in T245833), and of course if we do that, we can introduce a new tier. But currently, it doesn't seem like there's much point.

Then we'd just have:

  • analytics-users - all stat boxes + mysql analytics dbs
  • analytics-privatedata-users - all stat boxes + Hadoop (via kerberos) + privatedata

I'm very excited about the simplifications you're planning here!

However, this naming convention would imply that analytics-users doesn't grant access to sensitive data, even though it would include access to the internal wiki replicas, which include:

  • editor IP addresses in the cu_changes table
  • editor email addresses in the users table
  • revision-deleted usernames and comments in the archive and revision tables

You are right, the above simplification was only an initial idea, the final proposal is in T246578. Indeed we are going to limit access to the wiki replicas to analytics-privatedata-users :)

So to recap the proposal, we'd just keep three groups:

  • statistics-users - access to all stat boxes, no privatedata of any sort

This isn't misleading, so that's good! If you don't have access to internal Hadoop or MariaDB or whatever private logs there are, you really don't have any private data access. However, if we don't want to give a user any private data, why give them production access in the first place? Cloud Services provides plenty of tools for analysis that doesn't rely on private data.

  1. analytics-users seems not valid anymore, very few users, I'd just propose to follow up with them and either move them to privatedata or remove them from the group, to finally deprecate analytics-users.
  2. statistics-users may be kept, even if few users of it are not in any privatedata group.

Maybe this is the answer to my question: there is very little point to production access without private data, so we just aren't using those groups. 😁

In that case, why not go all the way to a single group for analytics users which includes private data access? Perhaps in the future, we'll start to genuinely segregate data based on sensitivity (I see it's being discussed in T245833), and of course if we do that, we can introduce a new tier. But currently, it doesn't seem like there's much point.

There are other use cases for people using the stat boxes, that often don't involve private data at all. One example could be to work on GPUs with tensorflow and public data, to train a model. Maybe in the future we could think about private only, but for the moment it seems that we'd cut off some important use cases :)

Change 576384 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Ensure readability settings for home dirs of Analytics clients

https://gerrit.wikimedia.org/r/576384

Change 576384 merged by Elukey:
[operations/puppet@production] Ensure readability settings for home dirs of Analytics clients

https://gerrit.wikimedia.org/r/576384

Change 577278 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add xmldumps to stat100[4,5]

https://gerrit.wikimedia.org/r/577278

Change 577297 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove statistics-admins and statistics-web-admins from Analytics

https://gerrit.wikimedia.org/r/577297

Change 577297 merged by Elukey:
[operations/puppet@production] Remove statistics-admins and statistics-web-admins from Analytics

https://gerrit.wikimedia.org/r/577297

Change 577309 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove profiles from stat100[6,7]'s roles not used anymore

https://gerrit.wikimedia.org/r/577309

Change 577309 merged by Elukey:
[operations/puppet@production] Remove profiles from stat100[6,7]'s roles not used anymore

https://gerrit.wikimedia.org/r/577309

Change 578481 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::compute: deploy mysql credentials only when needed

https://gerrit.wikimedia.org/r/578481

Change 578481 merged by Elukey:
[operations/puppet@production] statistics::compute: deploy mysql credentials only when needed

https://gerrit.wikimedia.org/r/578481

Change 578483 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::mysql_credentials: use require instead of defined

https://gerrit.wikimedia.org/r/578483

Change 578483 merged by Elukey:
[operations/puppet@production] statistics::mysql_credentials: use require instead of defined

https://gerrit.wikimedia.org/r/578483

Change 578484 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce profile::statistics::eventlogging_rsync

https://gerrit.wikimedia.org/r/578484

Change 578484 merged by Elukey:
[operations/puppet@production] Introduce profile::statistics::eventlogging_rsync

https://gerrit.wikimedia.org/r/578484

Change 578535 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move stat1006 to role::statistics::explorer

https://gerrit.wikimedia.org/r/578535

Change 578535 merged by Elukey:
[operations/puppet@production] Move stat1006 to role::statistics::explorer

https://gerrit.wikimedia.org/r/578535

Change 578541 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::explorer: remove analytics keytab

https://gerrit.wikimedia.org/r/578541

Change 578541 merged by Elukey:
[operations/puppet@production] role::statistics::explorer: remove analytics keytab

https://gerrit.wikimedia.org/r/578541

OK, up to now stat100[4,5,6] have been unified under a single role, role::statistics::explorer. JupyterHub was also added.

Next steps:

  • move stat1007 to role::statistics::explorer
  • decommission notebook100[3,4]

elukey changed the task status from Stalled to Open. Mar 10 2020, 3:47 PM

Change 578783 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Follow up after Analytics client host refactoring

https://gerrit.wikimedia.org/r/578783

Change 578783 merged by Elukey:
[operations/puppet@production] Follow up after Analytics client host refactoring

https://gerrit.wikimedia.org/r/578783

Change 577278 merged by Elukey:
[operations/puppet@production] Add xmldumps to stat100[4,5]

https://gerrit.wikimedia.org/r/577278

Change 591311 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::explorer: move target hosts to hiera

https://gerrit.wikimedia.org/r/591311

Change 591311 merged by Elukey:
[operations/puppet@production] role::statistics::explorer: move target hosts to hiera

https://gerrit.wikimedia.org/r/591311

Everything is done except decommissioning the notebook hosts, which can be done separately (there is a subtask for it).