Unify puppet roles for stat and notebook hosts
Closed, Resolved, Public

Assigned To

Authored By

	elukey
	Jan 30 2020, 12:58 AM

Description

We currently have a wide range of puppet roles for Analytics clients:

stat1004 (role::statistics::explorer) - generic Hadoop client node, terabytes of disk space for users
stat1005 (role::statistics::explorer::gpu) - generic Hadoop client node, terabytes of disk space for users, GPU (card, drivers, tools, etc.)
stat1006 (role::statistics::cruncher) - generic data crunching node, no Hadoop client config deployed, terabytes of disk space for users, access to Eventlogging data, runs report updater jobs via systemd timers
stat1007 (role::statistics::private) - generic Hadoop client node, terabytes of disk space for users, Report updater jobs running via systemd timers, geoip backup systemd timer.
notebook100[3,4] (role::swap) - Jupyter Notebook hosts, low space on disk for users, originally meant to be an alternative way to access Hadoop/HDFS without storing any data on the host itself.

After the introduction of Kerberos, the differences between stat100[4,5,6,7] are small, so we could think about refactoring all roles into one. Open questions:

Where do we put Report Updater jobs, since other users need to access them? Should we deploy them only on some hosts, as configured via puppet or hiera?
Where do we run analytics-only timers/jobs?

Moreover, more people are using notebooks, and they have asked for more space on the hosts so they can also use them for local computations (not only as Hadoop clients). We could think about unifying the stat-related roles with role::swap, so that every stat box would also have a Jupyter server on it, and drop support for notebook hosts.
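
As a rough sketch, the host descriptions above can be written as feature sets to see what the stat roles share and where they differ. The feature labels below are illustrative names chosen for this example, not puppet classes or hiera keys, and the notebook hosts are left out.

```python
# Feature sets transcribed from the host list in the description.
# The labels are illustrative, not real puppet/hiera identifiers.
HOST_FEATURES = {
    "stat1004": {"hadoop-client", "large-disk"},
    "stat1005": {"hadoop-client", "large-disk", "gpu"},
    "stat1006": {"large-disk", "eventlogging-data", "report-updater-timers"},
    "stat1007": {
        "hadoop-client",
        "large-disk",
        "report-updater-timers",
        "geoip-backup-timer",
    },
}


def common_features(hosts=None):
    """Features shared by every host in `hosts` (default: all stat hosts)."""
    hosts = list(hosts) if hosts is not None else list(HOST_FEATURES)
    return set.intersection(*(HOST_FEATURES[h] for h in hosts))


def host_specific(host):
    """Features of `host` that are not shared by all stat hosts."""
    return HOST_FEATURES[host] - common_features()


if __name__ == "__main__":
    print("shared by all:", sorted(common_features()))
    for name in sorted(HOST_FEATURES):
        print(name, "adds:", sorted(host_specific(name)))
```

Whatever ends up in the per-host sets is what a single unified role would have to switch on or off per host, or move elsewhere.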

Details

Subject	Repo	Branch	Lines +/-
role::statistics::explorer: move target hosts to hiera	operations/puppet	production	+19 -15
Add xmldumps to stat100[4,5]	operations/puppet	production	+3 -1
Follow up after Analytics client host refactoring	operations/puppet	production	+2 -1
role::statistics::explorer: remove analytics keytab	operations/puppet	production	+0 -4
Move stat1006 to role::statistics::explorer	operations/puppet	production	+11 -56
Introduce profile::statistics::eventlogging_rsync	operations/puppet	production	+16 -9
statistics::mysql_credentials: use require instead of defined	operations/puppet	production	+9 -11
statistics::compute: deploy mysql credentials only when needed	operations/puppet	production	+8 -10
Remove profiles from stat100[6,7]'s roles not used anymore	operations/puppet	production	+0 -9
Remove statistics-admins and statistics-web-admins from Analytics	operations/puppet	production	+2 -6
Ensure readability settings for home dirs of Analytics clients	operations/puppet	production	+40 -0
role::analytics_cluster::launcher: add statistics xml dataset mounts	operations/puppet	production	+2 -0
Move import_wikidata_entities_dumps timers to an-launcher1001	operations/puppet	production	+5 -0
Move import_mediawiki_dumps timers from stat1007 to an-launcher1001	operations/puppet	production	+8 -0
Move all Report Updater Jobs to an-launcher1001	operations/puppet	production	+123 -67
Add an-launcher1001 to profile::dumps::distribution	operations/puppet	production	+1 -0
Add an-launcher1001 to the list of statistics servers	operations/puppet	production	+1 -0
role::statistics::private: remove rsync to /mnt/hdfs	operations/puppet	production	+0 -28
reportupdate::job: use kerberos when needed	operations/puppet	production	+26 -21
role::analytics_cluster::launcher: add kerberos settings for hive	operations/puppet	production	+7 -0
role::analytics_cluster::launcher: add hdfs RU jobs	operations/puppet	production	+42 -41
role::analytics_cluster::launcher: add git proxy config for Analytics vlan	operations/puppet	production	+2 -0
role::analytics_cluster::launcher: add Analytics Refinery scap repo	operations/puppet	production	+3 -0
role::analytics_cluster::launcher: add kerberos and base profiles	operations/puppet	production	+6 -0
role::analytics_cluster::launcher: add Hadoop common hiera configuration	operations/puppet	production	+20 -0
Add a new Analytics role to an-launcher1001	operations/puppet	production	+31 -2
Refactor statistics mountpoints to be included in all stat roles	operations/puppet	production	+20 -15
Unify stat1004's and stat1005's roles into one	operations/puppet	production	+16 -83
role::statistics::explorer: remove config from hiera	operations/puppet	production	+0 -3

Related Objects

Status	Assigned	Task
Resolved	odimitrijevic	T240437 Analytics Ops Technical Debt
Resolved	elukey	T243934 Unify puppet roles for stat and notebook hosts
Resolved	elukey	T244717 Create a ganeti VM in eqiad: an-launcher1001
Resolved	elukey	T245179 Add SWAP profile to stat1005
Resolved	elukey	T246578 Refactor Analytics POSIX groups in puppet to improve maintainability
Resolved	elukey	T249752 Decomission notebook hosts
Resolved	elukey	T249754 Unify stat1007 puppet role with the rest of the stats cluster

Event Timeline

There are a very large number of changes, so older changes are hidden.

Maintenance_bot removed a project: Patch-For-Review. Feb 20 2020, 7:10 PM

High-level idea for simplifying https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups:

Since we have Kerberos authentication, I'd propose that we just decide on a set of groups to deploy to all stat boxes.
research and statistics-privatedata-users seem to overlap a lot in terms of what they grant; should we just deprecate research and add the missing members to statistics-privatedata-users?
analytics-users no longer seems valid and has very few users; I'd propose following up with them and either moving them to privatedata or removing them from the group, so that we can finally deprecate analytics-users.
statistics-users may be kept, even though only a few of its users are not in any privatedata group.

So to recap the proposal, we'd just keep three groups:

statistics-users - access to all stat boxes, no privatedata of any sort
statistics-privatedata-users - access to all stat boxes, mysql wiki shards plus some privatedata logs
analytics-privatedata-users - access to Hadoop + privatedata + all stat boxes

Corner case: a user without privatedata permissions could read PII data that a user with privatedata privileges has downloaded into their home directory, if that directory lacks proper permissions. In theory, we could try to enforce rules for home dir permissions.
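
A minimal sketch of what such a check could look like, assuming the policy is "home directories are accessible by their owner only". This is an illustration, not the puppet change later merged for this task; the function names and the 0700 target mode are assumptions.

```python
from __future__ import annotations

import os
import stat
from pathlib import Path


def exposed_home_dirs(base: Path) -> list[Path]:
    """Return directories under `base` that grant any group/other permission."""
    exposed = []
    for entry in sorted(base.iterdir()):
        if not entry.is_dir():
            continue
        mode = stat.S_IMODE(entry.stat().st_mode)
        # Any group or other bit means someone besides the owner may be
        # able to list, read, or enter the directory.
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            exposed.append(entry)
    return exposed


def restrict(home: Path) -> None:
    """Hypothetical remediation: owner-only access (mode 0700)."""
    os.chmod(home, 0o700)
```

Running `exposed_home_dirs(Path("/home"))` on a stat box would list the candidates; whether to fix them automatically or just report them is a separate policy decision.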

should we just deprecate research and add the missing members to statistics-privatedata-users?
analytics-users no longer seems valid and has very few users; I'd propose following up with them and either moving them to privatedata or removing them from the group, so that we can finally deprecate analytics-users.
statistics-users may be kept, even though only a few of its users are not in any privatedata group.

I'm fine with both of these ideas, but here's another. Should we just merge statistics-privatedata-users, research, and statistics-users into analytics-users? The statistics-privatedata-users stuff was about access to files stored locally on stat1007. Some of those still exist: eventlogging logs. I'm not sure they are actually needed. Even if they are, we can chown them to analytics-privatedata-users and just use that access group, like we do in HDFS, to restrict access.

Then we'd just have:

analytics-users - all stat boxes + mysql analytics dbs
analytics-privatedata-users - all stat boxes + Hadoop (via kerberos) + privatedata
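
As a rough illustration of the two-group model (not actual puppet or admin data), the grants above can be written as sets, with a user's effective access being the union over their groups. The resource labels are made up for this example.

```python
# Grants copied from the two-group proposal; resource labels are illustrative.
GROUP_GRANTS = {
    "analytics-users": {"stat-boxes", "mysql-analytics-dbs"},
    "analytics-privatedata-users": {"stat-boxes", "hadoop", "privatedata"},
}


def effective_access(groups):
    """Union of the grants of every known group the user belongs to."""
    access = set()
    for group in groups:
        # Unknown (e.g. deprecated) groups grant nothing in this sketch.
        access |= GROUP_GRANTS.get(group, set())
    return access


if __name__ == "__main__":
    print(sorted(effective_access(["analytics-users"])))
    print(sorted(effective_access(["analytics-privatedata-users"])))
```

Writing it this way makes it easy to see which resources each group name actually implies.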

Even better, yes. I thought that some use cases still had to be supported (I always misremember stat-related stuff). Two groups will be way better!

elukey changed the status of subtask T244717: Create a ganeti VM in eqiad: an-launcher1001 from Stalled to Open. Feb 21 2020, 3:57 PM

elukey closed subtask T244717: Create a ganeti VM in eqiad: an-launcher1001 as Resolved. Feb 21 2020, 4:42 PM

Change 574032 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add a new Analytics role to an-launcher1001

https://gerrit.wikimedia.org/r/574032

gerritbot added a project: Patch-For-Review. Feb 21 2020, 4:44 PM

Change 574032 merged by Elukey:
[operations/puppet@production] Add a new Analytics role to an-launcher1001

https://gerrit.wikimedia.org/r/574032

Change 574038 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add Hadoop common hiera configuration

https://gerrit.wikimedia.org/r/574038

nshahquinn-wmf awarded a token. Feb 21 2020, 5:12 PM

Change 574038 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add Hadoop common hiera configuration

https://gerrit.wikimedia.org/r/574038

Change 574042 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos and base profiles

https://gerrit.wikimedia.org/r/574042

Change 574042 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos and base profiles

https://gerrit.wikimedia.org/r/574042

Maintenance_bot removed a project: Patch-For-Review. Feb 21 2020, 6:10 PM

Change 574289 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add Analytics Refinery scap repo

https://gerrit.wikimedia.org/r/574289

gerritbot added a project: Patch-For-Review. Feb 24 2020, 7:43 AM

Change 574289 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add Analytics Refinery scap repo

https://gerrit.wikimedia.org/r/574289

Maintenance_bot removed a project: Patch-For-Review. Feb 24 2020, 8:10 AM

Change 574379 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add git proxy config for Analytics vlan

https://gerrit.wikimedia.org/r/574379

gerritbot added a project: Patch-For-Review. Feb 24 2020, 8:24 AM

Change 574379 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add git proxy config for Analytics vlan

https://gerrit.wikimedia.org/r/574379

Change 574385 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add hdfs RU jobs

https://gerrit.wikimedia.org/r/574385

Change 574385 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add hdfs RU jobs

https://gerrit.wikimedia.org/r/574385

Maintenance_bot removed a project: Patch-For-Review. Feb 24 2020, 7:10 PM

Change 574722 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move all Report Updater Jobs to an-launcher1001

https://gerrit.wikimedia.org/r/574722

gerritbot added a project: Patch-For-Review. Feb 25 2020, 11:55 AM

Change 574780 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos settings for hive

https://gerrit.wikimedia.org/r/574780

Change 574780 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add kerberos settings for hive

https://gerrit.wikimedia.org/r/574780

Change 574786 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] reportupdate::job: use kerberos when needed

https://gerrit.wikimedia.org/r/574786

Change 574786 merged by Elukey:
[operations/puppet@production] reportupdate::job: use kerberos when needed

https://gerrit.wikimedia.org/r/574786

Change 574795 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::private: remove rsync to /mnt/hdfs

https://gerrit.wikimedia.org/r/574795

Change 574795 merged by Elukey:
[operations/puppet@production] role::statistics::private: remove rsync to /mnt/hdfs

https://gerrit.wikimedia.org/r/574795

Change 574843 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-launcher1001 to the list of statistics servers

https://gerrit.wikimedia.org/r/574843

Nuria added a project: Analytics-Kanban. Feb 26 2020, 3:42 AM

Change 574843 merged by Elukey:
[operations/puppet@production] Add an-launcher1001 to the list of statistics servers

https://gerrit.wikimedia.org/r/574843

Change 575048 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add an-launcher1001 to profile::dumps::distribution

https://gerrit.wikimedia.org/r/575048

Change 575048 merged by Elukey:
[operations/puppet@production] Add an-launcher1001 to profile::dumps::distribution

https://gerrit.wikimedia.org/r/575048

Change 574722 merged by Elukey:
[operations/puppet@production] Move all Report Updater Jobs to an-launcher1001

https://gerrit.wikimedia.org/r/574722

Maintenance_bot removed a project: Patch-For-Review. Feb 27 2020, 8:10 PM

Change 575470 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_mediawiki_dumps timers from stat1007 to an-launcher1001

https://gerrit.wikimedia.org/r/575470

gerritbot added a project: Patch-For-Review. Feb 28 2020, 8:27 AM

Change 575470 merged by Elukey:
[operations/puppet@production] Move import_mediawiki_dumps timers from stat1007 to an-launcher1001

https://gerrit.wikimedia.org/r/575470

Change 575476 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move import_wikidata_entities_dumps timers to an-launcher1001

https://gerrit.wikimedia.org/r/575476

Change 575476 merged by Elukey:
[operations/puppet@production] Move import_wikidata_entities_dumps timers to an-launcher1001

https://gerrit.wikimedia.org/r/575476

Change 575488 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::launcher: add statistics xml dataset mounts

https://gerrit.wikimedia.org/r/575488

Change 575488 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::launcher: add statistics xml dataset mounts

https://gerrit.wikimedia.org/r/575488

Maintenance_bot removed a project: Patch-For-Review. Feb 28 2020, 11:10 AM

The next step is to complete T246578; I'm setting this task to stalled in the meantime.

elukey mentioned this in T245833: Enable layered data-access and sharing for a new form of collaboration. Mar 1 2020, 4:23 PM

In T243934#5906856, @Ottomata wrote:

Then we'd just have:

analytics-users - all stat boxes + mysql analytics dbs

analytics-privatedata-users - all stat boxes + Hadoop (via kerberos) + privatedata

I'm very excited about the simplifications you're planning here!

However, this naming convention would imply that analytics-users doesn't grant access to sensitive data, even though it would include access to the internal wiki replicas, which include:

editor IP addresses in the cu_changes table
editor email addresses in the users table
revision-deleted usernames and comments in the archive and revision tables

In T243934#5906301, @elukey wrote:

So to recap the proposal, we'd just keep three groups:

statistics-users - access to all stat boxes, no privatedata of any sort

This isn't misleading, so that's good! If you don't have access to internal Hadoop or MariaDB or whatever private logs there are, you really don't have any private data access. However, if we don't want to give a user any private data, why give them production access in the first place? Cloud Services provides plenty of tools for analysis that doesn't rely on private data.

In T243934#5906301, @elukey wrote:

analytics-users no longer seems valid and has very few users; I'd propose following up with them and either moving them to privatedata or removing them from the group, so that we can finally deprecate analytics-users.

statistics-users may be kept, even though only a few of its users are not in any privatedata group.

Maybe this is the answer to my question: there is very little point to production access without private data, so we just aren't using those groups. 😁

In that case, why not go all the way to a single group for analytics users which includes private data access? Perhaps in the future, we'll start to genuinely segregate data based on sensitivity (I see it's being discussed in T245833), and of course if we do that, we can introduce a new tier. But currently, it doesn't seem like there's much point.

In T243934#5932167, @nshahquinn-wmf wrote:

In T243934#5906856, @Ottomata wrote:

Then we'd just have:

analytics-users - all stat boxes + mysql analytics dbs

analytics-privatedata-users - all stat boxes + Hadoop (via kerberos) + privatedata

I'm very excited about the simplifications you're planning here!

However, this naming convention would imply that analytics-users doesn't grant access to sensitive data, even though it would include access to the internal wiki replicas, which include:

editor IP addresses in the cu_changes table

editor email addresses in the users table

revision-deleted usernames and comments in the archive and revision tables

You are right. The above simplification was only an initial idea; the final proposal is in T246578. Indeed, we are going to limit access to the wiki replicas to analytics-privatedata-users :)

In T243934#5906301, @elukey wrote:

So to recap the proposal, we'd just keep three groups:

statistics-users - access to all stat boxes, no privatedata of any sort

This isn't misleading, so that's good! If you don't have access to internal Hadoop or MariaDB or whatever private logs there are, you really don't have any private data access. However, if we don't want to give a user any private data, why give them production access in the first place? Cloud Services provides plenty of tools for analysis that doesn't rely on private data.

In T243934#5906301, @elukey wrote:

analytics-users no longer seems valid and has very few users; I'd propose following up with them and either moving them to privatedata or removing them from the group, so that we can finally deprecate analytics-users.

statistics-users may be kept, even though only a few of its users are not in any privatedata group.

Maybe this is the answer to my question: there is very little point to production access without private data, so we just aren't using those groups. 😁

In that case, why not go all the way to a single group for analytics users which includes private data access? Perhaps in the future, we'll start to genuinely segregate data based on sensitivity (I see it's being discussed in T245833), and of course if we do that, we can introduce a new tier. But currently, it doesn't seem like there's much point.

There are other use cases for the stat boxes that often don't involve private data at all. One example is training a model on GPUs with TensorFlow and public data. Maybe in the future we could think about a private-data-only setup, but for the moment it seems that we'd cut off some important use cases :)

Change 576384 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Ensure readability settings for home dirs of Analytics clients

https://gerrit.wikimedia.org/r/576384

gerritbot added a project: Patch-For-Review. Mar 3 2020, 4:30 PM

Change 576384 merged by Elukey:
[operations/puppet@production] Ensure readability settings for home dirs of Analytics clients

https://gerrit.wikimedia.org/r/576384

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2020, 2:10 PM

Change 577278 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Add xmldumps to stat100[4,5]

https://gerrit.wikimedia.org/r/577278

gerritbot added a project: Patch-For-Review. Mar 5 2020, 4:21 PM

Change 577297 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove statistics-admins and statistics-web-admins from Analytics

https://gerrit.wikimedia.org/r/577297

Change 577297 merged by Elukey:
[operations/puppet@production] Remove statistics-admins and statistics-web-admins from Analytics

https://gerrit.wikimedia.org/r/577297

elukey added a subtask: T246578: Refactor Analytics POSIX groups in puppet to improve maintainability. Mar 5 2020, 5:47 PM

Change 577309 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Remove profiles from stat100[6,7]'s roles not used anymore

https://gerrit.wikimedia.org/r/577309

Change 577309 merged by Elukey:
[operations/puppet@production] Remove profiles from stat100[6,7]'s roles not used anymore

https://gerrit.wikimedia.org/r/577309

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Mar 6 2020, 7:39 AM

Change 578481 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::compute: deploy mysql credentials only when needed

https://gerrit.wikimedia.org/r/578481

Change 578481 merged by Elukey:
[operations/puppet@production] statistics::compute: deploy mysql credentials only when needed

https://gerrit.wikimedia.org/r/578481

Change 578483 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::mysql_credentials: use require instead of defined

https://gerrit.wikimedia.org/r/578483

Change 578483 merged by Elukey:
[operations/puppet@production] statistics::mysql_credentials: use require instead of defined

https://gerrit.wikimedia.org/r/578483

Change 578484 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Introduce profile::statistics::eventlogging_rsync

https://gerrit.wikimedia.org/r/578484

Change 578484 merged by Elukey:
[operations/puppet@production] Introduce profile::statistics::eventlogging_rsync

https://gerrit.wikimedia.org/r/578484

Change 578535 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Move stat1006 to role::statistics::explorer

https://gerrit.wikimedia.org/r/578535

Change 578535 merged by Elukey:
[operations/puppet@production] Move stat1006 to role::statistics::explorer

https://gerrit.wikimedia.org/r/578535

Change 578541 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::explorer: remove analytics keytab

https://gerrit.wikimedia.org/r/578541

Change 578541 merged by Elukey:
[operations/puppet@production] role::statistics::explorer: remove analytics keytab

https://gerrit.wikimedia.org/r/578541

OK, so far stat100[4,5,6] have been unified under a single role, role::statistics::explorer. JupyterHub was also added.

Next steps:

move stat1007 to role::statistics::explorer
decommission notebook100[3,4]

elukey changed the task status from Stalled to Open. Mar 10 2020, 3:47 PM

Change 578783 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Follow up after Analytics client host refactoring

https://gerrit.wikimedia.org/r/578783

Change 578783 merged by Elukey:
[operations/puppet@production] Follow up after Analytics client host refactoring

https://gerrit.wikimedia.org/r/578783

Change 577278 merged by Elukey:
[operations/puppet@production] Add xmldumps to stat100[4,5]

https://gerrit.wikimedia.org/r/577278

Maintenance_bot removed a project: Patch-For-Review. Mar 11 2020, 1:10 PM

elukey moved this task from In Progress to Paused on the Analytics-Kanban board. Mar 20 2020, 7:33 AM

Nuria closed subtask T246578: Refactor Analytics POSIX groups in puppet to improve maintainability as Resolved. Apr 9 2020, 5:07 PM

Change 591311 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::statistics::explorer: move target hosts to hiera

https://gerrit.wikimedia.org/r/591311

gerritbot added a project: Patch-For-Review. Apr 21 2020, 9:38 AM

Change 591311 merged by Elukey:
[operations/puppet@production] role::statistics::explorer: move target hosts to hiera

https://gerrit.wikimedia.org/r/591311

Maintenance_bot removed a project: Patch-For-Review. Apr 21 2020, 12:10 PM

elukey changed the status of subtask T249752: Decomission notebook hosts from Open to Stalled. Apr 29 2020, 9:11 AM

elukey changed the status of subtask T249752: Decomission notebook hosts from Stalled to Open. May 7 2020, 2:54 PM

Nuria closed subtask T245179: Add SWAP profile to stat1005 as Resolved. May 14 2020, 2:41 PM

Nuria closed subtask T249754: Unify stat1007 puppet role with the rest of the stats cluster as Resolved. May 14 2020, 2:47 PM

Everything is done except decommissioning the notebook hosts, which can be done separately (there is a subtask for it).

elukey set Final Story Points to 21. May 14 2020, 2:49 PM

Nuria moved this task from Paused to Parent Tasks on the Analytics-Kanban board. May 14 2020, 2:50 PM

elukey moved this task from Parent Tasks to Done on the Analytics-Kanban board. May 14 2020, 2:50 PM

Nuria closed this task as Resolved. Jun 3 2020, 2:57 PM

Nuria closed subtask T249752: Decomission notebook hosts as Resolved. Jul 6 2020, 5:56 PM
