Provide developer access to the cassandra-dev cluster
Open, Medium, Public

Description

The cassandra-dev cluster is a 3-host, 6-instance Cassandra cluster in the production network. We use it to stage changes, both to environments (Cassandra, JVM, etc), and services (sessionstore, echo last-modified timestamps, page content service, etc). For its role in the latter, it makes sense that we provide developer access via cqlsh.

I propose we accomplish this by following examples set in modules/admin/data/data.yaml, with a group and corresponding sudo privileges. The sudoers rule would allow the invocation of cqlsh, either directly, or via a wrapper.
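A minimal sketch of the wrapper idea, assuming a surrogate system account (called `cassandra_dev` here purely for illustration) and a sudoers rule that lets group members run `cqlsh` as that account. The host name, port default, and function names are placeholders, not settled details.

```python
import shlex
import subprocess

# Hypothetical names: the surrogate account is an assumption for illustration.
SURROGATE_USER = "cassandra_dev"
DEFAULT_PORT = 9042  # Cassandra's default native-protocol port


def build_cqlsh_command(host, port=DEFAULT_PORT, user=SURROGATE_USER, ssl=True):
    """Return the argv list for running cqlsh as the surrogate user via sudo."""
    argv = ["sudo", "-u", user, "cqlsh"]
    if ssl:
        argv.append("--ssl")
    argv += [host, str(port)]
    return argv


def run_cqlsh(host, **kwargs):
    """Start an interactive cqlsh session and return its exit status."""
    return subprocess.call(build_cqlsh_command(host, **kwargs))


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch is safe to try.
    print(shlex.join(build_cqlsh_command("cassandra-dev.example.org")))
```

Whether `~/.cassandra` resolves to the surrogate account's home under `sudo -u` depends on the local sudo configuration, so a real wrapper would need to account for that.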

Database credentials can be templated to ~/.cassandra/credentials (cqlsh defaults to reading them from here). I propose that we create a database user/role for this purpose, and that the grants (full set of grants TBD) be applied to all tables (meaning that everyone added to the group would have the same access, to all tables).
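The role and grants could be expressed roughly as follows. This sketch only generates CQL text; the role name, the permission set, and the use of `ALL KEYSPACES` are assumptions, since the full set of grants is still TBD.

```python
# Hypothetical defaults: the real role name and permission set are TBD.
DEFAULT_PERMISSIONS = ("SELECT", "MODIFY")


def role_ddl(role, password, permissions=DEFAULT_PERMISSIONS):
    """Return CQL statements that create a login role and grant it the
    given permissions on every keyspace (and so on every table)."""
    escaped = password.replace("'", "''")  # CQL escapes ' by doubling it
    statements = [
        f"CREATE ROLE IF NOT EXISTS {role} "
        f"WITH PASSWORD = '{escaped}' AND LOGIN = true;"
    ]
    for permission in permissions:
        statements.append(f"GRANT {permission} ON ALL KEYSPACES TO {role};")
    return statements


if __name__ == "__main__":
    for statement in role_ddl("dev_access", "not-a-real-password"):
        print(statement)
```

Because the grants are attached to one shared role, everyone added to the group ends up with identical access, which matches the proposal above.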

TBD: Create a dedicated system user for this? Use the cassandra user?

Event Timeline

Eevans renamed this task from Provider developer access to the cassandra-dev cluster to Provide developer access to the cassandra-dev cluster. Jan 24 2024, 4:02 AM

Presumably we want to restrict access somewhat beyond "everything the cassandra user can do"? At which point a separate user to sudo to seems like a sensible idea unless it's a lot of hassle...

+1, the blast radius for the cassandra user is fairly large.

Hey @Eevans is there any update on that? We are picking up the cassandra/PCS work and dev access would be useful to be in place to test things on staging.

Hi @Jgiannelos; We have consensus on how to handle this, it's just a matter of getting it done. I will try to make some time next week, failing that it'll probably be sometime later in March.

Change #1016899 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] (WIP) cassandra-dev: surrogate user for cqlsh dev access

https://gerrit.wikimedia.org/r/1016899

Hi, is there any update with dev access for PCS devs?

Some progress has been made, but it's been put on the back-burner a few times to deal with other priorities; Apologies, I hope to have it done soon.

I was asked to provide feedback from a MariaDB perspective (and on how consistent we want to be across different technologies within the same team).

We don't usually hand over dev accounts to the staging environment. Much development/staging work gets done on MariaDB instances outside of production, most notably the beta cluster (which has its own issues, but I assume setting up a dedicated project for Cassandra in Cloud VPS and giving access to that wouldn't be too hard). Given that it's outside of prod, the impact of mistakes or compromise is quite limited. It also discourages "testing in production" situations. I know the staging cluster has different data, but it's still in prod infra, with all the complexities and downsides that brings.

For a different use case: if people need to query production data manually, we have a dedicated mw script that allows broad access without needing to hand over credentials to each user. They just need to be added to an LDAP group, and then they can log in to mwmaint and run "sql enwiki".

If people want to run tests on actual production data to test large-scale changes, we have a "test-s4" cluster we give access to, but that's basically the last resort and we haven't used it in the past five years (for dev testing, that is; we test operations stuff on it all the time).

So my question here would be: what is the exact need? If you want to develop a system and you just need some test data, why not use Cloud VPS? If you need to access production data but not change it, then creating a general-purpose script with access limited to an LDAP group, and adding users to that group, would be more future-proof (and you could give read-only access to avoid disasters in production, and also never give access to sessionstore?).

Obviously, this is my limited observation from afar, based on a different ecosystem, and I might be missing a lot of obvious points here.

Ok, so to be clear: This ticket was the result of needs arising from the Generated Data (née AQS) cluster, and not the RESTBase cluster. @Jgiannelos —who may yet weigh in with their requirements— is awaiting the outcome of this issue in relation to the latter. The cassandra-dev cluster is hosting use-cases for both, and I hadn't planned on differentiating when it came to providing developer access.

For the Generated Data cluster, the data there is written by processes running in the Analytics cluster. Ideally, if you put a Cassandra cluster in beta, you'd also be replicating all the infrastructure that loads data. You'd also either want to use data that was entirely synthetic, or have some way of syncing a sanitized/anonymized subset of data from production. And maybe that's exactly what we should do, but that's a very complicated and expensive project that won't happen overnight.

What has been happening thus far is that testing of analytics jobs that load data into Cassandra has been done in the production environment, and when developer access was needed, that too has been provided in production. To be fair, the AQS cluster used to be single-tenant; it was owned and operated by the team who owned the data and services, so that made more sense. Now that it's multi-tenant, I opened this ticket to move those activities to cassandra-dev and to formalize (and hopefully improve) how that access is provided.

And again, all of the above rationale (the why) pertains to the Generated Data cluster, but I had thought to apply access consistently (even if the rationale is different).

For a different use case: if people need to query production data manually, we have a dedicated mw script that allows broad access without needing to hand over credentials to each user. They just need to be added to an LDAP group, and then they can log in to mwmaint and run "sql enwiki".

That's a wrapper around mwscript mysql.php ..., right? I should probably spend some time to better understand everything that script does, but it seems very application-specific, and I'm not sure it applies the same here (both with Cassandra, and the multi-tenant environment).

If people want to run tests on actual production data to test large-scale changes, we have a "test-s4" cluster we give access to, but that's basically the last resort and we haven't used it in the past five years (for dev testing, that is; we test operations stuff on it all the time).

That sounds more analogous to what we are/proposing here (sans the question of who has access).

Ok, so to be clear: This ticket was the result of needs arising from the Generated Data (née AQS) cluster, and not the RESTBase cluster. @Jgiannelos —who may yet weigh in with their requirements— is awaiting the outcome of this issue in relation to the latter. The cassandra-dev cluster is hosting use-cases for both, and I hadn't planned on differentiating when it came to providing developer access.

As I said, I'm not recommending any action. This is my limited observation from a MariaDB point of view; feel free to ignore it completely.

For the Generated Data cluster, the data there is written by processes running in the Analytics cluster. Ideally, if you put a Cassandra cluster in beta, you'd also be replicating all the infrastructure that loads data. You'd also either want to use data that was entirely synthetic, or have some way of syncing a sanitized/anonymized subset of data from production. And maybe that's exactly what we should do, but that's a very complicated and expensive project that won't happen overnight.

Building a generating AQS system is indeed a lot of work and I don't recommend that either.

In many cases I think some fake data would be enough. If you need something to test PCS, I assume you don't need the actual data. Of course, the team themselves might have a different idea (and it depends on the use case). I'd say we could ask them and see whether they could live with that or whether they need actual data to build and use.

The thing is that "test in production" and "test with production data" are recipes for all sorts of disasters (data breaches caused by compromises, data corruption, accidentally dropping tables, which has happened, etc.), and we should push towards not giving access easily, except as a last resort.

For a different use case: if people need to query production data manually, we have a dedicated mw script that allows broad access without needing to hand over credentials to each user. They just need to be added to an LDAP group, and then they can log in to mwmaint and run "sql enwiki".

That's a wrapper around mwscript mysql.php ..., right? I should probably spend some time to better understand everything that script does, but it seems very application-specific, and I'm not sure it applies the same here (both with Cassandra, and the multi-tenant environment).

Yes, but I'm not recommending doing that. I meant reusing the same concept and building an access wrapper that's more controlled and central, so you could make sure writes or data corruption won't happen, protect session store, etc.
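As a rough illustration of the controlled-wrapper idea, here is a minimal sketch of a check that refuses anything other than read-only statements before they are passed on. The allowlist and function names are assumptions made for this example. A string check like this is a convenience only, not a security boundary; the grants on the database role would still have to enforce read-only access.

```python
# Statement types treated as read-only in this sketch (an assumption).
READ_ONLY_KEYWORDS = {"SELECT", "DESCRIBE", "DESC"}


def is_read_only(statement):
    """Return True if the statement's first keyword is on the allowlist."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() in READ_ONLY_KEYWORDS


def check_statements(statements):
    """Return the statements as a list, or raise ValueError on the
    first one that is not read-only."""
    statements = list(statements)
    for statement in statements:
        if not is_read_only(statement):
            raise ValueError(f"refused (not read-only): {statement!r}")
    return statements


if __name__ == "__main__":
    print(check_statements(["SELECT * FROM ks.tbl LIMIT 1;"]))
```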

If people want to run tests on actual production data to test large-scale changes, we have a "test-s4" cluster we give access to, but that's basically the last resort and we haven't used it in the past five years (for dev testing, that is; we test operations stuff on it all the time).

That sounds more analogous to what we are/proposing here (sans the question of who has access).

It's quite rare in our case. I think we created a couple of db users and provided access there but it happens very very rarely (~once every half a decade). If you want to know more, we basically take a replica of actual data, cut the replication and then turn that into a test host.

Change #1024805 had a related patch set uploaded (by Eevans; author: Eevans):

[labs/private@master] cassandra: add (faux) password for cassandra-devel user

https://gerrit.wikimedia.org/r/1024805

Change #1024805 merged by Eevans:

[labs/private@master] cassandra: add (faux) password for cassandra-devel user

https://gerrit.wikimedia.org/r/1024805

Change #1016899 merged by Eevans:

[operations/puppet@production] cassandra-dev: surrogate user for cqlsh (dev access)

https://gerrit.wikimedia.org/r/1016899

Change #1024811 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev: ensure directory exists before adding files

https://gerrit.wikimedia.org/r/1024811

Change #1024811 merged by Eevans:

[operations/puppet@production] cassandra-dev: ensure directory exists before adding files

https://gerrit.wikimedia.org/r/1024811

Change #1024820 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: rename surrogate user

https://gerrit.wikimedia.org/r/1024820

Change #1024820 merged by Eevans:

[operations/puppet@production] cassandra_dev: rename surrogate user

https://gerrit.wikimedia.org/r/1024820

Change #1024821 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev: comment the cassandra_dev DDL (no-op change)

https://gerrit.wikimedia.org/r/1024821

Change #1024821 merged by Eevans:

[operations/puppet@production] cassandra-dev: comment the cassandra_dev DDL (no-op change)

https://gerrit.wikimedia.org/r/1024821

Change #1024822 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra-dev: use the correct `CREATE ROLE` syntax

https://gerrit.wikimedia.org/r/1024822

Change #1024822 merged by Eevans:

[operations/puppet@production] cassandra-dev: use the correct `CREATE ROLE` syntax

https://gerrit.wikimedia.org/r/1024822

Change #1024823 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: use correct path to credentials file

https://gerrit.wikimedia.org/r/1024823

Change #1024823 merged by Eevans:

[operations/puppet@production] cassandra_dev: use correct path to credentials file

https://gerrit.wikimedia.org/r/1024823

Change #1024826 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: fix permissions on credentials file

https://gerrit.wikimedia.org/r/1024826

Change #1024826 merged by Eevans:

[operations/puppet@production] cassandra_dev: fix permissions on credentials file

https://gerrit.wikimedia.org/r/1024826

Change #1024828 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: force ssl for cqlsh sessions (surrogate user)

https://gerrit.wikimedia.org/r/1024828

Change #1024828 merged by Eevans:

[operations/puppet@production] cassandra_dev: force ssl for cqlsh sessions (surrogate user)

https://gerrit.wikimedia.org/r/1024828