Cassandra test cluster as a staged pathway to production for image suggestions data pipelines
Open, Medium, Public

Description

The final step of the Image-Suggestions and Section-Level-Image-Suggestions data pipelines is to transfer the output datasets from Hive to Cassandra.
A test Cassandra instance would be useful to minimize risks to the production one, such as T317364: [M] Stop unbounded image suggestions dataset growth and clean up legacy results.

Requirements

  • create a test keyspace
  • define its schema
  • insert into & truncate tables in the given keyspace
  • run INSERT statements, see here for a production example; this is done through a Spark job (see the sketch after this list)
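
A minimal sketch of the first three steps, assuming hypothetical keyspace, table, and host names (the real schema would mirror the production image-suggestions tables):

```python
from cassandra.cluster import Cluster

# Connect to the test cluster; the contact point is a placeholder.
cluster = Cluster(["cassandra-test.example.wmnet"])
session = cluster.connect()

# 1. Create a test keyspace (replication settings are illustrative only).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS image_suggestions_test
    WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': 3}
""")

# 2. Define its schema (columns are hypothetical).
session.execute("""
    CREATE TABLE IF NOT EXISTS image_suggestions_test.suggestions (
        wiki       text,
        page_id    int,
        image      text,
        confidence float,
        PRIMARY KEY ((wiki, page_id), image)
    )
""")

# 3. Insert into & truncate tables in the keyspace.
session.execute(
    "INSERT INTO image_suggestions_test.suggestions (wiki, page_id, image, confidence) "
    "VALUES (%s, %s, %s, %s)",
    ("enwiki", 123, "Example.jpg", 0.9),
)
session.execute("TRUNCATE image_suggestions_test.suggestions")
```

The bulk INSERTs of the last step run through Spark; with the DataStax spark-cassandra-connector on the classpath, a write from a Hive table would look roughly like:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the pipeline output from Hive (table name is hypothetical).
df = spark.sql("SELECT * FROM analytics.image_suggestions_output")

# Bulk-insert into the test keyspace via the spark-cassandra-connector.
(df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="image_suggestions_test", table="suggestions")
    .mode("append")
    .save())
```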

Event Timeline

If we're treating this as a staging environment, wouldn't the process for creating and/or altering schema, and the application of any grants, match that of production? Part of what we're staging/testing/validating would be the application of schema changes, and proper functionality under the permissions we intend to use in production.
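
For illustration, that would mean applying the same kind of role grants in the test keyspace as in production, along the lines of (role names are hypothetical; the production grants would be the template):

```python
# Grant write and read access to the hypothetical pipeline roles.
session.execute("GRANT MODIFY ON KEYSPACE image_suggestions_test TO image_suggestions_writer")
session.execute("GRANT SELECT ON KEYSPACE image_suggestions_test TO image_suggestions_reader")
```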

Presumably there is also some need for Cassandra access at earlier stages of development, though I'm not sure what would work best for that (or what alternatives, if any, might exist). Can you elaborate on the constraints here? For example, I'm guessing at least one challenge is corresponding access to other systems in the analytics cluster, as well as realistic data.


Also, a couple of concerns we'll need to address:

First, the cluster available for this (cassandra-dev) is located in codfw while the analytics cluster is in eqiad, so data would need to transit between data centers. We'll need to make sure this is OK, establish a reasonable upper bound on throughput, and have the means to enforce it.
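
Assuming the Spark job writes through the spark-cassandra-connector, one place such a bound could be enforced is the connector's own write-throttling settings, e.g.:

```python
from pyspark.sql import SparkSession

# Cap write throughput per executor core and limit in-flight batches;
# setting names per connector 3.x, values purely illustrative.
spark = (SparkSession.builder
         .config("spark.cassandra.output.throughputMBPerSec", "5")
         .config("spark.cassandra.output.concurrent.writes", "2")
         .getOrCreate())
```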

Second: cassandra-dev is quite a bit smaller than the production AQS cluster, so we'll need to be able to adjust both the quantity of data stored there and the read/write throughput accordingly. Ideally we'd establish the capacity differential as a multiplier (say 0.25), and a configurable setting on your end would scale dataset size and concurrency(?) by that factor.
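
As a sketch of what that multiplier could look like on the pipeline side (SCALE, the table name, and the production setting are assumptions, not an agreed interface):

```python
from pyspark.sql import SparkSession

SCALE = 0.25  # capacity differential vs. the production AQS cluster

prod_concurrent_writes = 8  # placeholder for the production setting
test_concurrent_writes = max(1, int(prod_concurrent_writes * SCALE))

# Scale write concurrency by the multiplier...
spark = (SparkSession.builder
         .config("spark.cassandra.output.concurrent.writes",
                 str(test_concurrent_writes))
         .getOrCreate())

# ...and downsample the dataset by the same factor before loading.
df = spark.sql("SELECT * FROM analytics.image_suggestions_output")
sampled = df.sample(fraction=SCALE, seed=42)
```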

Thoughts?

Eevans triaged this task as Medium priority. Apr 5 2024, 8:45 PM