Cassandra test cluster as a staged pathway to production for image suggestions data pipelines
Open, Medium, Public

Description

The final step of the Image-Suggestions and Section-Level-Image-Suggestions data pipelines is to transfer the output datasets from Hive to Cassandra.
A test Cassandra instance would be useful to minimize risks to the production one, such as T317364: [M] Stop unbounded image suggestions dataset growth and clean up legacy results.

Requirements

  • create a test keyspace
  • define its schema
  • insert into & truncate tables in the given keyspace
  • run INSERT statements, see here for a production example; this is done through a Spark job (see the sketch after this list)
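
A minimal sketch of the first three steps, assuming hypothetical keyspace, table, and host names (the real schema would mirror the production image-suggestions tables):

```python
from cassandra.cluster import Cluster

# Connect to the test cluster; the contact point is a placeholder.
cluster = Cluster(["cassandra-test.example.wmnet"])
session = cluster.connect()

# 1. Create a test keyspace (replication settings are illustrative only).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS image_suggestions_test
    WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': 3}
""")

# 2. Define its schema (columns are hypothetical).
session.execute("""
    CREATE TABLE IF NOT EXISTS image_suggestions_test.suggestions (
        wiki       text,
        page_id    int,
        image      text,
        confidence float,
        PRIMARY KEY ((wiki, page_id), image)
    )
""")

# 3. Insert into & truncate tables in the keyspace.
session.execute(
    "INSERT INTO image_suggestions_test.suggestions (wiki, page_id, image, confidence) "
    "VALUES (%s, %s, %s, %s)",
    ("enwiki", 123, "Example.jpg", 0.9),
)
session.execute("TRUNCATE image_suggestions_test.suggestions")
```

The bulk INSERTs of the last step run through Spark; with the DataStax spark-cassandra-connector on the classpath, a write from a Hive table would look roughly like:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the pipeline output from Hive (table name is hypothetical).
df = spark.sql("SELECT * FROM analytics.image_suggestions_output")

# Bulk-insert into the test keyspace via the spark-cassandra-connector.
(df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="image_suggestions_test", table="suggestions")
    .mode("append")
    .save())
```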

Event Timeline

If we're treating this as a staging environment, wouldn't the process for creating and/or altering schema, and the application of any grants, match that of production? Part of what we're staging/testing/validating would be the application of schema changes, and proper functionality under the permissions we intend to use in production.
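
For illustration, that would mean applying the same kind of role grants in the test keyspace as in production, along the lines of (role names are hypothetical; the production grants would be the template):

```python
# Grant write and read access to the hypothetical pipeline roles.
session.execute("GRANT MODIFY ON KEYSPACE image_suggestions_test TO image_suggestions_writer")
session.execute("GRANT SELECT ON KEYSPACE image_suggestions_test TO image_suggestions_reader")
```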

Presumably there is also some need for Cassandra access at earlier stages of development, though I'm not sure what would work best for that (or what alternatives, if any, might exist). Can you elaborate on the constraints here? For example, I'm guessing at least one challenge is corresponding access to other systems in the analytics cluster, as well as realistic data.


Also, a couple of concerns we'll need to address:

First, the cluster available for this (cassandra-dev) is located in codfw while the analytics cluster is in eqiad, so data would need to transit between data centers. We'll need to make sure this is OK, establish a reasonable upper bound on throughput, and have the means to enforce it.
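
Assuming the Spark job writes through the spark-cassandra-connector, one place such a bound could be enforced is the connector's own write-throttling settings, e.g.:

```python
from pyspark.sql import SparkSession

# Cap write throughput per executor core and limit in-flight batches;
# setting names per connector 3.x, values purely illustrative.
spark = (SparkSession.builder
         .config("spark.cassandra.output.throughputMBPerSec", "5")
         .config("spark.cassandra.output.concurrent.writes", "2")
         .getOrCreate())
```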

Second: cassandra-dev is quite a bit smaller than the production AQS cluster, so we'll need to be able to adjust both the quantity of data stored there and the read/write throughput accordingly. Ideally we'd establish the capacity differential as a multiplier (say 0.25), and a configurable setting on your end would scale dataset size and concurrency(?) by that factor.
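
As a sketch of what that multiplier could look like on the pipeline side (SCALE, the table name, and the production setting are assumptions, not an agreed interface):

```python
from pyspark.sql import SparkSession

SCALE = 0.25  # capacity differential vs. the production AQS cluster

prod_concurrent_writes = 8  # placeholder for the production setting
test_concurrent_writes = max(1, int(prod_concurrent_writes * SCALE))

# Scale write concurrency by the multiplier...
spark = (SparkSession.builder
         .config("spark.cassandra.output.concurrent.writes",
                 str(test_concurrent_writes))
         .getOrCreate())

# ...and downsample the dataset by the same factor before loading.
df = spark.sql("SELECT * FROM analytics.image_suggestions_output")
sampled = df.sample(fraction=SCALE, seed=42)
```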

Thoughts?

Eevans triaged this task as Medium priority. Apr 5 2024, 8:45 PM