
📊 [PLACEHOLDER] We should implement a data loader for Cassandra
Closed, Resolved · Public

Description

Discussion points for grooming:

  • How do we want to load datasets from Airflow to Cassandra?
  • Do we want to recommend different loading methods based on dataset size?
  • Can we offer general Airflow components to do the loading? Can we abstract it, for example by passing a Pandas DataFrame to a common load function?
  • How do we handle/store access credentials in Airflow/ETL solutions?
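To make the abstraction question above concrete, here is a minimal sketch of what a common "DataFrame to Cassandra" load function could look like. The function and parameter names (`build_insert_cql`, `load_rows`) are hypothetical, not an existing component; rows are plain dicts, the shape `DataFrame.to_dict("records")` produces.

```python
from typing import Dict, Iterable, List


def build_insert_cql(keyspace: str, table: str, columns: List[str]) -> str:
    """Build a parameterized CQL INSERT statement for the given columns."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {keyspace}.{table} ({cols}) VALUES ({placeholders})"


def load_rows(session, keyspace: str, table: str, rows: Iterable[Dict]) -> int:
    """Insert rows (e.g. pandas DataFrame.to_dict('records')) through a
    driver session object that exposes execute(cql, params)."""
    count = 0
    for row in rows:
        cols = sorted(row)  # stable column order per row
        cql = build_insert_cql(keyspace, table, cols)
        session.execute(cql, [row[c] for c in cols])
        count += 1
    return count
```

A real implementation would likely batch or use prepared statements for throughput; this only illustrates the shape of the shared interface.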

In prod we likely won't rely on cqlsh to load data.

We'll need a scalable solution to load IMA data. Some implementation details will depend on how we'll access Cassandra from k8s. Discussion ongoing at https://phabricator.wikimedia.org/T280042.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 29 2021, 7:18 PM

How do we want to load datasets from Airflow to Cassandra?

While we think about this, let's also consider the more general case: How do we load from X to Y. Right now, X is probably Hadoop or Kafka. Y might be Cassandra, Hadoop, Kafka, MariaDB, Swift.

Flink has connectors for all these things. From reading the docs it seems like it should be possible to automate these integrations, as long as we have a standardized way of serializing and deserializing records from each of these sources (JSON+JSONSchema, Parquet, etc.). In practice this might not work, but we should probably explore whether it is possible.
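The "standardized serialization" idea can be illustrated with a tiny JSON round-trip guarded by a schema check. The schema and field names here are made up for illustration; a real setup would validate against an actual JSONSchema rather than this hand-rolled check.

```python
import json

# Illustrative stand-in for a JSONSchema: required fields and their types.
SCHEMA = {"required": ["wiki", "page_id"], "types": {"wiki": str, "page_id": int}}


def serialize(record: dict) -> bytes:
    """Serialize a record to JSON bytes, rejecting records missing required fields."""
    missing = [f for f in SCHEMA["required"] if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return json.dumps(record).encode("utf-8")


def deserialize(data: bytes) -> dict:
    """Deserialize JSON bytes, rejecting records with wrongly typed fields."""
    record = json.loads(data.decode("utf-8"))
    for field, typ in SCHEMA["types"].items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"bad type for {field}")
    return record
```

With a shared serde contract like this, a connector only needs to know the schema to move records between any source and sink.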

I'm wondering how to tie this into the notion of tenancy for backing stores (Cassandra, for the time being). For example: will we have a single tenant (read: a unique user with credentials et al.) for the data loading process, or will we have many (presumably one for each dataset being loaded)? In a world where a platform lets arbitrary teams own scheduled jobs that persist output to a backing store in a more-or-less self-service fashion, we would want to ensure that an aberrant change or misconfiguration in one job cannot inadvertently step on the data of another (which separate credentials and permissions would provide).

Based on the code linked in the description (HiveToCassandra.scala), I assume we're looking at the latter, and will need to create database roles and corresponding credentials that match the job configuration passed to the loader, is this correct?

Is it worth considering the former (at this time)? I can see benefits to this approach (simplicity, decoupling of storage, ...), but concede that it may be premature to go there at this time.

I assume we're looking at the latter, and will need to create database roles and corresponding credentials that match the job configuration passed to the loader, is this correct?

The loader takes a user and password as parameters, which allows both of the approaches you describe above (single- or multi-tenant). I think we should aim at a multi-tenant configuration, possibly one per team owning datasets? This would prevent errors like the ones you describe.
For the moment we use a single tenant (we were a single team loading data :)), and I'm supportive of not doing multi-tenant for now, as we're talking about only one dataset from a different team. I would however like us to ultimately implement a proper multi-tenant configuration, which would also entail changes on the loading side to provide the correct credentials as secrets.
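One way to provide "correct credentials as secrets" per team would be a naming convention for Airflow connection ids, resolved at DAG runtime with `airflow.hooks.base.BaseHook.get_connection(conn_id)`. The convention itself (`cassandra_<team>`) and the helper below are hypothetical, just to show the shape:

```python
def loader_conn_id(team: str) -> str:
    """Map an owning team to its Cassandra credential secret id.

    Hypothetical convention: one Airflow connection per team,
    named "cassandra_<team>", with spaces/hyphens normalized.
    """
    safe = team.strip().lower().replace(" ", "_").replace("-", "_")
    return f"cassandra_{safe}"
```

A DAG would then look up the connection for its owning team and pass `login`/`password` to the loader, keeping credentials out of job configuration.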

I didn't articulate this well: in a multi-tenant environment, we need to ensure that jobs creating/updating datasets can't interfere with one another. We wouldn't want, for example, a recurring update of an Image Suggestions dataset to clobber AQS data. Achieving this won't be hard (at least as long as Cassandra is the backing store); worst case, we just need a user for each, with access limited accordingly. That's one way, and it seems the work described here would permit at least this much.
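The "a user for each, with access limited accordingly" option maps directly onto Cassandra's role-based access control. A minimal sketch, generating the CQL a provisioning step might run per dataset (the role naming, keyspace-per-dataset layout, and helper are assumptions for illustration; the password would come from a secret store, not a literal):

```python
from typing import List


def tenant_grants(dataset: str, keyspace: str) -> List[str]:
    """Generate CQL to create a per-dataset loader role confined to one keyspace."""
    role = f"{dataset}_loader"
    return [
        # '<secret>' stands in for a password injected from a secret store.
        f"CREATE ROLE IF NOT EXISTS {role} WITH PASSWORD = '<secret>' AND LOGIN = true",
        # Write and read access only on this dataset's keyspace.
        f"GRANT MODIFY ON KEYSPACE {keyspace} TO {role}",
        f"GRANT SELECT ON KEYSPACE {keyspace} TO {role}",
    ]
```

Because the role is granted permissions only on its own keyspace, a misconfigured Image Suggestions job could not touch AQS data even if it tried.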

The other way would be to decouple storage from the jobs that generate data, and manage tenancy in whatever middleware is used to accomplish that. I don't think multi-tenancy alone would justify doing this; the real benefits would come from the looser coupling.

My gut tells me it's too early to be tackling this now, but I wanted to throw it out there.

Closing. Implementation work will happen in T296758.