
📊 [PLACEHOLDER] We should implement a data loader for Cassandra
Closed, Resolved · Public

Description

Discussion points for grooming:

  • How do we want to load datasets from Airflow to Cassandra?
  • Do we want to recommend different loading methods based on dataset size?
  • Can we offer general Airflow components to do the loading? Can we abstract it, for example by passing a Pandas DataFrame to a common load function?
  • How do we handle/store access credentials in Airflow/ETL solutions?
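To make the abstraction question above concrete, here is a minimal sketch of what a common "DataFrame to Cassandra" load function could look like. The function and parameter names (`build_insert_cql`, `load_rows`) are hypothetical, not an existing component; rows are plain dicts, the shape `DataFrame.to_dict("records")` produces.

```python
from typing import Dict, Iterable, List


def build_insert_cql(keyspace: str, table: str, columns: List[str]) -> str:
    """Build a parameterized CQL INSERT statement for the given columns."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {keyspace}.{table} ({cols}) VALUES ({placeholders})"


def load_rows(session, keyspace: str, table: str, rows: Iterable[Dict]) -> int:
    """Insert rows (e.g. pandas DataFrame.to_dict('records')) through a
    driver session object that exposes execute(cql, params)."""
    count = 0
    for row in rows:
        cols = sorted(row)  # stable column order per row
        cql = build_insert_cql(keyspace, table, cols)
        session.execute(cql, [row[c] for c in cols])
        count += 1
    return count
```

A real implementation would likely batch or use prepared statements for throughput; this only illustrates the shape of the shared interface.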

In prod we likely won't rely on cqlsh to load data.

We'll need a scalable solution to load IMA data. Some implementation details will depend on how we'll access Cassandra from k8s. Discussion ongoing at https://phabricator.wikimedia.org/T280042.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 29 2021, 7:18 PM

How do we want to load datasets from Airflow to Cassandra?

While we think about this, let's also consider the more general case: How do we load from X to Y. Right now, X is probably Hadoop or Kafka. Y might be Cassandra, Hadoop, Kafka, MariaDB, Swift.

Flink has connectors for all these things. From reading the docs it seems like it should be possible to automate these integrations, as long as we have a standardized way of serializing and deserializing records from each of these sources (JSON+JSONSchema, Parquet, etc.). In practice this might not work, but we should probably explore whether it is possible.
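The "standardized serialization" idea can be illustrated with a tiny JSON round-trip guarded by a schema check. The schema and field names here are made up for illustration; a real setup would validate against an actual JSONSchema rather than this hand-rolled check.

```python
import json

# Illustrative stand-in for a JSONSchema: required fields and their types.
SCHEMA = {"required": ["wiki", "page_id"], "types": {"wiki": str, "page_id": int}}


def serialize(record: dict) -> bytes:
    """Serialize a record to JSON bytes, rejecting records missing required fields."""
    missing = [f for f in SCHEMA["required"] if f not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return json.dumps(record).encode("utf-8")


def deserialize(data: bytes) -> dict:
    """Deserialize JSON bytes, rejecting records with wrongly typed fields."""
    record = json.loads(data.decode("utf-8"))
    for field, typ in SCHEMA["types"].items():
        if not isinstance(record.get(field), typ):
            raise ValueError(f"bad type for {field}")
    return record
```

With a shared serde contract like this, a connector only needs to know the schema to move records between any source and sink.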

I'm wondering how to tie this into the notion of tenancy for backing stores (Cassandra, for the time being). For example: will we have a single tenant (read: a unique user with credentials et al.) for the data loading process, or will we have many (presumably one for each dataset being loaded)? In a world where a platform lets arbitrary teams own scheduled jobs that persist output to a backing store in a more-or-less self-service fashion, we would want to ensure that an aberrant change or misconfiguration in one job cannot inadvertently step on the data of another (which separate credentials and permissions would provide).

Based on the code linked in the description (HiveToCassandra.scala), I assume we're looking at the latter, and will need to create database roles and corresponding credentials that match the job configuration passed to the loader, is this correct?

Is it worth considering the former (at this time)? I can see benefits to this approach (simplicity, decoupling of storage, ...), but concede that it may be premature to go there at this time.

I assume we're looking at the latter, and will need to create database roles and corresponding credentials that match the job configuration passed to the loader, is this correct?

The loader takes a user and password as parameters, which allows both of the approaches you describe above (single- or multi-tenant). I think we should aim at a multi-tenant configuration, possibly one per team owning datasets? This would prevent errors like the ones you describe.
For the moment we use a single tenant (we were a single team loading data :)), and I'm supportive of not doing multi-tenant for now, as we're talking about only one dataset from a different team. I would however like us to ultimately implement a proper multi-tenant configuration, which would also entail changes on the loading side to provide the correct credentials as secrets.
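One way to provide "correct credentials as secrets" per team would be a naming convention for Airflow connection ids, resolved at DAG runtime with `airflow.hooks.base.BaseHook.get_connection(conn_id)`. The convention itself (`cassandra_<team>`) and the helper below are hypothetical, just to show the shape:

```python
def loader_conn_id(team: str) -> str:
    """Map an owning team to its Cassandra credential secret id.

    Hypothetical convention: one Airflow connection per team,
    named "cassandra_<team>", with spaces/hyphens normalized.
    """
    safe = team.strip().lower().replace(" ", "_").replace("-", "_")
    return f"cassandra_{safe}"
```

A DAG would then look up the connection for its owning team and pass `login`/`password` to the loader, keeping credentials out of job configuration.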

I didn't articulate this well: in a multi-tenant environment, we need to ensure that jobs creating/updating datasets can't interfere with one another. We wouldn't want, for example, a recurring update of an Image Suggestions dataset to clobber AQS data. Achieving this won't be hard (at least as long as Cassandra is the backing store); worst case, we just need a user for each, with access limited accordingly. That's one way, and it seems the work described here would permit at least this much.
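The "a user for each, with access limited accordingly" option maps directly onto Cassandra's role-based access control. A minimal sketch, generating the CQL a provisioning step might run per dataset (the role naming, keyspace-per-dataset layout, and helper are assumptions for illustration; the password would come from a secret store, not a literal):

```python
from typing import List


def tenant_grants(dataset: str, keyspace: str) -> List[str]:
    """Generate CQL to create a per-dataset loader role confined to one keyspace."""
    role = f"{dataset}_loader"
    return [
        # '<secret>' stands in for a password injected from a secret store.
        f"CREATE ROLE IF NOT EXISTS {role} WITH PASSWORD = '<secret>' AND LOGIN = true",
        # Write and read access only on this dataset's keyspace.
        f"GRANT MODIFY ON KEYSPACE {keyspace} TO {role}",
        f"GRANT SELECT ON KEYSPACE {keyspace} TO {role}",
    ]
```

Because the role is granted permissions only on its own keyspace, a misconfigured Image Suggestions job could not touch AQS data even if it tried.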

The other way would be to decouple storage from the jobs that generate data, and manage tenancy in whatever middleware is used to accomplish that. I don't think multi-tenancy alone would justify doing this; the real benefits would come from the looser coupling.

My gut tells me it's too early to be tackling this now, but I wanted to throw it out there.

Closing. Implementation work will happen in T296758.