
Ensure AQS Cassandra client connections are multi-datacenter
Closed, ResolvedPublic

Description

In preparation for the expansion of the AQS cluster to codfw, we should verify that Cassandra client connections:

  • Use DCAwareRoundRobinPolicy (or similar)
  • Specify their local datacenter
  • Use LOCAL_ variants of consistency levels (as necessary)
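For illustration, the selection behavior of a DC-aware round-robin policy can be sketched in plain Python (the `Host` class and host names here are hypothetical; the real drivers implement this internally):

```python
class Host:
    """Minimal stand-in for a driver's host metadata."""
    def __init__(self, address, datacenter):
        self.address = address
        self.datacenter = datacenter

def dc_aware_round_robin(hosts, local_dc):
    """Yield hosts from the local datacenter in round-robin order.

    Remote-DC hosts are never returned, so all coordinator traffic stays
    within local_dc; replication to other DCs is Cassandra's job.
    """
    local = [h for h in hosts if h.datacenter == local_dc]
    if not local:
        raise RuntimeError("no hosts in local datacenter %r" % local_dc)
    i = 0
    while True:
        yield local[i % len(local)]
        i += 1

hosts = [
    Host("aqs-a.eqiad.example", "eqiad"),   # hypothetical host names
    Host("aqs-b.codfw.example", "codfw"),
    Host("aqs-c.eqiad.example", "eqiad"),
]
plan = dc_aware_round_robin(hosts, "eqiad")
picks = [next(plan).address for _ in range(4)]
# picks cycles over the two eqiad hosts only; codfw is never contacted
```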

Clients to verify/fix:

Event Timeline

Eevans updated the task description. (Show Details)

@Eevans : The AQS-loader is not datacenter-aware. It takes base hosts as a parameter and discovers the Cassandra cluster topology by querying the known host(s). However, I think it would be good to restrict data loading from Hadoop to just eqiad, to prevent pushing too much data at once across DCs. In my mind we would restrict sending data to just eqiad hosts, and let Cassandra replicate the data (possibly with throughput limitation). Is that even feasible? Is it a good idea? Thoughts welcome :)

It's not only feasible (and a good idea), it's exactly what this datacenter awareness is for! :)

The drivers (when properly configured) have a notion of which datacenter is local, and the use of LOCAL_QUORUM (I assume we're currently using QUORUM) will provide exactly the behavior you describe (synchronous replication to a quorum in eqiad, and asynchronous to the rest - including all replicas in codfw). Do the AQS-loader's supporting libraries (Spark?) not support this?
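As a quick sanity check on the arithmetic (a sketch, assuming a replication factor of 3 in each of the two datacenters):

```python
def quorum(replicas):
    # A quorum of N replicas is floor(N / 2) + 1.
    return replicas // 2 + 1

# QUORUM counts replicas cluster-wide: with RF=3 in eqiad and RF=3 in
# codfw, a write blocks on 4 acks, some necessarily from the remote DC.
cluster_wide = quorum(3 + 3)

# LOCAL_QUORUM counts only local-DC replicas: 2 eqiad acks suffice, and
# the codfw replicas receive the write asynchronously.
local_only = quorum(3)
```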

The spark-cassandra-connector indeed supports setting the consistency. It defaults to LOCAL_QUORUM for writes and LOCAL_ONE for reads.
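For reference, these are the corresponding spark-cassandra-connector settings, shown with those default values (the connection host below is a hypothetical eqiad contact point):

```
spark.cassandra.connection.host=aqs-loader-contact.eqiad.example
spark.cassandra.output.consistency.level=LOCAL_QUORUM
spark.cassandra.input.consistency.level=LOCAL_ONE
```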

Does it let you configure a load-balancing policy and local data-center?

From what I have seen, we can't specify either a load-balancing policy or the local datacenter. BUT, from the docs: "Connections are never made to data centers other than the data center of spark.cassandra.connection.host." https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md

From what I have read, version 4 of the Cassandra driver only allows for local connections (from the docs: "In driver 4+, we are taking a more opinionated approach: we provide a single load balancing policy, that we consider the best choice for most cases." https://docs.datastax.com/en/developer/java-driver/4.2/manual/core/load_balancing/).

My assumption is that the connector uses the datacenter of the host provided in the parameter as the local one. Interestingly, it is possible to provide multiple hosts in this parameter - I assume you could mess up locality by providing non-local connection hosts in the list, but eh, we just won't do that :)

Awesome; Thanks @JAllemandou !

For posterity's sake:

image.png (135×1 px, 35 KB)

So we're good on this one.

It's been reported elsewhere that the bulk loader for Image Suggestions uses the same underlying code as AQS to interface with Cassandra. @mfossati, @Cparle, if you can confirm this, we can check it off the list.

Do we have any code yet for the feedback pipeline? Does anyone know who is implementing (or has implemented) that?

AQS supports multi-DC by virtue of RESTBase supporting it, and it has been configured with localQuorum consistency and eqiad as the local datacenter.

@Eevans , we're using the org.wikimedia.analytics.refinery.job.HiveToCassandra class contained in the fat jar located at hdfs:///wmf/refinery/current/artifacts/refinery-job-shaded.jar.

From the Data Platform side, please see T302925: [SPIKE] Investigate and Decide on Solution for Image Suggestions Feedback, where it seems that @Ottomata has worked on it. From the data pipeline side, see T299890: [M] Exclude previously rejected image suggestions when generating new suggestions. @Cparle has worked on it.

Since the purpose of this ticket was to ensure that no existing clients would be caught unaware by the addition of a new datacenter, and since the image suggestions feedback pipeline does not yet exist (i.e. it is not among the set of "existing clients"), I am closing this issue as complete.

Eevans claimed this task.
Eevans updated the task description. (Show Details)
Eevans moved this task from Next to Complete on the Cassandra board.