Cassandra Node.JS driver v3.2.2 issues
Closed, Resolved (Public)

Assigned To

Authored By

	• mobrovac
	Jun 27 2017, 6:47 PM

Description

The Cassandra Node.JS driver started using keyspace-based routing in v3.2.2, which has been causing issues for RESTBase in production during start-up: spikes in CPU utilisation and memory consumption. We have yet to identify what exactly is going awry.

See NODEJS-371 for more information.

Related Objects

Mentioned In: T127472: Investigate reducing impact of single-node Cassandra latencies

Event Timeline

• mobrovac created this task. Jun 27 2017, 6:47 PM

Restricted Application added a subscriber: Aklapper. Jun 27 2017, 6:47 PM

@mobrovac, could you update the upstream task with the info about .connect() not waiting for warmup?

In T169009#3384049, @GWicke wrote:

@mobrovac, could you update the upstream task with the info about .connect() not waiting for warmup?

• GWicke moved this task from doing to blocked on the Services board. Jul 12 2017, 7:54 PM

• GWicke edited projects, added Services (blocked); removed Services (doing).

With the upstream task closed, it now seems to be on us to re-test with warmup enabled. Moving back to "next" for that reason.
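For reference, a minimal sketch of what "warmup enabled" could look like on the client side. The `pooling.warmup` option is part of the driver's client options; the helper name `buildClientOptions`, the contact point, and the keyspace name are made up for illustration and are not taken from RESTBase. The driver itself is not loaded here, so the snippet runs on its own.

```javascript
'use strict';

// Hypothetical helper (not RESTBase code): builds the options object that
// would be handed to `new cassandra.Client(options)` from cassandra-driver.
function buildClientOptions(contactPoints, keyspace) {
  return {
    contactPoints,
    keyspace,
    pooling: {
      // Ask the driver to open its pooled connections as part of connect(),
      // rather than lazily once requests start flowing.
      warmup: true
    }
  };
}

// Placeholder values, for illustration only.
const options = buildClientOptions(['127.0.0.1'], 'example_keyspace');
console.log(JSON.stringify(options.pooling));

// With the driver installed, the intended use would be roughly:
//   const cassandra = require('cassandra-driver');
//   const client = new cassandra.Client(options);
//   client.connect().then(() => { /* start serving requests */ });
```

Whether this removes the start-up spikes is exactly what the re-test would need to show.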

@Pchelolo, @mobrovac Is this still relevant?

• mobrovac mentioned this in T127472: Investigate reducing impact of single-node Cassandra latencies. Jul 4 2018, 6:03 AM

Last time I checked, it was. We will need to carefully try out upgrading to the latest version and see if the issue persists. I believe it will, however.

I have manually installed driver v3.5.0 on restbase-dev1004 and run some tests. So far everything looks good. I believe we should try to deploy it on a canary host in production.

Let's try it? https://github.com/wikimedia/restbase-mod-table-cassandra/pull/218

Mentioned in SAL (#wikimedia-operations) [2018-07-06T10:01:11Z] <mobrovac> restbase depool restbase2001 to test the cassandra node driver v3.5.0 - T169009

v3.5.0 seems to work in production. There is one caveat, though: the RESTBase workers now need twice as much memory as with v3.2.1, which implies that the driver uses much more memory now. On the upside, it seems that this memory is used to populate the driver's cache with info about the connections towards Cassandra instances, since the memory gets filled during start-up, but it is stable throughout a worker's life.
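One way to check the "fills during start-up, then stable" pattern from inside a worker is to sample its resident set size and compare growth during start-up against drift afterwards. The sketch below is an assumption about how one might do that, not how the numbers above were obtained; the helpers `rssMiB` and `summarise` are hypothetical, and the sample values are invented purely to show the calculation.

```javascript
'use strict';

// Hypothetical helper: current resident set size of this process in MiB.
function rssMiB() {
  return process.memoryUsage().rss / (1024 * 1024);
}

// Hypothetical helper: splits a series of RSS samples (MiB) into a start-up
// window and a steady-state window, then reports growth and drift.
function summarise(samples, startupCount) {
  const startup = samples.slice(0, startupCount);
  const steady = samples.slice(startupCount);
  return {
    // Growth from the first to the last start-up sample.
    startupGrowth: startup[startup.length - 1] - startup[0],
    // Spread between the highest and lowest steady-state samples.
    steadyDrift: Math.max(...steady) - Math.min(...steady)
  };
}

// Invented sample values, NOT measurements from RESTBase.
const example = summarise([200, 350, 400, 402, 401, 403], 3);
console.log(example);
console.log('current RSS (MiB):', rssMiB().toFixed(1));
```

In a real worker the samples would come from calling `rssMiB()` on a timer. A large `startupGrowth` with a small `steadyDrift` would match the behaviour described above.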

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:30:24Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3] (dev-cluster): Upgrade cassandra driver to 3.5.0 T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:34:44Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3] (dev-cluster): Upgrade cassandra driver to 3.5.0 T169009 (duration: 04m 20s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:38:40Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:41:53Z] <ppchelko@deploy1001> deploy aborted: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. T169009 (duration: 03m 13s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:44:28Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. Take 2, check timed out. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:55:42Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. Take 2, check timed out. T169009 (duration: 11m 14s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T10:58:05Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T11:06:12Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere. T169009 (duration: 08m 07s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T11:06:26Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere.Take 2, feeds timed out. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T11:09:22Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere.Take 2, feeds timed out. T169009 (duration: 02m 56s)

The driver has been upgraded to 3.5.0 in production. Resolving.

The new version of the driver requires significantly more memory. Filed https://datastax-oss.atlassian.net/browse/NODEJS-460 to ask for the driver maintainers' advice.

The number of RangeSlice requests to Cassandra has dropped to 0, as expected, since the new driver now uses OPTIONS requests for heartbeats: https://grafana.wikimedia.org/dashboard/db/cassandra-client-request?orgId=1

An interesting effect is that during deploy/startup there's a significant spike in READ requests to a single Cassandra node. I will restart the cluster in one of the DCs later and if that happens again, I will file a ticket for the driver.
