Cassandra Node.JS driver v3.2.2 issues
Closed, ResolvedPublic

Description

The Cassandra Node.JS driver started using keyspace-based routing in v3.2.2, which has been causing issues for RESTBase in production during start-up: spikes in CPU utilisation and memory consumption. We have yet to identify what exactly is going awry.

See NODEJS-371 for more information.

Event Timeline

@mobrovac, could you update the upstream task with the info about .connect() not waiting for warmup?

Done.

With the upstream task closed, it now seems to be on us to re-test with warmup enabled. Moving back to "next" for that reason.
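For the re-test, a minimal sketch of how warmup could be switched on through the driver's `pooling.warmup` client option. The contact points, the keyspace name and the `buildClientOptions` helper are hypothetical placeholders, not RESTBase's actual configuration, and whether `.connect()` really waits for the warmed-up pool in a given driver version is exactly what the re-test would need to confirm.

```javascript
// Sketch only: contact points and keyspace are hypothetical placeholders.
function buildClientOptions({ warmup }) {
  return {
    contactPoints: ['cassandra-host-1.example', 'cassandra-host-2.example'],
    keyspace: 'example_keyspace',
    pooling: {
      // Asks the driver to open its connections during connect()
      // instead of lazily on first use.
      warmup: Boolean(warmup),
    },
  };
}

// With cassandra-driver installed, the options would be passed like this:
//   const cassandra = require('cassandra-driver');
//   const client = new cassandra.Client(buildClientOptions({ warmup: true }));
//   client.connect().then(() => console.log('connected'));

const withWarmup = buildClientOptions({ warmup: true });
const withoutWarmup = buildClientOptions({ warmup: false });
console.log(JSON.stringify(withWarmup.pooling), JSON.stringify(withoutWarmup.pooling));
```

Keeping the options in a small builder makes it easy to run the same test twice, once with and once without warmup, and compare start-up CPU and memory.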

Eevans lowered the priority of this task from High to Medium. Jul 3 2018, 10:13 PM

@Pchelolo, @mobrovac Is this still relevant?

Last time I checked, it was. We will need to carefully try out upgrading to the latest version and see if the issue persists. I believe it will, however.

I have manually installed driver v3.5.0 on restbase-dev1004 and run some tests; so far everything looks good. I believe we should try deploying it on a canary host in production.

Mentioned in SAL (#wikimedia-operations) [2018-07-06T10:01:11Z] <mobrovac> restbase depool restbase2001 to test the cassandra node driver v3.5.0 - T169009

v3.5.0 seems to work in production. There is one caveat, though: the RESTBase workers now need twice as much memory as with v3.2.1, which implies that the driver itself uses much more memory. On the upside, this memory appears to be used to populate the driver's cache with information about the connections to the Cassandra instances: it fills up during start-up and then stays stable throughout a worker's life.
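One way to check the "fills during start-up, then stays stable" pattern on a single worker is to sample the process's memory before and after the driver connects. This is a rough sketch using Node's built-in `process.memoryUsage()`; the helper names (`toMiB`, `sampleMemory`, `rssRatio`) are made up for illustration and it is not how the figures above were obtained.

```javascript
// Convert a byte count to mebibytes, rounded to one decimal place.
function toMiB(bytes) {
  return Math.round((bytes / (1024 * 1024)) * 10) / 10;
}

// Take one labelled snapshot of the current process's memory usage.
function sampleMemory(label) {
  const { rss, heapUsed } = process.memoryUsage();
  return { label, rssMiB: toMiB(rss), heapUsedMiB: toMiB(heapUsed) };
}

// Ratio between two RSS readings; about 2 would match "twice as much memory".
function rssRatio(before, after) {
  return after.rssMiB / before.rssMiB;
}

const atStart = sampleMemory('before connect');
// ... in a real worker, client.connect() and some traffic would happen here ...
const afterStartup = sampleMemory('after start-up');
console.log(atStart, afterStartup, rssRatio(atStart, afterStartup).toFixed(2));
```

Taking a third sample some time after start-up and comparing it with `afterStartup` would show whether usage has plateaued or is still growing.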

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:30:24Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3] (dev-cluster): Upgrade cassandra driver to 3.5.0 T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:34:44Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3] (dev-cluster): Upgrade cassandra driver to 3.5.0 T169009 (duration: 04m 20s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:38:40Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:41:53Z] <ppchelko@deploy1001> deploy aborted: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. T169009 (duration: 03m 13s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:44:28Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. Take 2, check timed out. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T09:55:42Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 1. Only codfw. Take 2, check timed out. T169009 (duration: 11m 14s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T10:58:05Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T11:06:12Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere. T169009 (duration: 08m 07s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T11:06:26Z] <ppchelko@deploy1001> Started deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere.Take 2, feeds timed out. T169009

Mentioned in SAL (#wikimedia-operations) [2018-07-11T11:09:22Z] <ppchelko@deploy1001> Finished deploy [restbase/deploy@353eca3]: Upgrade cassandra driver to 3.5.0. Part 2. Everywhere.Take 2, feeds timed out. T169009 (duration: 02m 56s)

Pchelolo claimed this task.
Pchelolo edited projects, added Services (done); removed Services (next).

The driver has been upgraded to 3.5.0 in production. Resolving.

The new version of the driver requires significantly more memory; filed https://datastax-oss.atlassian.net/browse/NODEJS-460 to ask the driver maintainers for advice.

The number of RangeSlice requests to Cassandra has dropped to 0, as expected, since the new driver now uses OPTIONS requests for heartbeats: https://grafana.wikimedia.org/dashboard/db/cassandra-client-request?orgId=1

An interesting effect is that during deploy/start-up there is a significant spike in READ requests to a single Cassandra node. I will restart the cluster in one of the DCs later, and if that happens again, I will file a ticket for the driver.