Cassandra client encryption
Closed, ResolvedPublic

Description

The nodejs client's DCAwareRoundRobinPolicy allows client fail-over to nodes in another DC when no local nodes are available. Enabling this is desirable, but implicitly requires that we encrypt the client connection.

Configuring the nodejs client is fairly straightforward but, as with T108953, we will need keys and certificates generated for each node/instance, and a copy of the node/instance certificates from T108953 available in PEM format on each node/instance.

var cassandra = require('cassandra-driver');

var options = {
  policies: {
    // Prefer nodes in the local DC, failing over to remote DCs if needed
    loadBalancing: new cassandra.policies.loadBalancing.DCAwareRoundRobinPolicy()
  },
  sslOptions: {
    key: '...',     // string or Buffer containing the private key in PEM format
    cert: '...',    // string or Buffer containing the certificate in PEM format
    ca:  '...'      // an array of strings or Buffers containing trusted certs in PEM format
  }
};

var client = new cassandra.Client(options);

http://docs.datastax.com/en/developer/nodejs-driver/3.2/api/type.ClientOptions

Update

RESTBase support for client encryption is there; configuring it is as simple as:

table:
  backend: cassandra
  ...
  # See: https://nodejs.org/api/tls.html#tls_tls_createsecurecontext_options
  tls:
    key:  <path>
    cert: <path>
    ca:   <path>
    ...
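
A minimal sketch of how a path-based tls section like the one above could be turned into the sslOptions object the nodejs driver expects. This is not RESTBase's actual implementation; the helper name loadSslOptions and the throwaway file used for the demonstration are assumptions made for illustration.

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: read each configured path and return the file
// contents under the same key, leaving any other option untouched.
function loadSslOptions(tlsConf) {
  const sslOptions = Object.assign({}, tlsConf);
  for (const name of ['key', 'cert', 'ca']) {
    if (typeof tlsConf[name] === 'string') {
      sslOptions[name] = fs.readFileSync(tlsConf[name]);
    }
  }
  return sslOptions;
}

// Demonstration with a throwaway file standing in for a real CA bundle.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'tls-sketch-'));
const caPath = path.join(dir, 'ca.pem');
fs.writeFileSync(caPath, 'placeholder, not a real certificate\n');

const sslOptions = loadSslOptions({ ca: caPath });
console.log(Buffer.isBuffer(sslOptions.ca));
```

Reading the files once at startup keeps the configuration itself free of key material, which is presumably why paths rather than PEM strings are configured.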

Cassandra does in fact support optional client encryption, which will vastly simplify rollout.

There are a few approaches to this we could take, depending on the problem we're trying to solve:

  1. Encryption-only
    1. Optional
    2. Required
  2. Encryption + certificate-based client auth
    1. Optional
    2. Required

If all we are concerned with is the avoidance of drivers communicating in the clear (which risks exposure of the password), then (1a) would be very easy to set up, and we do not need to worry about using (real) keys/certs on the client. A trivial puppet changeset to enable client encryption in Cassandra and a rolling restart later, and we're done.

(2a) would provide additional security, but at the expense of additional complexity. cassandra-ca-manager already creates all of the key material we will need, but would need to be taught to export the key in PEM-encoded format as well. For existing keys, we'd need to do this manually, commit the result to the private repo, and deploy them to the nodes.

Any of (1b), (2a), or (2b) will also require that we come up with an answer for locally sourced cqlsh connections (straightforward I assume, but worth mentioning).
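
One way to read the four approaches is as combinations of two switches in the client_encryption_options section of cassandra.yaml: optional and require_client_auth. The sketch below spells out that reading. The helper name and the use of the labels 1a through 2b as input are illustrative, keystore and truststore settings are left out, and the mapping for (2a) in particular is an interpretation rather than something the task states.

```javascript
'use strict';

// Hypothetical helper mapping the approach labels used above to the
// corresponding cassandra.yaml client_encryption_options switches.
function clientEncryptionOptions(approach) {
  const match = /^([12])([ab])$/.exec(approach);
  if (!match) {
    throw new Error('unknown approach: ' + approach);
  }
  return {
    enabled: true,
    // "a" variants still accept unencrypted connections.
    optional: match[2] === 'a',
    // "2" variants additionally verify the client's certificate.
    require_client_auth: match[1] === '2'
  };
}

for (const approach of ['1a', '1b', '2a', '2b']) {
  console.log(approach, JSON.stringify(clientEncryptionOptions(approach)));
}
```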


Proposed rollout strategy (eqiad)

1. Update RESTBase config for defaultConsistency of localOne
2. Disable puppet on restbase100[1-9].eqiad nodes
3. Merge cassandra.yaml config change that enables client encryption
4. Re-enable puppet and restart Cassandra on 1001, 1003, and 1005, applying new encryption settings from #3.
5. Roll out RESTBase configuration change to enable client encryption
6. Restart Cassandra on 1002, 1004, 1006, and 100[7-9] (applying new encryption settings from #3).
7. Restore RESTBase config for defaultConsistency to localQuorum

The idea here is that when client encryption is enabled on 1001, 1003, and 1005, existing RESTBase instances that would connect to these will fail over to the remaining nodes of the cluster when unable to connect in the clear (1, 3, and 5 represent nodes in each of racks A, B, and C). Next, RESTBase is reconfigured to use encryption and restarted, with connections failing over as needed to 1001, 1003, and 1005 when encrypted connections to 1002, 1004, 1006, and 100[7-9] fail. Finally, the client encryption settings are applied to the remaining Cassandra nodes (1002, 1004, 1006, and 100[7-9]). Setting the consistency level to localOne ensures that queries succeed during the window when only 1 node in each rack is accessible.
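
The failover reasoning above can be walked through with a toy model. The sketch assumes, as the paragraph does, that a node with client encryption enabled refuses cleartext connections and that a node without it refuses encrypted ones; the shortened host names and the phase variables are illustrative only.

```javascript
'use strict';

const nodes = ['1001', '1002', '1003', '1004', '1005', '1006', '1007', '1008', '1009'];

// A client can only reach nodes whose encryption setting matches its own.
function reachable(encryptedNodes, clientEncrypted) {
  return nodes.filter((n) => encryptedNodes.includes(n) === clientEncrypted);
}

const firstBatch = ['1001', '1003', '1005'];

// Phase 1: first batch restarted, RESTBase still connecting in the clear.
const phase1 = reachable(firstBatch, false);
// Phase 2: RESTBase switched to encrypted connections, same Cassandra state.
const phase2 = reachable(firstBatch, true);
// Phase 3: every Cassandra node restarted with client encryption.
const phase3 = reachable(nodes, true);

console.log({ phase1, phase2, phase3 });
```

The model only tracks which nodes accept a connection; it says nothing about whether the reachable nodes hold replicas for every token range.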

Alternative rollout strategy (eqiad)

1. Configure codfw nodes for client encryption
2. Update RESTBase config to set localDc to codfw
3. Roll out RESTBase configuration change to enable client encryption (and restart RESTBase)
4. Merge cassandra.yaml config change that enables client encryption (and restart eqiad Cassandra nodes)
5. Restore RESTBase config for localDc to eqiad

See also: T108953

Eevans created this task. Sep 1 2015, 9:39 PM
Eevans added a subscriber: Eevans.
Restricted Application added a subscriber: Aklapper. Sep 1 2015, 9:39 PM
Eevans triaged this task as "Normal" priority. Sep 1 2015, 9:40 PM
Eevans added a project: RESTBase-Cassandra.
Eevans set Security to None.
Eevans added a subscriber: fgiunchedi.
Eevans edited the task description. Sep 1 2015, 9:42 PM
Eevans added a comment. Sep 9 2015, 6:53 PM

So, as it turns out, you can in fact encrypt client connections while disabling certificate based authentication in Cassandra (require_client_auth: false). On the node-side, the keystore (and corresponding password) are still required, on the client-side, 'a certificate' (any certificate, apparently) is required.

From a software standpoint, I don't think this changes anything; RESTBase needs to be made configurable in this regard (pull-request for that forthcoming), but it raises the question of how we should configure our environment.

The raison d'etre here was to avoid sending traffic in the clear. However, since even with client authentication disabled, it's still going to be necessary to supply a certificate for client connections (including cqlsh), perhaps it's worthwhile to go all the way. With https://gerrit.wikimedia.org/r/#/c/236389/ in place, generating signed client certificates would require no additional effort, and might render T92590 moot.

These are two separate concerns, which both fall under authentication, albeit of a different kind.

> So, as it turns out, you can in fact encrypt client connections while disabling certificate based authentication in Cassandra (require_client_auth: false). On the node-side, the keystore (and corresponding password) are still required, on the client-side, 'a certificate' (any certificate, apparently) is required.

Any certificate, yes, but not one that doesn't clearly identify the communicating sides. The main benefit here, apart from not sending plain-text data, is avoiding MITM attacks.

> The raison d'etre here was to avoid sending traffic in the clear. However, since even with client authentication disabled, it's still going to be necessary to supply a certificate for client connections (including cqlsh), perhaps it's worthwhile to go all the way. With https://gerrit.wikimedia.org/r/#/c/236389/ in place, generating signed client certificates would require no additional effort, and might render T92590 moot.

If I understand you correctly, you are mixing two different things. With regard to certificates (and SSL/TLS in general), they ensure that both sides of the connection are who they say they are. That does not automatically mean they have the right to connect to the cluster, though. So, from the CA's perspective (and its certificate), it can only ensure the former, but not the latter.

Eevans added a comment. Sep 9 2015, 8:25 PM

> ...on the client-side, 'a certificate' (any certificate, apparently) is required.

Slight correction/clarification. The driver requires a cert option, but it needn't be an actual certificate (a zero-length string, for example, will suffice). This requires that RESTBase have some means of configuring encryption, but it does not (necessarily) need to have an actual certificate.

Eevans added a comment. Sep 9 2015, 8:35 PM

> These are two separate concerns, which both fall under authentication, albeit of a different kind.

>> So, as it turns out, you can in fact encrypt client connections while disabling certificate based authentication in Cassandra (require_client_auth: false). On the node-side, the keystore (and corresponding password) are still required, on the client-side, 'a certificate' (any certificate, apparently) is required.

> Any certificate, yes, but not one that doesn't clearly identify the communicating sides. The main benefit here, apart from not sending plain-text data, is avoiding MITM attacks.

Actually, it can be a certificate that doesn't clearly identify the communicating sides, it can even be a zero-length Buffer (i.e. it can be anything, or left unconfigured entirely). I'm not stating this is Good, but this is how it works.

>> The raison d'etre here was to avoid sending traffic in the clear. However, since even with client authentication disabled, it's still going to be necessary to supply a certificate for client connections (including cqlsh), perhaps it's worthwhile to go all the way. With https://gerrit.wikimedia.org/r/#/c/236389/ in place, generating signed client certificates would require no additional effort, and might render T92590 moot.

> If I understand you correctly, you are mixing two different things. With regard to certificates (and SSL/TLS in general), they ensure that both sides of the connection are who they say they are. That does not automatically mean they have the right to connect to the cluster, though. So, from the CA's perspective (and its certificate), it can only ensure the former, but not the latter.

The goal of this issue was to ensure that traffic not be sent in the clear, which can be achieved whether require_client_auth is enabled or not. Enabling require_client_auth would allow us to limit client connections to only those that are "trusted" (i.e. those with a certificate signed by our CA). That is a form of authentication, and possibly one that would satisfy the requirements of T92590.

> Actually, it can be a certificate that doesn't clearly identify the communicating sides, it can even be a zero-length Buffer (i.e. it can be anything, or left unconfigured entirely). I'm not stating this is Good, but this is how it works.

Uuu, that's fishy, to say the least.

> The goal of this issue was to ensure that traffic not be sent in the clear, which can be achieved whether require_client_auth is enabled or not. Enabling require_client_auth would allow us to limit client connections to only those that are "trusted" (i.e. those with a certificate signed by our CA). That is a form of authentication, and possibly one that would satisfy the requirements of T92590.

Ah, I see, I misunderstood then :) And yes, I agree that it would be enough as long as we can whitelist orgs / domains / hosts of the connecting clients.

> Ah, I see, I misunderstood then :) And yes, I agree that it would be enough as long as we can whitelist orgs / domains / hosts of the connecting clients.

We do have basic firewalling in place, which limits access to RESTBase node IPs only. While this isn't perfect, I think it should be good enough for now, especially considering that the link between eqiad & codfw is over private fiber & not the public internet.

Eevans edited the task description. Sep 10 2015, 4:44 PM

Re roll-out strategy:

> The idea here being that when client encryption is enabled on 1001, 1003, and 1005, existing restbase instances that would connect to these, will fail-over to the remaining nodes of the cluster when unable to connect in the clear (1, 3, and 5 represent nodes in each of racks A, B, and C).

I don't see how this can work when you take out nodes from different racks. With vnodes, some key space ranges will end up mapping to two or even all of the unavailable nodes, which means that some clients *will* see errors. Taking out one rack at a time in combination with a temporary switch to consistency ONE could work, but this is something that we'd need to thoroughly test in staging first.

My initial assumption was that we'd avoid touching the eqiad nodes until codfw is fully up & replicated to. Once we are there, we might also have the option of temporarily failing over to codfw, which could give us a window for switching eqiad to full encryption without service downtime and with reduced risk.
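
The objection about taking out nodes from different racks can be illustrated by enumerating replica sets. This is a simplified model, assuming three racks of three nodes, a replication factor of 3 with one replica per rack, and every such combination of nodes owning some token range (a rough approximation of what vnodes produce). The rack assignments below are hypothetical beyond 1001, 1003, and 1005 sitting in racks A, B, and C.

```javascript
'use strict';

// Hypothetical rack layout; the task only says that 1001, 1003 and 1005
// are in racks A, B and C respectively.
const racks = {
  A: ['1001', '1002', '1007'],
  B: ['1003', '1004', '1008'],
  C: ['1005', '1006', '1009']
};

// Count replica sets (one node per rack) by how many replicas remain up.
function survivingReplicaCounts(downNodes) {
  const counts = [0, 0, 0, 0];
  for (const a of racks.A) {
    for (const b of racks.B) {
      for (const c of racks.C) {
        const up = [a, b, c].filter((n) => !downNodes.includes(n)).length;
        counts[up] += 1;
      }
    }
  }
  return counts; // counts[k] = number of replica sets with k replicas up
}

const onePerRack = survivingReplicaCounts(['1001', '1003', '1005']);
const wholeRack = survivingReplicaCounts(racks.A);
console.log('one node per rack down:', onePerRack);
console.log('one whole rack down:   ', wholeRack);
```

In this model, taking one node out of each rack leaves 1 of the 27 replica sets with no reachable replica and 6 more with only one, whereas taking out a whole rack always leaves two replicas reachable.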

> Re roll-out strategy:

>> The idea here being that when client encryption is enabled on 1001, 1003, and 1005, existing restbase instances that would connect to these, will fail-over to the remaining nodes of the cluster when unable to connect in the clear (1, 3, and 5 represent nodes in each of racks A, B, and C).

> I don't see how this can work when you take out nodes from different racks. With vnodes, some key space ranges will end up mapping to two or even all of the unavailable nodes, which means that some clients *will* see errors.

Yes, you're right.

> Taking out one rack at a time in combination with a temporary switch to consistency ONE could work, but this is something that we'd need to thoroughly test in staging first.

This seems like it would work, yes.

> My initial assumption was that we'd avoid touching the eqiad nodes until codfw is fully up & replicated to. Once we are there, we might also have the option of temporarily failing over to codfw, which could give us a window for switching eqiad to full encryption without service downtime and with reduced risk.

This should work as well.

Eevans edited the task description. Sep 14 2015, 10:19 PM

> My initial assumption was that we'd avoid touching the eqiad nodes until codfw is fully up & replicated to. Once we are there, we might also have the option of temporarily failing over to codfw, which could give us a window for switching eqiad to full encryption without service downtime and with reduced risk.

Could this be done entirely with pybal, I wonder?

faidon added a subscriber: faidon. May 12 2016, 4:39 PM

Ping? :) This is not urgent but let's not lose momentum and get this moving again — what are the blockers now?

To clarify, can we enable basic (un-authenticated) TLS in combination with password authentication? If so, then that sounds like it would be a decent intermediate step, with less complexity than per-node client certs.

@Eevans, could you move this along?

GWicke edited projects, added Services (next); removed Services. Feb 16 2017, 6:23 PM
Eevans edited the task description. Mar 8 2017, 5:16 PM
Eevans edited the task description. Mar 8 2017, 8:48 PM
Eevans claimed this task. Mar 8 2017, 8:53 PM

I believe it is sufficient to encrypt w/o requiring certificate authentication (we already require a password). We should ultimately require encryption, but that will need to be done in a second step (a clean migration will require it to be optional to start out).

Eevans edited the task description. Mar 8 2017, 11:31 PM
Eevans added a comment. Mar 9 2017, 4:29 PM

From the ops-services-syncup meeting today, consensus was to reuse the internode encryption key/certs on the server side, and to not configure certificate-based authentication (meaning we do not need client-side key/certs); I will prepare the changesets.

Change 342075 had a related patch set uploaded (by Eevans):
[operations/puppet] Optional Cassandra client encryption; Enabled on RESTBase Staging

https://gerrit.wikimedia.org/r/342075

Change 342088 had a related patch set uploaded (by Eevans):
[operations/puppet] Cassandra TLS configuration for RESTBase

https://gerrit.wikimedia.org/r/342088

Change 342075 merged by Filippo Giunchedi:
[operations/puppet] Optional Cassandra client encryption; Enabled on RESTBase Staging

https://gerrit.wikimedia.org/r/342075

Mentioned in SAL (#wikimedia-operations) [2017-03-14T16:10:40Z] <urandom> T111113: Restart Cassandra in RESTBase Staging to enable optional client encryption

Change 342679 had a related patch set uploaded (by Eevans):
[operations/puppet] [WIP] Enable cqlsh client encryption

https://gerrit.wikimedia.org/r/342679

Change 342898 had a related patch set uploaded (by Eevans):
[operations/puppet] Enable Cassandra client encryption in RESTBase production

https://gerrit.wikimedia.org/r/342898

Change 342903 had a related patch set uploaded (by Eevans):
[operations/puppet] [WIP]: Enable encrypted client connections in RESTBase production

https://gerrit.wikimedia.org/r/342903

Change 342904 had a related patch set uploaded (by Eevans):
[operations/puppet] [WIP]: Mandatory Cassandra client encryption

https://gerrit.wikimedia.org/r/342904

Change 342088 merged by Dzahn:
[operations/puppet] Cassandra TLS configuration for RESTBase

https://gerrit.wikimedia.org/r/342088

Mentioned in SAL (#wikimedia-operations) [2017-03-15T20:38:45Z] <urandom> T111113: Restarting xenon (RESTBase Staging) to enable client encryption (canary)

Change 342912 had a related patch set uploaded (by Eevans):
[operations/puppet] Use empty 'ca' directive, not 'cert'

https://gerrit.wikimedia.org/r/342912

Change 342912 merged by Dzahn:
[operations/puppet] Use empty 'ca' directive, not 'cert'

https://gerrit.wikimedia.org/r/342912

Change 342898 merged by Dzahn:
[operations/puppet] Enable (optional) Cassandra client encryption in RESTBase production

https://gerrit.wikimedia.org/r/342898

Mentioned in SAL (#wikimedia-operations) [2017-03-16T21:19:33Z] <urandom> T111113: Restarting Cassandra on restbase1007-a to enable (optional) client encryption

Mentioned in SAL (#wikimedia-operations) [2017-03-16T21:36:23Z] <urandom> T111113: Restarting Cassandra on restbase1007-{b,c} to enable (optional) client encryption

Mentioned in SAL (#wikimedia-operations) [2017-03-16T21:50:33Z] <urandom> T111113: Rolling restarts of Cassandra in codfw, rack 'b'

Mentioned in SAL (#wikimedia-operations) [2017-03-16T22:34:12Z] <urandom> T111113: Rolling restarts of Cassandra in codfw, rack 'a'

Mentioned in SAL (#wikimedia-operations) [2017-03-16T23:24:34Z] <urandom> T111113: Rolling restarts of Cassandra in codfw, rack 'b'

Mentioned in SAL (#wikimedia-operations) [2017-03-16T23:24:42Z] <urandom> T111113: Rolling restarts of Cassandra in codfw, rack 'd' *correction*

Mentioned in SAL (#wikimedia-operations) [2017-03-17T00:03:10Z] <urandom> T111113: Rolling restarts of Cassandra on restbase1010

Mentioned in SAL (#wikimedia-operations) [2017-03-17T00:13:10Z] <urandom> T111113: Rolling restarts of Cassandra on restbase1011

Mentioned in SAL (#wikimedia-operations) [2017-03-17T00:23:07Z] <urandom> T111113: Rolling restarts of Cassandra on restbase1016

Mentioned in SAL (#wikimedia-operations) [2017-03-17T00:34:09Z] <urandom> T111113: Rolling restarts of Cassandra, eqiad, rack 'b'

Mentioned in SAL (#wikimedia-operations) [2017-03-17T01:12:18Z] <urandom> T111113: Rolling restarts of Cassandra, eqiad, rack 'd'

Mentioned in SAL (#wikimedia-operations) [2017-03-17T01:54:40Z] <urandom> T111113: Rolling restarts of Cassandra complete

Change 342903 merged by Filippo Giunchedi:
[operations/puppet] Enable encrypted client connections in RESTBase production

https://gerrit.wikimedia.org/r/342903

Mentioned in SAL (#wikimedia-operations) [2017-03-21T16:11:18Z] <urandom> T111113: Enabling RESTBase client encryption on restbase2001.codfw.wmnet (canary)

Mentioned in SAL (#wikimedia-operations) [2017-03-21T16:17:41Z] <urandom> T111113: Enabling RESTBase client encryption on (remaining) codfw nodes

Mentioned in SAL (#wikimedia-operations) [2017-03-21T16:41:46Z] <urandom> T111113: Rolling restart of RESTBase, codfw, complete

Mentioned in SAL (#wikimedia-operations) [2017-03-21T16:52:06Z] <urandom> T111113: Rolling restart of RESTBase, eqiad

Mentioned in SAL (#wikimedia-operations) [2017-03-21T17:02:21Z] <urandom> T111113: Rolling restart of RESTBase, eqiad, complete

Change 342679 merged by Filippo Giunchedi:
[operations/puppet@production] Enable cqlsh client encryption

https://gerrit.wikimedia.org/r/342679

Mentioned in SAL (#wikimedia-operations) [2017-03-23T16:10:35Z] <urandom> T111113: Live-hacking client encryption to be non-optional, to verify cqlsh encryption, restbase1007-a.eqiad.wmnet

Change 344431 had a related patch set uploaded (by Eevans):
[operations/debs/cassandra-tools-wmf@master] Cope with client encryption when so-enabled

https://gerrit.wikimedia.org/r/344431

Change 344431 merged by Filippo Giunchedi:
[operations/debs/cassandra-tools-wmf@master] Cope with client encryption when so-enabled

https://gerrit.wikimedia.org/r/344431

Change 342904 merged by Filippo Giunchedi:
[operations/puppet@production] Mandatory Cassandra client encryption

https://gerrit.wikimedia.org/r/342904

Mentioned in SAL (#wikimedia-operations) [2017-03-28T16:19:08Z] <urandom> T111113: Restarting Cassandra on restbase2001 to apply mandatory client encryption (canary)

Mentioned in SAL (#wikimedia-operations) [2017-03-28T16:39:20Z] <urandom> T111113: Restarting remaining Cassandra instances, rack 'b', codfw (restbase20{02,07,10})

Mentioned in SAL (#wikimedia-operations) [2017-03-28T17:53:19Z] <urandom> T111113: Restarting Cassandra instances, codfw row 'c'

Mentioned in SAL (#wikimedia-operations) [2017-03-28T18:44:19Z] <urandom> T111113: Restarting Cassandra instances, codfw row 'c' {{done}}

Mentioned in SAL (#wikimedia-operations) [2017-03-28T18:45:39Z] <urandom> T111113: Restarting Cassandra instances, codfw row 'd'

Mentioned in SAL (#wikimedia-operations) [2017-03-28T19:33:13Z] <urandom> T111113: Restarting Cassandra instances, codfw row 'd' {{done}}

Mentioned in SAL (#wikimedia-operations) [2017-03-28T20:21:59Z] <urandom> T111113: Restarting Cassandra instances, eqiad row 'a'

Mentioned in SAL (#wikimedia-operations) [2017-03-28T21:08:18Z] <urandom> T111113: Restarting Cassandra instances, eqiad row 'a' {{done}}

Mentioned in SAL (#wikimedia-operations) [2017-03-28T21:08:29Z] <urandom> T111113: Restarting Cassandra instances, eqiad row 'b'

Mentioned in SAL (#wikimedia-operations) [2017-03-28T21:55:06Z] <urandom> T111113: Restarting Cassandra instances, eqiad row 'b' {{done}}

Mentioned in SAL (#wikimedia-operations) [2017-03-28T21:55:30Z] <urandom> T111113: Restarting Cassandra instances, eqiad row 'd'

Mentioned in SAL (#wikimedia-operations) [2017-03-28T22:45:08Z] <urandom> T111113: Restarting Cassandra instances, eqiad row 'd' {{done}}

Eevans closed this task as "Resolved". Mar 28 2017, 11:29 PM

Done.