
Cassandra inter-node encryption (TLS)
Closed, ResolvedPublic

Description

Cassandra supports TLS encryption of traffic between the nodes of a cluster. We do not currently encrypt inter-node traffic, but in the interest of security we must, as a prerequisite to a multi-DC configuration.

Current status (2015-09-24): internode_encryption: dc (encryption between the eqiad and codfw data centers) is in place. internode_encryption: all (encryption between nodes within each of eqiad and codfw) remains to be done.

See also: T111113

Details

Related Gerrit Patches:
operations/puppet (production): cassandra: enable inter-dc encryption
operations/puppet (production): cassandra: enable ssl_storage_port (7001) in ferm
operations/puppet (production): cassandra: enable DC internode encryption for test cluster
operations/puppet (production): cassandra: install certs and CA from private.git
operations/puppet (production): cassandra: new class ca_manager
operations/puppet (production): certificate/keystore generation script

Related Objects

Event Timeline

Eevans created this task.Aug 13 2015, 3:13 PM
Eevans claimed this task.
Eevans raised the priority of this task from to High.
Eevans updated the task description. (Show Details)
Eevans added subscribers: mark, BBlack, Joe and 6 others.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 13 2015, 3:13 PM
Joe added a comment.Aug 13 2015, 3:17 PM

@Eevans a few questions:

  • how should the certs be created (I mean, what should be the name on the cert)?
  • is there some limitation/recommendation on key lengths, etc?
  • do those certs need to be signed by a CA? if so, I guess we would need to rebuild the java keystore for that. Is that right?

depending on the answers, we might make use of the puppet-generated host certs or not.

Eevans added a comment.EditedAug 13 2015, 9:16 PM

@Eevans a few questions:

  • how should the certs be created (I mean, what should be the name on the cert)?

I don't think this matters (if I understand the question correctly).

  • is there some limitation/recommendation on key lengths, etc?

People typically use the keytool default of 2048 for RSA keys.

  • do those certs need to be signed by a CA? if so, I guess we would need to rebuild the java keystore for that. Is that right?

I don't think this is necessary; TTBMK, Cassandra relies entirely on the explicitly configured truststore (meaning, I don't think we can avoid this step).

Edit: To be clear, I don't think it would be any problem to use CA signed certificates, even if people generally just add the certs they are interested in to the truststore.

depending on the answers, we might make use of the puppet-generated host certs or not.

It sounds like it's possible to import an externally generated key (see for example: http://stackoverflow.com/a/8224863).
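For reference, the usual route for importing an externally generated key into a Java keystore goes through an intermediate PKCS12 container, since keytool cannot import a bare private key directly. A minimal sketch, with illustrative file names and passwords:

```shell
# Generate a private key and self-signed cert with openssl
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout node.key -out node.crt -days 365 \
    -subj "/CN=restbase1001.eqiad/O=WMF"

# Bundle key + cert into a PKCS12 container
openssl pkcs12 -export \
    -in node.crt -inkey node.key \
    -name restbase1001.eqiad \
    -out node.p12 -passout pass:pass123

# Import the PKCS12 bundle into a JKS keystore (requires a JDK's keytool)
if command -v keytool >/dev/null; then
    keytool -importkeystore \
        -srckeystore node.p12 -srcstoretype PKCS12 -srcstorepass pass123 \
        -destkeystore node.ks -deststorepass pass123 -noprompt
fi
```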

Eevans added a comment.EditedAug 13 2015, 9:49 PM

So (for demonstrative purposes only), here is how the process of keystore/truststore generation usually goes:

#!/bin/bash                                                                                             

for dc in eqiad codfw; do
    for i in {1..9}; do
        node=restbase100$i.$dc

        # Generate node-specific keystore with a 2048-bit RSA key (valid 1 year)
        keytool \
            -genkeypair \
            -dname "cn=Eric Evans, ou=Services, o=WMF, c=US" \
            -keyalg RSA \
            -alias $node \
            -validity 365 \
            -keypass qwerty \
            -storepass pass123 \
            -keystore $node.ks

        # Export the certificate generated above, in RFC 1421 format                                    
        keytool \
            -exportcert \
            -rfc \
            -alias $node \
            -file $node.crt \
            -storepass pass123 \
            -keystore $node.ks

        # Import the previously exported cert into the truststore                                       
        keytool \
            -importcert \
            -v \
            -trustcacerts \
            -noprompt \
            -alias $node \
            -file $node.crt \
            -storepass pass123 \
            -keystore truststore
    done
done

And the above would be configured in Cassandra with something like the following (in cassandra.yaml):

...

server_encryption_options:
    internode_encryption: all
    keystore: conf/restbase1001.eqiad.ks
    keystore_password: qwerty
    truststore: conf/truststore
    truststore_password: pass123
    require_client_auth: true

...

looks like we could do the following:

  1. for each host in a given cassandra cluster, generate its public/private keypair, add the public key to the trusted store and commit the result to private.git, distribute via puppet as usual
  2. trust the puppet CA instead and let puppet generate public/private keypair on the host signed by puppet CA. This would work, however we'll need to control which hosts are trusted anyway or any host running puppet will be trusted in theory

The latter is what I'd like to do for the client auth and varnish<->varnish parts of T108580 as well, but one of the outstanding issues there is that our puppet keys are currently 4K RSA, and we'd rather they were 2K RSA. 4K is unnecessary and very, very slow, basically. It affects CPU utilization on the puppetmasters to boot. The question is how much of a PITA it will be to redo puppet keys again.
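The cost difference between the two key sizes is easy to see by timing the private-key (signing) operation, which is the expensive one; 4096-bit keys are several times slower than 2048-bit. A crude, self-contained comparison (file names illustrative):

```shell
# Generate keys of both sizes
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out k2048.pem
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:4096 -out k4096.pem

echo data > payload

# Time one signing operation with each key; the 4096-bit sign is the slow one
time openssl dgst -sha256 -sign k2048.pem -out sig2048 payload
time openssl dgst -sha256 -sign k4096.pem -out sig4096 payload
```

`openssl speed rsa2048 rsa4096` gives a more rigorous ops/sec comparison if you can spare the runtime.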

Eevans updated the task description. (Show Details)Aug 17 2015, 8:23 PM
Eevans set Security to None.
fgiunchedi added a comment.EditedAug 20 2015, 9:21 AM

The latter is what I'd like to do for the client auth and varnish<->varnish parts of T108580 as well, but one of the outstanding issues there is that our puppet keys are currently 4K RSA, and we'd rather them be 2K RSA. 4K is unnecessary and very very slow, basically. It affects CPU utilization on the puppetmasters to boot. The question is how much of a PITA it will be to redo puppet keys again.

4K is puppet's default unless keylength is specified, https://docs.puppetlabs.com/references/latest/configuration.html#keylength (default changed in https://github.com/puppetlabs/puppet/pull/498)
The procedure we followed a couple of months ago is https://wikitech.wikimedia.org/wiki/Puppet_CA_replacement essentially:

  • regenerate puppet CA and the master's certs
  • backup client keys for hosts running backups since they are used for per-host encryption
  • wipe client keys and ask puppet clients to ask for certs again
  • update users of puppet private keys with new keys; users currently include bacula, ipsec, and etcd

There doesn't seem to be any native support for puppet CA rollover. It seems like we'll need to do this anyway if we go down the road of reusing puppet host certs for higher-traffic applications; bigger keys might also impact bacula restore times.

so for cassandra with multiple instances, using puppet certificates means we'd be sharing the host's cert among instances; since each instance is bound to a different IP address, validation might not work because forward/reverse address resolution won't match, for example

Eevans renamed this task from Cassandra internode TLS encryption to Cassandra encryption (TLS).Sep 1 2015, 9:15 PM
Eevans updated the task description. (Show Details)
Eevans renamed this task from Cassandra encryption (TLS) to Cassandra inter-node encryption (TLS).Sep 1 2015, 9:42 PM
Eevans updated the task description. (Show Details)
Eevans added a comment.Sep 1 2015, 9:49 PM

Any word here? Encryption is a dependency for scaling RESTBase into codfw, a quarterly goal for Services (T102306), and time is starting to run short.

I'm happy to work on this if I'm able.

so I don't think we'll be able to reuse puppet certs with multiple instances, thus we'll have to roll our own. since we are going to use multiple cassandra clusters it'd be ideal to keep the keyrings isolated per-cluster. I'm proposing the following:

  • map cassandra cluster -> list of nodes belonging to the cluster (which we need anyway)
  • having the above, generate private+public keys for each host and commit to private.git (via a script)
  • using the same map, assemble per-cluster keyrings, commit in private.git
  • puppet will do file distribution via secret() or similar mechanisms

naming clarifications:

  • the truststore holds the public parts and doesn't change across the cluster (note that "truststore" is a name used by cassandra only; to keytool, only keystores exist)
  • a keystore holds the public/private key pair for a host/instance, and is distributed individually

e.g. on the machine hosting private.git we would run

# assuming we know the mapping cluster -> list of hosts/instances
# assuming passwords are already generated and stored in storepass/c1.pass
# generates keypair for all hosts in keystore/c1/<host>.ks using the respective keystore password
$ cassandra-keytool gen-ks c1 
# assemble all c1 keypairs into truststore/c1.ts
$ cassandra-keytool gen-truststore c1
$ git commit -m "update cassandra keys" keystore truststore

puppet would then, for each host/instance, copy the keystore and truststore via secret() into the proper location under e.g. /etc and configure cassandra accordingly. Ditto for the keystore password, also fetched via secret(). For each host/instance, puppet will also set up monitoring for cert expiration deadlines.

Eevans added a comment.Sep 2 2015, 1:27 PM

so I don't think we'll be able to reuse puppet certs with multiple instances, thus we'll have to roll our own. [ ... ] For each host/instance puppet will also set up monitoring for cert expiration deadlines.

FYI, we have T111113 as well.

For this, I assume that we'll need another key per host (for the client), and PEM copies of all the Cassandra instance certs (to use in the CA array). Additionally, we'll need copies of all the client certs imported into a truststore (either the one above, or another for this purpose), to reference in cassandra.yaml when enabling client encryption.

FYI, we have T111113 as well.
For this, I assume that we'll need another key per host (for the client), and PEM copies of all the Cassandra instance certs (to use in the CA array). Additionally, we'll need copies of all the client certs imported into a truststore (either the one above, or another for this purpose), to reference in cassandra.yaml when enabling client encryption.

good point, we should be taking client auth into consideration too for T76494: Manage cross-DC replication according to network topology

so I don't think we'll be able to reuse puppet certs with multiple instances, thus we'll have to roll our own. [ ... ] For each host/instance puppet will also set up monitoring for cert expiration deadlines.

thinking about it more, this scheme would make cert rollover harder, since we'd need both the new and old certs in the truststore. Using a CA, and having only that in the truststore, might simplify things: upon CA rollover we add the new CA to the truststore and issue certs signed with it. cc @akosiaris @BBlack @faidon for opinions

to recap:

  • we want encryption among cassandra nodes (by "node" I mean a single jvm, multiple instances will mean a key pair for each instance)
    • ideally nodes from different cassandra clusters can't talk to each other
  • this is required to expand into codfw; we can tune cassandra to encrypt only dc-to-dc traffic and leave intra-dc traffic unencrypted (e.g. while we roll this out)
  • cassandra requires a keystore holding a key pair with the node identity and a truststore holding the trusted certificates, manipulated via keytool
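The dc-only mode mentioned in the recap would look roughly like this in cassandra.yaml, mirroring the earlier example (paths and passwords illustrative):

```yaml
server_encryption_options:
    # Encrypt only traffic between data centers; intra-DC traffic stays
    # in the clear (useful as an intermediate step during rollout)
    internode_encryption: dc
    keystore: /path/to/keystore.kst
    keystore_password: password
    truststore: /path/to/truststore
    truststore_password: password
```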

@fgiunchedi, here is the background work I promised; I do not know if this is the only way, but this does demonstrate a working example of using an openssl generated CA cert, the corresponding truststore, and node/instance keystores containing CA-signed client certificates.

For demonstrative purposes only:

$ make truststore
...
$ make node1 node2 node3
...
$ ls 
ca.crt  ca.srl    node1.crt  node2.crt  node3.crt  truststore
ca.key  Makefile  node1.kst  node2.kst  node3.kst
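The Makefile itself isn't shown; under the assumption that its targets drive openssl and keytool, a rough sketch of what `make truststore` and `make node1` presumably do (names and passwords illustrative):

```shell
# "make truststore": create a CA and import its cert into a truststore
openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout ca.key -out ca.crt -days 3650 -subj "/CN=cassandra-ca/O=WMF"
if command -v keytool >/dev/null; then
    keytool -importcert -noprompt -alias ca \
        -file ca.crt -keystore truststore -storepass password
fi

# "make node1": generate a key + CSR, sign the CSR with the CA, then
# bundle key + signed cert into a node keystore
openssl req -newkey rsa:2048 -nodes \
    -keyout node1.key -out node1.csr -subj "/CN=node1/O=WMF"
openssl x509 -req -in node1.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -days 365 -out node1.crt
if command -v keytool >/dev/null; then
    # The keystore should carry the full chain: bundle via PKCS12
    openssl pkcs12 -export -in node1.crt -inkey node1.key \
        -certfile ca.crt -name node1 -out node1.p12 -passout pass:password
    keytool -importkeystore -noprompt \
        -srckeystore node1.p12 -srcstoretype PKCS12 -srcstorepass password \
        -destkeystore node1.kst -deststorepass password
fi
```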

And the corresponding Cassandra configuration:

server_encryption_options:
    internode_encryption: all
    keystore: /path/to/keystore.kst
    keystore_password: password
    truststore: /path/to/truststore
    truststore_password: password
    require_client_auth: true

Let me know if you think this approach will work, and if so, what I can do (put together a proper management script, perhaps?).

thanks @Eevans! I think that'll work. To generalize it on a per-cluster basis, I was thinking along the lines of (all in private.git) a directory for each cassandra cluster, containing subdirectories for:

  • per-server server keystore
  • CA private/public key(s)
  • (in the future) per-client certificates/keystores

one of the missing bits is "configuration", or IOW a manifest of what certs/keystores should be generated for each cluster (so we can in theory generate them unattended).
anyhow, upon adding a machine we'd issue the related cert from the CA, commit, and install it on the machine via secret().
CA rollover can be done by adding a new trusted CA to the truststore (also installed via secret()) and reissuing certs with the new CA (the same should apply to server and client certs).
for labs testing, the above would require a cert for each node/instance; perhaps we can wildcard it and use a single bogus cert/CA?
another open question is what to do with keystore passwords; we're operating on the basis that puppet has access to the files (and it'd need to know the keystore password anyway), and I'm not sure we're encrypting private.git material at rest

Eevans added a comment.Sep 3 2015, 5:07 PM

thanks @Eevans! I think that'll work, basically to generalize it on a per-cluster basis something like:
[ ... ]
another open question is what to do with keystore passwords, we're operating on the basis that puppet has access to the files (and it'd need to know the keystore password anyway) and I'm not sure we're encrypting private.git material at rest

Yeah, good question.

I'm not sure how useful these passwords are from a security perspective. On the Cassandra side of things, the password is in the clear in cassandra.yaml; any attacker that obtains a copy of the keystore there will probably be able to get the passwords too.

thinking about it more, this scheme would make certs rollover harder since we'll need to have both new+old cert in the truststore. using a CA and having only that in the truststore might simplify things, upon CA rollover we add a new CA to the truststore and issue certs signed with the new CA.

Using a custom CA sounds good to me.

As discussed on IRC, keeping the keystore passwords on palladium seems acceptable to me, since they need to be stored in plaintext in cassandra.yaml on every restbase* node anyway.

Change 236389 had a related patch set uploaded (by Eevans):
WIP: certificate/keystore generation script

https://gerrit.wikimedia.org/r/236389

The attached Gerrit is for a script to generate a root CA and signed keystores, based on the contents of a cluster-specific manifest. It's still a work-in-progress, sparse on error handling, and isn't secure in its handling of passwords, but I'm posting it now to get early feedback.

Change 236389 merged by Filippo Giunchedi:
certificate/keystore generation script

https://gerrit.wikimedia.org/r/236389

Change 237377 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: new class ca_manager

https://gerrit.wikimedia.org/r/237377

Change 237397 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: install certs and CA from private.git

https://gerrit.wikimedia.org/r/237397

Change 237377 merged by Filippo Giunchedi:
cassandra: new class ca_manager

https://gerrit.wikimedia.org/r/237377

Change 237397 merged by Filippo Giunchedi:
cassandra: install certs and CA from private.git

https://gerrit.wikimedia.org/r/237397

Change 237648 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: enable DC internode encryption for test cluster

https://gerrit.wikimedia.org/r/237648

Change 237648 merged by Filippo Giunchedi:
cassandra: enable DC internode encryption for test cluster

https://gerrit.wikimedia.org/r/237648

Change 238144 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: enable ssl_storage_port (7001) in ferm

https://gerrit.wikimedia.org/r/238144

Change 238144 merged by Filippo Giunchedi:
cassandra: enable ssl_storage_port (7001) in ferm

https://gerrit.wikimedia.org/r/238144

Change 239794 had a related patch set uploaded (by Filippo Giunchedi):
cassandra: enable inter-dc encryption

https://gerrit.wikimedia.org/r/239794

Change 239794 merged by Filippo Giunchedi:
cassandra: enable inter-dc encryption

https://gerrit.wikimedia.org/r/239794

we're live with inter-dc encryption for cassandra in production and test; we are still missing a way to track expiration of certs/CA, akin to T112542: audit all SSL certificates expiry on ops tracking gcal for public certs
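A minimal building block for the missing expiry tracking could be openssl's `-checkend` flag, which exits non-zero if a cert expires within the given number of seconds. A self-contained sketch (the demo cert stands in for a real one pulled from the CA directory in private.git; a real check would run from the monitoring system):

```shell
# Warn if a certificate expires within 30 days
warn_secs=$((30 * 24 * 3600))

# Create a short-lived demo cert so the snippet is self-contained
openssl req -x509 -newkey rsa:2048 -nodes -keyout demo.key \
    -out demo.crt -days 365 -subj "/CN=demo"

if openssl x509 -checkend "$warn_secs" -noout -in demo.crt; then
    echo "OK: certificate valid for more than 30 days"
else
    echo "WARNING: certificate expires within 30 days"
fi
```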

Eevans updated the task description. (Show Details)Sep 24 2015, 4:30 PM