
Replace cassandra-ca-manager with PKI
Closed, Resolved (Public)

Description

We use cergen for certificate generation in the private repo in most places. We currently still use cassandra-ca-manager for Cassandra and should migrate to using cergen to keep in step with other projects.

Via T329951: Replace expiring Cassandra TLS certificates (restbase[1019-1027]):

"...we could try to use PKI for Cassandra? It would make the cert renewal process less tedious for sure, puppet would take care of most of the burden..." -- @elukey

"...we have an implementation of cfssl (see more details in https://wikitech.wikimedia.org/wiki/PKI/CA_Operations), I am proposing to add a new intermediate (like we did also for Kafka brokers) and use puppet to request certificates from it when needed (like when they are expiring etc..). We'd need to study a way to migrate the clusters over the new certs (and also clients using TLS, if any) but it should be doable :)" -- @elukey

Details

Repo               Branch      Lines +/-
operations/puppet  production  +1 -2
operations/puppet  production  +3 -2
operations/puppet  production  +25 -9
operations/puppet  production  +25 -16
operations/puppet  production  +0 -1
operations/puppet  production  +16 -8
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +21 -6
operations/puppet  production  +4 -0
operations/puppet  production  +9 -7
operations/puppet  production  +4 -0
labs/private       master      +1 -2
operations/puppet  production  +1 -0
operations/puppet  production  +15 -0
operations/puppet  production  +127 -71
operations/puppet  production  +7 -0
operations/puppet  production  +22 -0
operations/puppet  production  +1 -0

Event Timeline

I thought that cergen was being deprecated in favour of the new cfssl-based PKI module.

We've recently concluded a project to roll out these certificates to Kafka brokers (e.g. T300130), so it would make sense to me if this system were used for Cassandra as well.

It's just a thought. I think that @jbond knows most about it and @elukey has a lot of experience rolling it out to Kafka recently. I've used it to a smaller extent with Hive.

@BTullis thanks, and yes, I agree: any new migrations should go directly to pki.discovery.wmnet. Happy to help.

Thanks @jbond - Do you think it would be better to create a new intermediary for Cassandra, similar to the way we did for Kafka?

Yes exactly, see https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Adding_a_new_intermediate

The tricky bit is making sure that clients support the Root PKI CA, but I agree that it would be a great improvement for Cassandra!

Eevans renamed this task from Replace cassandra-ca-manager with cergen to Replace cassandra-ca-manager with PKI. Feb 23 2023, 4:59 PM
Eevans triaged this task as Medium priority.
Eevans updated the task description.

Change 931264 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pki::root_ca: add intermediate for Cassandra

https://gerrit.wikimedia.org/r/931264

Change 931264 merged by Elukey:

[operations/puppet@production] profile::pki::root_ca: add intermediate for Cassandra

https://gerrit.wikimedia.org/r/931264

Change 931267 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::pki::intermediates: add the cassandra public certificate

https://gerrit.wikimedia.org/r/931267

Change 931267 merged by Elukey:

[operations/puppet@production] profile::pki::intermediates: add the cassandra public certificate

https://gerrit.wikimedia.org/r/931267

Change 931272 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki::multirootca: add the cassandra intermediate

https://gerrit.wikimedia.org/r/931272

Change 931272 merged by Elukey:

[operations/puppet@production] role::pki::multirootca: add the cassandra intermediate

https://gerrit.wikimedia.org/r/931272

Created the new intermediate certificate CA called cassandra in our PKI infra.

Change 931276 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra: add initial support for PKI TLS certs to 4.x

https://gerrit.wikimedia.org/r/931276

Change 931292 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_cache::storage: enable PKI tls certs

https://gerrit.wikimedia.org/r/931292

The idea of https://gerrit.wikimedia.org/r/c/operations/puppet/+/931276 is to allow the usage of PKI-based keystores, to replace the custom ones that we build for cassandra.

In order to support the transition from custom CA to PKI, we need to provide each node with a trust-store that can verify PKI and custom CA certs, so we leverage the profile::base::certificate config to build a custom bundle in a java trust store.
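The transition bundle described above can be sketched roughly as follows. This is only an illustration, assuming both CA certificates are available as PEM files: the function names, paths, and aliases are hypothetical, and in this setup the trust store is built by Puppet rather than by a script like this. Each CA is imported under its own alias so that the result does not depend on how keytool treats a file holding several certificates.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Puppet code discussed in this task.

# Concatenate PEM CA certificates into a single bundle file.
# Usage: build_ca_bundle OUTPUT CA_PEM...
build_ca_bundle() {
    bundle_out=$1
    shift
    : > "$bundle_out"
    for bundle_ca in "$@"; do
        # Fail early instead of silently writing a partial bundle.
        if [ ! -r "$bundle_ca" ]; then
            echo "cannot read $bundle_ca" >&2
            return 1
        fi
        cat "$bundle_ca" >> "$bundle_out"
        # Keep PEM blocks on separate lines if the file lacks a final newline.
        if [ -n "$(tail -c 1 "$bundle_ca")" ]; then
            echo >> "$bundle_out"
        fi
    done
    return 0
}

# Count the certificates contained in a PEM file.
count_certs() {
    grep -c 'BEGIN CERTIFICATE' "$1"
}

# Import one CA certificate into a Java trust store under its own alias.
# Usage: import_ca ALIAS CA_PEM TRUSTSTORE STOREPASS
import_ca() {
    keytool -importcert -noprompt -trustcacerts \
        -alias "$1" -file "$2" -keystore "$3" -storepass "$4"
}

# Example (all paths and aliases are placeholders):
#   build_ca_bundle /tmp/bundle.pem pki_root.pem cassandra_custom_ca.pem
#   import_ca pki-root pki_root.pem /tmp/truststore.jks changeit
#   import_ca custom-ca cassandra_custom_ca.pem /tmp/truststore.jks changeit
```

With both CAs in the trust store, a node accepts peers presenting a certificate from either CA, which is what makes a node-by-node migration possible.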

Change 931903 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_cache::storage: use pki truststore

https://gerrit.wikimedia.org/r/931903

Filed all changes. To upgrade a cluster, this is the idea:

  1. A new truststore containing the Root PKI cert and the Custom CA cert is rolled out to all nodes of a Cassandra cluster.
  2. We upgrade one node at a time, adding its new PKI-based keystore, to verify that everything works correctly.
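The one-node-at-a-time step can be pictured as a loop that stops at the first problem. The sketch below is an assumption about how such a loop might look, not the procedure that was actually run: `rolling_apply` is a hypothetical helper, and the upgrade and health-check commands are supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: apply a change to nodes one at a time and stop
# at the first failure, so a bad keystore never reaches the whole cluster.
# Usage: rolling_apply UPGRADE_CMD CHECK_CMD NODE...
rolling_apply() {
    roll_upgrade=$1
    roll_check=$2
    shift 2
    for roll_node in "$@"; do
        echo "upgrading $roll_node"
        if ! "$roll_upgrade" "$roll_node"; then
            echo "upgrade failed on $roll_node, stopping" >&2
            return 1
        fi
        # Only move on once the node is confirmed healthy again.
        if ! "$roll_check" "$roll_node"; then
            echo "health check failed on $roll_node, stopping" >&2
            return 1
        fi
    done
    return 0
}

# Example with placeholder commands (define these for your environment):
#   enable_pki_keystore() { ssh "$1" 'sudo run-my-config-management'; }
#   node_is_healthy()     { ssh "$1" 'nodetool status >/dev/null'; }
#   rolling_apply enable_pki_keystore node_is_healthy node1 node2 node3
```

Stopping on the first failed check leaves the remaining nodes on their old keystore, which the shared truststore from step 1 still accepts.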

Change 931276 merged by Elukey:

[operations/puppet@production] cassandra: add initial support for PKI TLS certs to 4.x

https://gerrit.wikimedia.org/r/931276

Change 931903 merged by Elukey:

[operations/puppet@production] role::ml_cache::storage: use pki truststore

https://gerrit.wikimedia.org/r/931903

Change 931292 merged by Elukey:

[operations/puppet@production] role::ml_cache::storage: enable PKI tls certs

https://gerrit.wikimedia.org/r/931292

Change 932378 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cassandra: add hiera option for the TLS keystore password

https://gerrit.wikimedia.org/r/932378

Change 932379 had a related patch set uploaded (by Elukey; author: Elukey):

[labs/private@master] role::ml_cache::storage: add fake TLS keystore password for PKI

https://gerrit.wikimedia.org/r/932379

Change 932379 merged by Elukey:

[labs/private@master] role::ml_cache::storage: add fake TLS keystore password for PKI

https://gerrit.wikimedia.org/r/932379

Change 932381 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra: allow to set the keystore password for 4.x

https://gerrit.wikimedia.org/r/932381

Change 932381 merged by Elukey:

[operations/puppet@production] cassandra: allow to set the keystore password for 4.x

https://gerrit.wikimedia.org/r/932381

Change 932378 merged by Elukey:

[operations/puppet@production] profile::cassandra: add hiera option for the TLS keystore password

https://gerrit.wikimedia.org/r/932378

Change 932384 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra: allow to update the keystore password - part two

https://gerrit.wikimedia.org/r/932384

Change 932384 merged by Elukey:

[operations/puppet@production] cassandra: allow to update the keystore password - part two

https://gerrit.wikimedia.org/r/932384

The ml-cache clusters are running with PKI!

I have updated https://wikitech.wikimedia.org/wiki/Cassandra#Installing_and_generating_certificates to reflect the new use case.

Recap of what I've done to upgrade ml-cache:

  • I have rolled out a new truststore via profile::base::certificates that includes the PKI root CA cert and the custom CA root cert (created by cassandra-ca-manager) in a single bundle. The idea is that, at any given time, a Cassandra node can validate TLS connections that use certificates issued by either CA. All cassandra nodes have been restarted to pick up the new settings.
  • I enabled PKI for Cassandra and updated the nodes one by one. During the migration half of the nodes were using certs issued by cassandra-ca-manager and the other half certs issued by PKI (and materialized on the nodes via puppet runs), but the clusters were running fine.

Things left to verify/fix:

  1. The TLS certs issued by the PKI cassandra intermediate CA are valid for one month, so our alerting complains that they are about to expire. We should figure out a way to fix this. Since Cassandra 4+ can automatically pick up new keystores I would not extend their validity, but we can definitely do it if needed (for Kafka we use 6 months, but the version that we run doesn't pick up new keystores without restarts, so a one-month validity would have been annoying).
  2. We should verify in a month that Cassandra 4 indeed picks up new keystores without a restart. We can use the ml-cache use case to verify.
  3. We should think about Cassandra clients that encrypt their connections to Cassandra nodes, since they will need to support PKI-based certs too. In theory the only change needed is to make them use a bundle that contains the PKI root CA certificate (provided on all nodes by profile::base::certificate, so easy enough).
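For the first item, the alerting problem comes down to the warning window being too long for a one-month certificate. A minimal sketch of an expiry check with a configurable window, using `openssl x509 -checkend`, is shown here; the function names and the example window are assumptions and are not the values used by the actual monitoring.

```shell
#!/bin/sh
# Hypothetical helpers illustrating an expiry check with a short window.

# Convert a number of days to seconds, the unit expected by -checkend.
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Return 0 if the certificate in PEM file $1 expires within $2 days.
# Note: an unreadable file also makes openssl fail, so it is reported
# as "expiring" here; a real check should tell the two cases apart.
cert_expires_within_days() {
    if openssl x509 -in "$1" -noout \
        -checkend "$(days_to_seconds "$2")" >/dev/null 2>&1; then
        return 1    # still valid once the window has passed
    fi
    return 0
}

# Example (path and window are placeholders):
#   if cert_expires_within_days /path/to/instance-cert.pem 7; then
#       echo "certificate expires in less than 7 days"
#   fi
```

With certificates renewed well before the end of their one-month lifetime, a window of a few days should only fire when renewal has actually stalled.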

Change 932413 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra: add support for shorter TLS cert expiry checks

https://gerrit.wikimedia.org/r/932413

Change 932413 merged by Elukey:

[operations/puppet@production] cassandra: add support for shorter TLS cert expiry checks

https://gerrit.wikimedia.org/r/932413

Change 932424 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance::monitoring: fix tcp alert

https://gerrit.wikimedia.org/r/932424

Change 932424 merged by Elukey:

[operations/puppet@production] cassandra::instance::monitoring: fix tcp alert

https://gerrit.wikimedia.org/r/932424

Change 932427 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance::monitoring: move alerts to prometheus

https://gerrit.wikimedia.org/r/932427

Change 932795 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance::monitoring: remove wrong servername

https://gerrit.wikimedia.org/r/932795

Change 932795 merged by Elukey:

[operations/puppet@production] cassandra::instance::monitoring: remove wrong servername

https://gerrit.wikimedia.org/r/932795

Change 932799 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance::monitoring: add 'cassandra' as servername

https://gerrit.wikimedia.org/r/932799

Change 932801 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance: add CN:cassandra to all PKI certs

https://gerrit.wikimedia.org/r/932801

Change 932801 merged by Elukey:

[operations/puppet@production] cassandra::instance: add CN:cassandra to all PKI certs

https://gerrit.wikimedia.org/r/932801

Change 932799 merged by Elukey:

[operations/puppet@production] cassandra::instance::monitoring: add 'cassandra' as servername

https://gerrit.wikimedia.org/r/932799

Some notes:

  • Hot reloading of the keystore seems to work according to the logs (namely, I see some indication that the new file is recognized and loaded), but if I inspect the cert via openssl I see the old one and not the new one. Not sure why, I'll keep investigating.
  • The prometheus blackbox tcp probes accept only a single SNI server name, so we added a cassandra SAN to allow this use case. The new alerts are working fine.
  • The current CN of every TLS certificate is the fqdn of the host, not the fqdn of the Cassandra instance. This is problematic because a client with host verification enabled may fail: it will check the cert's CN against the Cassandra instance fqdn, not the host one. I didn't find any trace of the instance fqdns in Puppet, so we may need to add them to do things properly.

@Eevans thoughts about the last one?

Change 932427 abandoned by Elukey:

[operations/puppet@production] cassandra::instance::monitoring: move alerts to prometheus

Reason:

the TLS part needs more work, will file a separate change only for cql

https://gerrit.wikimedia.org/r/932427

Change 933134 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance::monitoring: move cql check to Prometheus for PKI

https://gerrit.wikimedia.org/r/933134

Change 933224 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cassandra::instance: use the instance's fqdn as TLS PKI CN

https://gerrit.wikimedia.org/r/933224

Change 933134 merged by Elukey:

[operations/puppet@production] cassandra::instance::monitoring: move cql check to Prometheus for PKI

https://gerrit.wikimedia.org/r/933134

https://gerrit.wikimedia.org/r/933224

Change 933224 merged by Elukey:

[operations/puppet@production] cassandra::instance: use the instance's fqdn as TLS PKI CN

https://gerrit.wikimedia.org/r/933224

I rolled out the new certificate to have a different CN (ml-cache1001-a.eqiad.wmnet vs ml-cache1001.eqiad.wmnet) and I see the following in the logs:

INFO  [ScheduledTasks:1] 2023-06-27 10:14:28,302 SSLFactory.java:204 - SSL certificates have been updated for org.apache.cassandra.config.EncryptionOptions$ServerEncryptionOptions. Resetting the ssl contexts for new connections.
INFO  [ScheduledTasks:1] 2023-06-27 10:14:28,334 SSLFactory.java:204 - SSL certificates have been updated for org.apache.cassandra.config.EncryptionOptions. Resetting the ssl contexts for new connections.

That is good, but only the cql cert is updated, not the one used by cassandra instances to talk to each other:

elukey@ml-cache1002:~$ echo y | openssl s_client -connect ml-cache1002-a.eqiad.wmnet:7001 2>&1 | grep "s:CN"
 0 s:CN = ml-cache1002.eqiad.wmnet
elukey@ml-cache1002:~$ echo y | openssl s_client -connect ml-cache1002-a.eqiad.wmnet:9042 2>&1 | grep "s:CN"
 0 s:CN = ml-cache1002-a.eqiad.wmnet

If I restart the instance the new cert is picked up:

elukey@ml-cache1002:~$ echo y | openssl s_client -connect ml-cache1002-a.eqiad.wmnet:7001 2>&1 | grep "s:CN"
 0 s:CN = ml-cache1002-a.eqiad.wmnet

This is very problematic since we'll need to roll-restart cassandra instances, so the hot reload doesn't fully work? @Eevans am I missing something?

Change 933465 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::ml_cache::storage: remove legacy_ssl_storage_port_enabled setting

https://gerrit.wikimedia.org/r/933465

Change 933465 merged by Elukey:

[operations/puppet@production] role::ml_cache::storage: remove legacy_ssl_storage_port_enabled setting

https://gerrit.wikimedia.org/r/933465

After setting legacy_ssl_storage_port_enabled to false I've cleaned up tls files on a node, ran puppet and the new cert was picked up by both ports correctly! \o/
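The manual openssl checks used earlier in this thread can be wrapped into a small script that compares the certificate CN served on each port with the expected instance name. This is a sketch only: the helper names are hypothetical, the parsing assumes the `0 s:... CN = name` line format shown in the transcripts, and the ports to check are passed by the caller.

```shell
#!/bin/sh
# Hypothetical helpers built around the openssl s_client checks above.

# Extract the leaf certificate CN from `openssl s_client` output on stdin.
leaf_cn() {
    sed -n 's|^ *0 s:.*CN *= *\([^,/ ]*\).*|\1|p' | head -n 1
}

# Print the CN of the certificate served on HOST:PORT.
served_cn() {
    echo y | openssl s_client -connect "$1:$2" 2>&1 | leaf_cn
}

# Check that every given port serves a certificate whose CN is INSTANCE.
# Usage: check_instance_ports INSTANCE PORT...
check_instance_ports() {
    chk_instance=$1
    shift
    chk_status=0
    for chk_port in "$@"; do
        chk_cn=$(served_cn "$chk_instance" "$chk_port")
        if [ "$chk_cn" = "$chk_instance" ]; then
            echo "OK   $chk_instance:$chk_port CN=$chk_cn"
        else
            echo "FAIL $chk_instance:$chk_port CN=${chk_cn:-none} (expected $chk_instance)"
            chk_status=1
        fi
    done
    return "$chk_status"
}

# Example, using the instance name and ports from the transcripts:
#   check_instance_ports ml-cache1002-a.eqiad.wmnet 7001 9042
```

Running it after a certificate renewal shows at a glance whether any listener is still serving the previous certificate.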

elukey claimed this task.

OK, so at this point PKI support is complete; we just need to figure out whether we want to migrate clusters or not.

Wrong action, let's figure out if we want to migrate other clusters first :)

@Eevans Do you think that we could migrate AQS and Session store to PKI? If so I can open new tasks and propose a plan :)

That would be great; Please do!