Move Kafka main to the new intermediate PKI CA
Closed, Resolved (Public)

Description

Hi folks!

In the parent task we worked on a strategy to move Kafka brokers from Puppet-based TLS certs to PKI-based certs (a new intermediate Kafka CA was created for this use case). In T300130 Kafka logging was moved to the new CA successfully, and it would be great to do the same for Kafka main as well.

Below is the rollout plan that I used for Kafka logging:

Find Kafka clients and upgrade their trusted CA settings

The first step is to find Kafka clients and upgrade their settings to trust both the Puppet CA and the Root PKI CA certificates. Thanks to John we have a bundle in various formats on all hosts created by the wmf-certificates package and puppet:

elukey@kafka-logging1001:~$ file /etc/ssl/certs/wmf-ca-certificates.crt
/etc/ssl/certs/wmf-ca-certificates.crt: PEM certificate

# This one needs a hiera setting:
# profile::base::certificates::include_bundle_jks: true
elukey@kafka-logging1001:~$ file /etc/ssl/localcerts/wmf-java-cacerts 
/etc/ssl/localcerts/wmf-java-cacerts: Java KeyStore
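
As a quick sanity check before pointing clients at the bundle, it can help to confirm that the PEM file holds more than one certificate. The following is a minimal Python sketch, assuming only the bundle path shown above; the helper name `split_pem_bundle` is made up for illustration, and counting certificates does not by itself prove which CAs are included.

```python
import re
from typing import List

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text: str) -> List[str]:
    """Return each PEM certificate found in a bundle, in file order."""
    return PEM_CERT_RE.findall(text)


def count_certs_in_file(path: str) -> int:
    """Count the certificates in a PEM bundle on disk."""
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        return len(split_pem_bundle(handle.read()))
```

For example, `count_certs_in_file("/etc/ssl/certs/wmf-ca-certificates.crt")` should return at least 2 if the bundle carries both the Puppet CA and the Root PKI CA.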

Update Kafka settings on all brokers to allow both PKI and Puppet TLS certs at the same time

The only thing needed is the following:

profile::kafka::broker::use_pki_migration_settings: true
profile::base::certificates::include_bundle_jks: true

These settings will update Kafka's super.users setting (basically the TLS CNs that are trusted between brokers), and /etc/ssl/localcerts/wmf-java-cacerts will be deployed by puppet (a JKS truststore with the PKI and Puppet CA certificates).

A rolling restart of all brokers is needed.

Update TLS settings one broker at a time

A hiera host-specific setting should suffice:

profile::kafka::broker::ssl_generate_certificates: true

The above will request a new PKI TLS certificate, deploy it to the node via puppet and update the Kafka settings.

A restart of the affected Kafka broker is needed.
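
After the restart, one way to confirm that the broker now presents a PKI-issued certificate is to connect over TLS and look at the certificate's issuer. The sketch below is an illustration only, using Python's standard `ssl` module. The helper names are hypothetical, and the host name, port and expected issuer are assumptions that have to be filled in for the real cluster. If the listener requires client certificates, the handshake may be rejected before the issuer can be read.

```python
import socket
import ssl
from typing import Optional


def issuer_common_name(cert: dict) -> Optional[str]:
    """Extract the issuer CN from the dict returned by SSLSocket.getpeercert()."""
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def fetch_peer_cert(host: str, port: int, cafile: str) -> dict:
    """Open a verified TLS connection and return the peer certificate as a dict."""
    context = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


# Usage (broker FQDN and TLS port are placeholders, not taken from this task):
#   cert = fetch_peer_cert("<broker-fqdn>", <tls-port>,
#                          "/etc/ssl/certs/wmf-ca-certificates.crt")
#   print(issuer_common_name(cert))
```

Verifying against the combined bundle means the check works both before and after the broker is switched, so the issuer CN is the only thing that changes.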

Clean up

Remove the hiera setting used to allow both PKI and Puppet CA certs:

- profile::kafka::broker::use_pki_migration_settings: true

A rolling restart of all brokers is needed.

Then finally clean up old TLS certificates from puppet private (revoking them too).

What is going to change after the move to PKI?

The only annoyance is that every 6 months we'll need to run the kafka roll restart cookbook to pick up the new TLS certificates, since the PKI ones last 6 months for the moment. This is due to the current version of Kafka that doesn't allow hot reload of keystores: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Renew_TLS_certificate
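
To help plan that periodic restart, the remaining lifetime of a certificate can be computed from its `notAfter` field. This is a minimal sketch using Python's standard `ssl.cert_time_to_seconds`; the function name `days_until_expiry` is hypothetical, and how the `notAfter` string is obtained (for example from `SSLSocket.getpeercert()`) is left to the caller.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days left before a certificate expires.

    `not_after` uses the format of getpeercert()["notAfter"],
    e.g. "Jan  5 09:34:43 2018 GMT". A negative result means the
    certificate has already expired.
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

A monitoring check could, for instance, warn when the value drops below some threshold, leaving enough time to run the roll restart cookbook before the old certificate expires.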

Event Timeline

Change 901118 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::purge: move purged to a new CA bundle

https://gerrit.wikimedia.org/r/901118

Change 901118 merged by Elukey:

[operations/puppet@production] profile::cache::purge: move purged to a new CA bundle

https://gerrit.wikimedia.org/r/901118

Mentioned in SAL (#wikimedia-operations) [2023-03-21T08:31:47Z] <elukey> move purged daemons on cp nodes to a new CA bundle (to allow accepting kafka clients using PKI tls certs) - T319372

Change 901547 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::mirror: default to use pki migration settings

https://gerrit.wikimedia.org/r/901547

Change 901551 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::main: deploy PKI migration settings

https://gerrit.wikimedia.org/r/901551

Change 901547 merged by Elukey:

[operations/puppet@production] profile::kafka::mirror: default to use pki migration settings

https://gerrit.wikimedia.org/r/901547

Mentioned in SAL (#wikimedia-operations) [2023-03-21T13:05:47Z] <elukey> move kafka mirror maker instances to PKI migration settings (new truststores) - T319372

Change 901551 merged by Elukey:

[operations/puppet@production] role::kafka::main: deploy PKI migration settings

https://gerrit.wikimedia.org/r/901551

Mentioned in SAL (#wikimedia-operations) [2023-03-30T08:55:36Z] <elukey> move kafka main clusters to new truststore (PKI+Puppet root CA certs) - T319372

Change 904667 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Switch kafka-main1001 broker's TLS cert to PKI

https://gerrit.wikimedia.org/r/904667

All brokers have the new truststore, so they can validate certs emitted by PKI. Next steps:

  1. Upgrade kafka-main1001 to PKI, and monitor if any client fails to connect.
  2. Roll out the certs everywhere.

Change 904667 merged by Elukey:

[operations/puppet@production] Switch kafka-main1001 broker's TLS cert to PKI

https://gerrit.wikimedia.org/r/904667

Mentioned in SAL (#wikimedia-operations) [2023-04-03T08:29:04Z] <elukey> move kafka-main1001's kafka broker to PKI - T319372

Change 905251 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Upgrade kafka-main to use PKI TLS certificates for brokers

https://gerrit.wikimedia.org/r/905251

Change 905251 merged by Elukey:

[operations/puppet@production] Upgrade kafka-main to use PKI TLS certificates for brokers

https://gerrit.wikimedia.org/r/905251

Mentioned in SAL (#wikimedia-operations) [2023-04-05T08:07:11Z] <elukey> restart kafka on kafka-main1002 to pick up the new TLS certificate (PKI based) - T319372

Mentioned in SAL (#wikimedia-operations) [2023-04-05T09:35:31Z] <elukey> restart kafka on kafka-main1003 to pick up the new TLS certificate (PKI based) - T319372

Mentioned in SAL (#wikimedia-operations) [2023-04-05T13:52:10Z] <elukey> restart kafka on kafka-main1004 to pick up the new TLS certificate (PKI based) - T319372

Mentioned in SAL (#wikimedia-operations) [2023-04-05T14:33:37Z] <elukey> restart kafka on kafka-main1005 to pick up the new TLS certificate (PKI based) - T319372

Mentioned in SAL (#wikimedia-operations) [2023-04-06T09:30:33Z] <elukey> kafka main codfw cluster migrated to PKI TLS certs for brokers - T319372

Last steps:

  • clean up certs in puppet private
  • verify if any change is needed in deployment-prep

Mentioned in SAL (#wikimedia-operations) [2023-04-11T13:54:58Z] <elukey> remove old puppet certificates for kafka main brokers from A:kafka-main - T319372

Mentioned in SAL (#wikimedia-operations) [2023-04-11T14:00:27Z] <claime> Revoking kafka_main-codfw_broker and kafka_main-eqiad_broker puppet CA certs - T319372

Final step - check if we have to migrate deployment-prep or not. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/905954, some hiera settings may need to be added if we want to keep the old puppet cert format.

elukey claimed this task.

Deployment-prep may be migrated in the future, not in scope for this task. Finally closing!