
Allow kafka clients to verify brokers hostnames when using SSL
Closed, Resolved (Public)

Description

Kafka brokers have their certificates generated with a common CN per cluster, e.g. kafka_main-eqiad_broker.
This prevents clients from verifying the hostname of the brokers they connect to:

openssl s_client -CAfile /etc/ssl/certs/ca-certificates.crt -verify_hostname kafka-main2001.codfw.wmnet kafka-main2001.codfw.wmnet:9093 <<< "Q"
CONNECTED(00000003)
depth=0 CN = kafka_main-codfw_broker
verify error:num=62:Hostname mismatch
verify return:1
[...]

Regarding existing clients:

  • kafka-python requires hostname verification to be explicitly disabled, cf. P17333
  • librdkafka has hostname verification disabled by default, but it can be enabled by setting ssl.endpoint.identification.algorithm to HTTPS, cf. P17334
  • Java clients have not been tested
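
For illustration, here is a minimal sketch of what an enforcing client setup could look like in Python, once brokers present a certificate matching their hostname. It is not the content of P17333: the helper names are hypothetical, and the only kafka-python options assumed are bootstrap_servers, security_protocol and ssl_context. The standard library ssl context does the hostname check.

```python
import ssl
from typing import Optional


def build_verifying_context(cafile: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context that checks the broker hostname."""
    # create_default_context() already enables certificate and hostname
    # verification; both are set explicitly here to make the intent visible.
    ctx = ssl.create_default_context(cafile=cafile)
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.check_hostname = True
    return ctx


def kafka_python_tls_kwargs(bootstrap: str, cafile: Optional[str] = None) -> dict:
    """Hypothetical helper: keyword arguments for a kafka-python client."""
    return {
        "bootstrap_servers": bootstrap,
        "security_protocol": "SSL",
        # When ssl_context is given, kafka-python uses it as-is.
        "ssl_context": build_verifying_context(cafile),
    }
```

The resulting dict could be passed as KafkaConsumer(**kwargs). With the shared per-cluster CN described above, such a client would fail the handshake with a hostname mismatch, which is the behaviour this task aims to make usable.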


Details

Repo                Branch      Lines +/-
operations/puppet   production  +32 -1
operations/puppet   production  +2 -1
operations/puppet   production  +9 -3
operations/puppet   production  +3 -1
operations/puppet   production  +1 -1
operations/puppet   production  +4 -1
analytics/refinery  master      +2 -0
operations/puppet   production  +1 -1
operations/puppet   production  +4 -0
operations/puppet   production  +4 -0
operations/puppet   production  +7 -1
operations/puppet   production  +3 -3
operations/puppet   production  +0 -1
operations/puppet   production  +24 -18
operations/puppet   production  +20 -10
operations/puppet   production  +1 -1
operations/puppet   production  +1 -1
operations/puppet   production  +1 -1
operations/puppet   production  +1 -1
operations/puppet   production  +9 -0
operations/puppet   production  +24 -14
operations/puppet   production  +7 -6
operations/puppet   production  +1 -1
operations/puppet   production  +3 -1
operations/puppet   production  +9 -10
operations/puppet   production  +21 -4
operations/puppet   production  +20 -14
operations/puppet   production  +8 -6
operations/puppet   production  +5 -8
operations/puppet   production  +34 -25
operations/puppet   production  +2 -0
operations/puppet   production  +18 -11
operations/puppet   production  +6 -4
operations/puppet   production  +110 -0
operations/puppet   production  +3 -0
operations/puppet   production  +61 -22
operations/puppet   production  +2 -0
operations/puppet   production  +22 -0
operations/puppet   production  +1 -0
operations/puppet   production  +9 -1
operations/puppet   production  +5 -11

Event Timeline


Change 735385 merged by Elukey:

[operations/puppet@production] role::pki::multirootca: add kafka intermediate CA config

https://gerrit.wikimedia.org/r/735385

Change 735565 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: add new Kafka PKI intermediate CA option

https://gerrit.wikimedia.org/r/735565

Change 735566 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::test::broker: use PKI Kafka TLS certificates

https://gerrit.wikimedia.org/r/735566

To recap the next steps:

  • Add the cfssl CA cert to the base truststore of all jvms (this IIUC is already done, but John lemme know if it is not)

I thought it was, but I just checked and it doesn't seem to be; will fix.

This has now been added.

Just realized that this may need a rolling restart of all the Kafka clusters to allow their JVMs to pick up the new truststore (IIRC it is not done automagically).

Change 735565 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: add new Kafka PKI intermediate CA option

https://gerrit.wikimedia.org/r/735565

Change 735566 merged by Elukey:

[operations/puppet@production] role::kafka::test::broker: use PKI Kafka TLS certificates

https://gerrit.wikimedia.org/r/735566

A lot of progress today with @jbond; here's a summary:

  • The new keystore contains the intermediate CA crt as we needed, tested with openssl s_client.
  • All the brokers of our clusters currently trust only the puppet CA, since they are configured with a truststore and this means that they don't look into the system cacert bundle.
  • We should create a new truststore with the Puppet CA and the Root PKI (not sure if we also need the intermediate in there), and distribute it to all brokers and clients.
  • After this, we should be able to flip one broker at a time without running into validation issues.

Change 736765 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add sslcert::trusted_root_ca

https://gerrit.wikimedia.org/r/736765

Change 736785 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: add truststore for pki-based tls certs

https://gerrit.wikimedia.org/r/736785

Change 736786 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Enable PKI TLS certificates for kafka-test1006

https://gerrit.wikimedia.org/r/736786

Change 736805 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: allow to override super_users

https://gerrit.wikimedia.org/r/736805

Change 736765 merged by Elukey:

[operations/puppet@production] Add sslcert::trusted_root_ca

https://gerrit.wikimedia.org/r/736765

Change 736785 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: add truststore for pki-based tls certs

https://gerrit.wikimedia.org/r/736785

Change 736805 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: allow to override super_users

https://gerrit.wikimedia.org/r/736805

Change 736786 merged by Elukey:

[operations/puppet@production] Enable PKI TLS certificates for kafka-test1006

https://gerrit.wikimedia.org/r/736786

Change 736991 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: add conditions to truststore deployments

https://gerrit.wikimedia.org/r/736991

Change 736991 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: add conditions to truststore deployments

https://gerrit.wikimedia.org/r/736991

Change 737055 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: use a jks truststore for PKI TLS certs

https://gerrit.wikimedia.org/r/737055

Change 737055 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: use a jks truststore for PKI TLS certs

https://gerrit.wikimedia.org/r/737055

Change 737091 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::mirror: add settings to support the migration to PKI

https://gerrit.wikimedia.org/r/737091

Change 737095 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] sslcert::trusted_ca: check if the bundle .pem is defined

https://gerrit.wikimedia.org/r/737095

Finally some progress!

Some notes:

  • We now have a generic define in Puppet to create .p12/.jks truststores containing the Puppet CA and Root PKI bundle.
  • The steps to complete before updating a given Kafka cluster should be:
    • Replace all occurrences of the truststore used by brokers with the one created by the aforementioned define, so that brokers can accept TLS certificates signed by either CA. This needs a cluster restart.
    • Replace all occurrences of the truststore for Kafka Mirror Maker instances (if used).
    • Change the super.users list in all broker configs to allow every CN:hostname of the cluster (in addition to the current CN:clustername). This allows swapping TLS certs on brokers one at a time, and it requires a cluster restart.
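
The super.users step can be sketched as follows. This is an illustration, not the Puppet code that was merged: the helper name is hypothetical, and it assumes that each broker principal is simply User:CN=<name>, which depends on the certificate subject and on any principal mapping rules configured on the brokers. Kafka separates super.users entries with semicolons.

```python
from typing import Iterable


def build_super_users(cluster_cn: str, broker_hostnames: Iterable[str]) -> str:
    """Build a super.users line that keeps the shared cluster CN and adds one
    principal per broker hostname (assumes principals look like User:CN=<name>)."""
    principals = [f"User:CN={cluster_cn}"]
    for host in broker_hostnames:
        principal = f"User:CN={host}"
        if principal not in principals:  # skip duplicates, keep order
            principals.append(principal)
    return "super.users=" + ";".join(principals)
```

For example, build_super_users("kafka_main-codfw_broker", ["kafka-main2001.codfw.wmnet"]) yields a line listing both the old shared principal and the per-host one, so a broker keeps its super-user rights whichever certificate it currently presents.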

Change 737095 abandoned by Elukey:

[operations/puppet@production] sslcert::trusted_ca: check if the bundle .pem is defined

Reason:

https://gerrit.wikimedia.org/r/737095

Change 737091 abandoned by Elukey:

[operations/puppet@production] profile::kafka::mirror: add settings to support the migration to PKI

Reason:

https://gerrit.wikimedia.org/r/737091

Change 737403 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::base::certificates: add sslcert::trusted_ca options

https://gerrit.wikimedia.org/r/737403

Change 737408 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:trafficserver::backend: use ca provided by P:base::certificates

https://gerrit.wikimedia.org/r/737408

Change 737470 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: move to profile::base::certificates for pki

https://gerrit.wikimedia.org/r/737470

Change 737403 merged by Elukey:

[operations/puppet@production] profile::base::certificates: add sslcert::trusted_ca options

https://gerrit.wikimedia.org/r/737403

Change 737470 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: move to profile::base::certificates for pki

https://gerrit.wikimedia.org/r/737470

Change 737644 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:certificates: add defaults to cloud

https://gerrit.wikimedia.org/r/737644

Change 737644 merged by Jbond:

[operations/puppet@production] P:certificates: add defaults to cloud

https://gerrit.wikimedia.org/r/737644

Change 737645 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] sslcert::trusted_ca: fix file title

https://gerrit.wikimedia.org/r/737645

Change 737645 merged by Elukey:

[operations/puppet@production] sslcert::trusted_ca: fix file title

https://gerrit.wikimedia.org/r/737645

Change 737652 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] sslcert::trusted_ca: add explicit ordering for jks

https://gerrit.wikimedia.org/r/737652

Change 737652 merged by Elukey:

[operations/puppet@production] sslcert::trusted_ca: add explicit ordering for jks

https://gerrit.wikimedia.org/r/737652

Change 737661 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::mirror: add support for PKI-enabled truststore

https://gerrit.wikimedia.org/r/737661

Change 737661 merged by Elukey:

[operations/puppet@production] profile::kafka::mirror: add support for PKI-enabled truststore

https://gerrit.wikimedia.org/r/737661

Change 737672 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::base::certificates: add truststore password

https://gerrit.wikimedia.org/r/737672

Change 737672 merged by Elukey:

[operations/puppet@production] profile::base::certificates: add truststore password

https://gerrit.wikimedia.org/r/737672

Change 737711 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Enable PKI-based TLS certificate for kafka-test1006

https://gerrit.wikimedia.org/r/737711

Change 737711 merged by Elukey:

[operations/puppet@production] Enable PKI-based TLS certificate for kafka-test1006

https://gerrit.wikimedia.org/r/737711

Change 737920 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-test1006 to PKI-based broker certs

https://gerrit.wikimedia.org/r/737920

Change 737920 merged by Elukey:

[operations/puppet@production] Move kafka-test1006 to PKI-based broker certs

https://gerrit.wikimedia.org/r/737920

Finally kafka-test1006 is running with a certificate issued by the PKI kafka intermediate CA, and the rest of the cluster works fine as well (still on Puppet-based certs). Mirror Maker seems to be working fine with the new truststore (accepting both Puppet CA and PKI Root CA certs).

The procedure we came up with seems to work, but before starting to move clusters we'll need to work on clients, to move them to the new bundle/truststore.

Change 737923 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kafkatee:instance: change TLS CA bundle

https://gerrit.wikimedia.org/r/737923

Change 737931 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka::webrequest: move atskafka to new CA bundle

https://gerrit.wikimedia.org/r/737931

Change 737923 merged by Elukey:

[operations/puppet@production] kafkatee:instance: change TLS CA bundle

https://gerrit.wikimedia.org/r/737923

Mentioned in SAL (#wikimedia-operations) [2021-11-10T16:26:43Z] <elukey> move kafkatee instances (analytics-test,centralog) to the new CA bundle - T291905

Change 737931 merged by Elukey:

[operations/puppet@production] profile::cache::kafka::webrequest: move atskafka to new CA bundle

https://gerrit.wikimedia.org/r/737931

Mentioned in SAL (#wikimedia-operations) [2021-11-10T16:28:40Z] <elukey> move atskafka to the new CA bundle - T291905

Change 737970 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move coal and navtiming to the new CA bundle

https://gerrit.wikimedia.org/r/737970

Change 737983 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::base::certificates: vary trusted_certs on realm

https://gerrit.wikimedia.org/r/737983

First thing to follow up - deployment-prep:

  1. We have a Kafka cluster in there (for example, to test eventgate). Should we move it to PKI (if available) or just keep using the Puppet certs?
  2. We currently deploy under /etc/ssl/localcerts the bundle of the production Puppet CA and the production PKI root CA, which of course don't work in cloud.

I had a chat with @jbond and PKI should be available for deployment-prep, but we need something like https://gerrit.wikimedia.org/r/737983 first (to make sure that clients can trust the right CAs). The only thing that we may need to add to profile::kafka::broker is the choice of the intermediate CA name to use, but that should be easy enough!

Change 737983 merged by Elukey:

[operations/puppet@production] profile::base::certificates: vary trusted_certs on realm

https://gerrit.wikimedia.org/r/737983

Change 738958 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Update deployment-prep's profile::base::certificates settings

https://gerrit.wikimedia.org/r/738958

Change 738958 merged by Elukey:

[operations/puppet@production] Update deployment-prep's profile::base::certificates settings

https://gerrit.wikimedia.org/r/738958

Change 739266 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:pki::client: manually deploy the root CA in cloud

https://gerrit.wikimedia.org/r/739266

Change 739266 merged by Jbond:

[operations/puppet@production] P:pki::client: manually deploy the root CA in cloud

https://gerrit.wikimedia.org/r/739266

Change 737970 merged by Elukey:

[operations/puppet@production] Move coal, navtiming and statsv to the new CA bundle

https://gerrit.wikimedia.org/r/737970

Change 739463 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::rsyslog: move Kafka TLS CA settings to the new bundle

https://gerrit.wikimedia.org/r/739463

Change 739475 had a related patch set uploaded (by Elukey; author: Elukey):

[analytics/refinery@master] gobblin: use the new jks TLS bundle to validate certificates

https://gerrit.wikimedia.org/r/739475

Change 739476 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Deploy the wmf_trusted_cas.jks bundle where Gobblin runs

https://gerrit.wikimedia.org/r/739476

Change 739806 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::kafka::webrequest: add pki settings

https://gerrit.wikimedia.org/r/739806

Change 739806 merged by Elukey:

[operations/puppet@production] profile::cache::kafka::webrequest: add pki settings

https://gerrit.wikimedia.org/r/739806

Change 740083 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] kubernetes: expose internal CA bundle to helm

https://gerrit.wikimedia.org/r/740083

Change 739476 merged by Elukey:

[operations/puppet@production] Deploy the wmf_trusted_cas.jks bundle where Gobblin runs

https://gerrit.wikimedia.org/r/739476

Change 740086 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Deploy the WMF Internal CAs bundle truststore to Hadoop test workers

https://gerrit.wikimedia.org/r/740086

Change 740086 merged by Elukey:

[operations/puppet@production] Deploy the WMF Internal CAs bundle truststore to Hadoop workers

https://gerrit.wikimedia.org/r/740086

At this point we need to migrate all Kafka clients using TLS to the new bundle before proceeding further with clusters. I think it is best to open a sub-task for each cluster.

Change 740091 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-test brokers to the PKI intermediate CA

https://gerrit.wikimedia.org/r/740091

Change 740091 merged by Elukey:

[operations/puppet@production] Move kafka-test brokers to the PKI intermediate CA

https://gerrit.wikimedia.org/r/740091

Change 739475 merged by Joal:

[analytics/refinery@master] gobblin: use the new jks TLS bundle to validate certificates

https://gerrit.wikimedia.org/r/739475

elukey@kafka-test1006:~$ openssl s_client -CAfile /etc/ssl/localcerts/wmf_trusted_root_CAs.pem -verify_hostname kafka-test1006.eqiad.wmnet kafka-test1006.eqiad.wmnet:9093 <<< "Q"
[..]

Certificate chain
 0 s:CN = kafka-test1006.eqiad.wmnet
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA

[..]
---
SSL handshake has read 2263 bytes and written 456 bytes
Verification: OK
Verified peername: kafka-test1006.eqiad.wmnet
---
[..]
DONE

Kafka test migration worked nicely :)
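
When repeating this check across many brokers, the relevant lines of the s_client output can be extracted with a small helper. This is a sketch under the assumption that the output has the same shape as the excerpts in this task ("Verification: OK" followed by "Verified peername: <host>"); the function name is hypothetical.

```python
import re
from typing import Optional


def verified_peername(s_client_output: str) -> Optional[str]:
    """Return the verified peer name, or None if verification did not succeed."""
    # Anything other than an explicit "Verification: OK" counts as a failure.
    if not re.search(r"^Verification: OK\s*$", s_client_output, re.MULTILINE):
        return None
    match = re.search(r"^Verified peername: (\S+)\s*$", s_client_output, re.MULTILINE)
    return match.group(1) if match else None
```

Comparing the returned name with the hostname that was passed to -verify_hostname gives a simple per-broker pass/fail signal.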

Change 740083 abandoned by Elukey:

[operations/puppet@production] kubernetes: expose internal CA bundle to helm

Reason:

Not needed!

https://gerrit.wikimedia.org/r/740083

Let's wait for T296089 before proceeding :)

Change 742482 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::kafka::Webrequest: always include profile::cache::kafka::certificate

https://gerrit.wikimedia.org/r/742482

Change 742482 merged by Jbond:

[operations/puppet@production] P:cache::kafka::Webrequest: always include profile::cache::kafka::certificate

https://gerrit.wikimedia.org/r/742482

Change 739463 merged by Elukey:

[operations/puppet@production] P:rsyslog::kafka_shipper: move Kafka TLS CA settings to the new bundle

https://gerrit.wikimedia.org/r/739463

Change 757800 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: add pki_intermediate_name parameter

https://gerrit.wikimedia.org/r/757800

Change 757800 abandoned by Elukey:

[operations/puppet@production] profile::kafka::broker: add pki_intermediate_name parameter

Reason:

Not needed since John is awesome

https://gerrit.wikimedia.org/r/757800

Change 773285 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: Rsyslog omkafka configs use new ca bundle

https://gerrit.wikimedia.org/r/773285

Change 773285 merged by Cwhite:

[operations/puppet@production] profile: Rsyslog omkafka configs use new ca bundle

https://gerrit.wikimedia.org/r/773285

elukey claimed this task.

The kafka logging clusters have the new PKI configuration:

openssl s_client -CAfile /etc/ssl/certs/ca-certificates.crt -verify_hostname kafka-logging1001.eqiad.wmnet kafka-logging1001.eqiad.wmnet:9093
[..]
SSL handshake has read 2270 bytes and written 459 bytes
Verification: OK
Verified peername: kafka-logging1001.eqiad.wmnet
[..]

The issue is solved for a real production cluster; the remaining work is tracked in the subtasks (for Kafka Jumbo and main). We can close this task and follow up in the other ones.

Change 737408 abandoned by Jbond:

[operations/puppet@production] P:trafficserver::backend: use ca provided by P:base::certificates

Reason:

This has since been implemented.

https://gerrit.wikimedia.org/r/737408