
Move Kafka Jumbo's TLS clients to the new bundle
Open, In Progress, MediumPublic

Description

The parent task describes the current migration of Kafka brokers to the new Kafka PKI Intermediate CA. We need to update Kafka TLS client configs to use a truststore/bundle that accepts TLS certificates signed by the new Intermediate or by the Puppet CA.

List of Jumbo clients:

  • FR kafkatee
  • SRE kafkatee
  • mirror maker
  • varnishkafka
  • atskafka
  • gobblin
  • netflow
  • eventgate analytics

Quickly verified on kafka-jumbo1001 with netstat -tuap | grep :9093 | awk '{print $4" "$5}' | sort | uniq but please let me know if I am missing any.
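For anyone repeating the check on other brokers, the same filtering can be sketched in Python. This is only an illustration of the netstat pipeline above, not a tool that exists in our repos: the function name is made up, and it assumes the usual netstat column layout (local address in column 4, foreign address in column 5).

```python
from __future__ import annotations


def kafka_tls_clients(netstat_output: str, port: int = 9093) -> list[str]:
    """Return the sorted, de-duplicated remote hosts connected to the TLS port.

    Hypothetical helper: expects the text printed by `netstat -tuap`.
    """
    suffix = f":{port}"
    clients = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a tcp/udp socket line.
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        local, remote = fields[3], fields[4]
        # Only keep sockets whose local side is the broker's TLS port.
        if not local.endswith(suffix):
            continue
        host, _, remote_port = remote.rpartition(":")
        # Listening sockets show "*" as the remote port; they are not clients.
        if remote_port == "*" or not host:
            continue
        clients.add(host)
    return sorted(clients)
```

Feeding it the captured output of the netstat command gives one line per client host, which is easier to diff between brokers than the raw address pairs.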

Event Timeline

odimitrijevic edited projects, added Analytics-Radar; removed Analytics.
elukey changed the task status from Open to Stalled.Nov 24 2021, 3:59 PM

Setting this to stalled until we agree on https://phabricator.wikimedia.org/T296089

Change 742671 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] atskafka: use the same ca certificate as varnishkafka

https://gerrit.wikimedia.org/r/742671

Change 742671 merged by Elukey:

[operations/puppet@production] atskafka: use the same ca certificate as varnishkafka

https://gerrit.wikimedia.org/r/742671

Change 742747 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] varnishkafka: use new ca bundle instead of the Puppet one

https://gerrit.wikimedia.org/r/742747

elukey changed the task status from Stalled to In Progress.Nov 30 2021, 3:40 PM

Change 742753 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] netflow: move kafka config to new CA bundle

https://gerrit.wikimedia.org/r/742753

@Jgreen Hi! I am trying to move the Kafka Jumbo brokers' TLS certs to the new PKI Intermediate CA dedicated to them, which will finally allow us to have per-host TLS certificates and stop using the Puppet CA. Before doing any switch, all clients need to trust both the Root PKI CA cert and the Puppet CA one, so that I'll be able to move one broker at a time without impacting clients.

The client TLS certificates for the moment will not be touched.

We have created some helper functions and Puppet code in profile::base::certificates; for prod we are basically using what's provided by the wmf-certificates package (which provides /etc/ssl/certs/wmf-ca-certificates.crt). I am not familiar with the code that you run on Fundraising, so let me know if it is feasible to move kafkatee's config to the new bundle on your side.
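A rough way to check from a client host that a broker's certificate verifies against the bundle is sketched below. It is only a sketch under assumptions: the function name is made up, it uses Python's standard ssl module rather than anything from our Puppet code, and it only tests the server side of the handshake. If the broker insists on a client certificate, a different ssl.SSLError may be raised instead, and that is deliberately not caught.

```python
import socket
import ssl


def broker_cert_is_trusted(host: str, port: int, ca_bundle: str,
                           timeout: float = 5.0) -> bool:
    """Return True if the broker's certificate verifies against ca_bundle.

    Hypothetical helper. Passing cafile means only the CAs in that file are
    trusted, not the system defaults. A missing bundle file raises
    FileNotFoundError before any connection is attempted.
    """
    context = ssl.create_default_context(cafile=ca_bundle)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The handshake (chain and hostname verification) happens here.
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        # Untrusted chain or hostname mismatch.
        return False
```

It would be called with a broker's fully qualified hostname, port 9093 and /etc/ssl/certs/wmf-ca-certificates.crt. With the new bundle it should return True both before and after a broker moves to the PKI-issued certificate.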

More info in T296089#7537901

Thanks in advance!

Hey @elukey, this should not be a problem however this is exactly the wrong time of year to mess with the kafkatee pipeline. Can we postpone until early January?

Sure makes sense, we can postpone it. I'll try to work on other clusters before Jumbo :)

elukey changed the task status from In Progress to Stalled.Nov 30 2021, 5:30 PM

Back to stalled, let's do it in January!

Change 742753 merged by Elukey:

[operations/puppet@production] netflow: move kafka config to new CA bundle

https://gerrit.wikimedia.org/r/742753

elukey changed the task status from Stalled to In Progress.Jan 11 2022, 8:21 AM

Back to in-progress, the FR kafkatee instances moved to the new bundle!

Change 752992 had a related patch set uploaded (by Elukey; author: Elukey):

[eventgate-wikimedia@master] blubber: add wmf-certificates to the Docker images

https://gerrit.wikimedia.org/r/752992

Next steps:

Change 752992 merged by Ottomata:

[eventgate-wikimedia@master] blubber: add wmf-certificates to the Docker images

https://gerrit.wikimedia.org/r/752992

Change 753425 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: move eventgate-analytics* to the WMF CA cert bundle

https://gerrit.wikimedia.org/r/753425

Change 753428 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/eventstreams@master] blubber: deploy the wmf-certificates package in prod

https://gerrit.wikimedia.org/r/753428

Change 753428 merged by Elukey:

[mediawiki/services/eventstreams@master] blubber: deploy the wmf-certificates package in prod

https://gerrit.wikimedia.org/r/753428

Mentioned in SAL (#wikimedia-analytics) [2022-01-26T10:07:27Z] <btullis> btullis@cumin1001:~$ sudo cumin 'O:cache::upload or O:cache::text' 'disable-puppet btullis-T296064-T299401'

Change 742747 merged by Btullis:

[operations/puppet@production] varnishkafka: use new ca bundle instead of the Puppet one

https://gerrit.wikimedia.org/r/742747

The last clients to move should be eventstreams and eventgate!

Next steps:

Change 753425 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: move eventgate* to the WMF CA cert bundle

https://gerrit.wikimedia.org/r/753425

Mentioned in SAL (#wikimedia-operations) [2022-01-26T14:41:39Z] <ottomata> deploying new CA certs for all eventgate services... T296064

Mentioned in SAL (#wikimedia-operations) [2022-01-26T15:24:45Z] <ottomata> paused (for meetings) in deploying new CA certs for all eventgate services, still TODO: eventgate-analytics-external, eventgate-main - T296064

Mentioned in SAL (#wikimedia-operations) [2022-01-27T14:54:15Z] <ottomata> continuing deployments of eventgate-main and eventgate-analytics to pick up CA cert changes - T296064 (also deploying eventgate-main for a schema repo bump for search)

Change 757672 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] eventstreams: move kafka config to new ca-bundle

https://gerrit.wikimedia.org/r/757672

Change 757672 merged by Ottomata:

[operations/deployment-charts@master] eventstreams: move kafka config to new ca-bundle

https://gerrit.wikimedia.org/r/757672

From what I can see from netstat on Jumbo nodes, all the clients that may be affected by this transition have been ported to the new CA bundle. This means that we could move one broker to the new PKI once we feel it is the right time.

There is still a question mark about what to do in deployment-prep, since IIRC there is a Kafka cluster there using TLS certs.

Getting back to the task: we have moved kafka logging-codfw to PKI (eqiad will follow soon), so the whole upgrade workflow has now proven to be sound and safe (previously we had only tested it on kafka test).

Next steps:

  • Think about deployment-prep and move the hosts to PKI in there too.
  • Move Kafka Jumbo to PKI

Is this something that the DE team can work on in the coming months?

BTW, @elukey I will likely reach out to you about how to do PKI right from the get go for T314156: Q1:rack/setup/install kafka-stretch100[12]

Sure! In theory all that is needed is:

profile::kafka::broker::ssl_generate_certificates: true
profile::base::certificates::include_bundle_jks: true

And the keystore password set in Puppet private (as we currently do for the puppet-based certs).

Do you think that we could try to test kafka 2.x or 3.x for the new cluster? Or would it derail or slow things down too much? I can help of course!

I added a detailed plan for kafka-main in T319372 :)

Change 901549 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::jumbo::broker: enable PKI migration settings

https://gerrit.wikimedia.org/r/901549