Move Kafka Jumbo's TLS clients to the new bundle
Closed, Resolved (Public)

Description

The parent task describes the current migration of Kafka brokers to the new Kafka PKI Intermediate CA. We need to update Kafka TLS client configs to use a truststore/bundle that accepts TLS certificates signed by the new Intermediate or by the Puppet CA.

List of Jumbo clients:

  • FR kafkatee
  • SRE kafkatee
  • mirror maker
  • varnishkafka
  • atskafka
  • gobblin
  • netflow
  • eventgate analytics

Quickly verified on kafka-jumbo1001 with netstat -tuap | grep :9093 | awk '{print $4" "$5}' | sort | uniq but please let me know if I am missing any.
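The same check can be sketched in Python for anyone who wants one entry per client host rather than one per connection. This is only an illustration: the function name `unique_peers` is made up, it assumes the usual `netstat -tuap` column layout (local address in the fourth column, foreign address in the fifth), and unlike the `grep` above it only looks at connections whose local side is port 9093.

```python
def unique_peers(netstat_output: str, port: int = 9093) -> list[str]:
    """Return the sorted, de-duplicated remote hosts connected to local `port`.

    Assumes `netstat -tuap` style lines:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    """
    suffix = f":{port}"
    peers = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/udp socket line.
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        local, foreign = fields[3], fields[4]
        if not local.endswith(suffix):
            continue
        # Drop the ephemeral client port so repeated connections collapse.
        host, _, _ = foreign.rpartition(":")
        # Listening sockets show a wildcard foreign address; ignore them.
        if host and host not in ("0.0.0.0", "[::]", "*"):
            peers.add(host)
    return sorted(peers)
```

Feeding it the output of `netstat -tuap` (for example via `subprocess.run(..., capture_output=True, text=True).stdout`) gives the list of hosts to compare against the client list above. Note that without `-n` netstat may truncate long hostnames.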

Event Timeline

odimitrijevic edited projects, added Analytics-Radar; removed Analytics.
elukey changed the task status from Open to Stalled. Nov 24 2021, 3:59 PM

Setting this to stalled until we agree on https://phabricator.wikimedia.org/T296089

Change 742671 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] atskafka: use the same ca certificate as varnishkafka

https://gerrit.wikimedia.org/r/742671

Change 742671 merged by Elukey:

[operations/puppet@production] atskafka: use the same ca certificate as varnishkafka

https://gerrit.wikimedia.org/r/742671

Change 742747 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] varnishkafka: use new ca bundle instead of the Puppet one

https://gerrit.wikimedia.org/r/742747

elukey changed the task status from Stalled to In Progress. Nov 30 2021, 3:40 PM

Change 742753 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] netflow: move kafka config to new CA bundle

https://gerrit.wikimedia.org/r/742753

@Jgreen Hi! I am trying to move the Kafka Jumbo brokers' TLS certs to the new PKI Intermediate CA dedicated to them, which will finally allow us to have per-host TLS certificates and stop using the Puppet CA. Before doing any switch, all clients need to trust both the Root PKI CA cert and the Puppet CA one, so that I'll be able to move one broker at a time without impacting clients.

The client TLS certificates for the moment will not be touched.

We have created some helper functions and Puppet code in profile::base::certificates; for prod we are basically using what's provided by the wmf-certificates package (which ships /etc/ssl/certs/wmf-ca-certificates.crt). I am not familiar with the code that you run on Fundraising, so let me know if it is feasible to move kafkatee's config to the new bundle on your side.

More info in T296089#7537901

Thanks in advance!
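For clients that want to confirm a bundle works before a broker is switched, a minimal sketch follows. It assumes Python 3.7+ on the client host; the helper names `make_client_context` and `broker_cert_is_trusted` are hypothetical, and only the bundle path comes from the comment above. It checks the server side of the handshake only: a broker that requires client certificates may still reject the connection with a different `ssl.SSLError`, which this sketch lets propagate.

```python
import socket
import ssl

# Bundle path as mentioned in the comment above (shipped by wmf-certificates).
WMF_BUNDLE = "/etc/ssl/certs/wmf-ca-certificates.crt"


def make_client_context(cafile=None):
    """Build a verifying client context.

    With cafile set, only the CAs in that file are trusted; with None,
    the system default trust store is loaded instead.
    """
    return ssl.create_default_context(cafile=cafile)


def broker_cert_is_trusted(host, port=9093, cafile=WMF_BUNDLE, timeout=5.0):
    """Return True if the broker's certificate chain validates against cafile."""
    ctx = make_client_context(cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        # Chain or hostname validation failed for this bundle.
        return False
```

Running `broker_cert_is_trusted(...)` against a broker still on Puppet CA certs and against one already on PKI should return True in both cases if the bundle really contains both roots.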

Hey @elukey, this should not be a problem; however, this is exactly the wrong time of year to mess with the kafkatee pipeline. Can we postpone until early January?

Sure, makes sense, we can postpone it. I'll try to work on other clusters before Jumbo :)

elukey changed the task status from In Progress to Stalled. Nov 30 2021, 5:30 PM

Back to stalled, let's do it in January!

Change 742753 merged by Elukey:

[operations/puppet@production] netflow: move kafka config to new CA bundle

https://gerrit.wikimedia.org/r/742753

elukey changed the task status from Stalled to In Progress. Jan 11 2022, 8:21 AM

Back to in-progress, the FR kafkatee instances moved to the new bundle!

Change 752992 had a related patch set uploaded (by Elukey; author: Elukey):

[eventgate-wikimedia@master] blubber: add wmf-certificates to the Docker images

https://gerrit.wikimedia.org/r/752992

Next steps:

Change 752992 merged by Ottomata:

[eventgate-wikimedia@master] blubber: add wmf-certificates to the Docker images

https://gerrit.wikimedia.org/r/752992

Change 753425 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] helmfile.d: move eventgate-analytics* to the WMF CA cert bundle

https://gerrit.wikimedia.org/r/753425

Change 753428 had a related patch set uploaded (by Elukey; author: Elukey):

[mediawiki/services/eventstreams@master] blubber: deploy the wmf-certificates package in prod

https://gerrit.wikimedia.org/r/753428

Change 753428 merged by Elukey:

[mediawiki/services/eventstreams@master] blubber: deploy the wmf-certificates package in prod

https://gerrit.wikimedia.org/r/753428

Mentioned in SAL (#wikimedia-analytics) [2022-01-26T10:07:27Z] <btullis> btullis@cumin1001:~$ sudo cumin 'O:cache::upload or O:cache::text' 'disable-puppet btullis-T296064-T299401'

Change 742747 merged by Btullis:

[operations/puppet@production] varnishkafka: use new ca bundle instead of the Puppet one

https://gerrit.wikimedia.org/r/742747

The last clients to move should be eventstreams and eventgate!

Next steps:

Change 753425 merged by Elukey:

[operations/deployment-charts@master] helmfile.d: move eventgate* to the WMF CA cert bundle

https://gerrit.wikimedia.org/r/753425

Mentioned in SAL (#wikimedia-operations) [2022-01-26T14:41:39Z] <ottomata> deploying new CA certs for all eventgate services... T296064

Mentioned in SAL (#wikimedia-operations) [2022-01-26T15:24:45Z] <ottomata> paused (for meetings) in deploying new CA certs for all eventgate services, still TODO: eventgate-analytics-external, eventgate-main - T296064

Mentioned in SAL (#wikimedia-operations) [2022-01-27T14:54:15Z] <ottomata> continuing deployments of eventgate-main and eventgate-analytics to pick up CA cert changes - T296064 (also deploying eventgate-main for a schema repo bump for search)

Change 757672 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] eventstreams: move kafka config to new ca-bundle

https://gerrit.wikimedia.org/r/757672

Change 757672 merged by Ottomata:

[operations/deployment-charts@master] eventstreams: move kafka config to new ca-bundle

https://gerrit.wikimedia.org/r/757672

From what I can see from netstat on Jumbo nodes, all the clients that may be affected by this transition have been ported to the new CA bundle. This means that we could move one broker to the new PKI once we feel it is the right time.

Open question: what to do in deployment-prep, since IIRC there is a Kafka cluster in there using TLS certs.

Getting back to the task: we have moved Kafka logging-codfw to PKI (eqiad will follow soon), so the whole upgrade workflow has proven sound and safe (previously we had only tested it on Kafka test).

Next steps:

  • Think about deployment-prep and move the hosts to PKI in there too.
  • Move Kafka Jumbo to PKI

Is it something that the DE team can work on in the coming months?

BTW, @elukey I will likely reach out to you about how to do PKI right from the get go for T314156: Q1:rack/setup/install kafka-stretch100[12]

Sure! In theory all that is needed is:

profile::kafka::broker::ssl_generate_certificates: true
profile::base::certificates::include_bundle_jks: true

And the keystore password set in Puppet private (as we currently do for the puppet-based certs).

Do you think that we could try to test Kafka 2.x or 3.x for the new cluster? Or would it derail or slow things down too much? I can help of course!

I added a detailed plan for kafka-main in T319372 :)

Change 901549 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::jumbo::broker: enable PKI migration settings

https://gerrit.wikimedia.org/r/901549

Change 901549 merged by Elukey:

[operations/puppet@production] role::kafka::jumbo::broker: enable PKI migration settings

https://gerrit.wikimedia.org/r/901549

The cluster is now running with the extended trust store (containing both Puppet and PKI's root CA certs).

Next steps:

  • Move kafka-jumbo1001 to PKI

@Jgreen Hi! Just a heads-up that we are going to proceed with this; let us know if you see any issue on your side. In theory after T296765 we should be good, but better to verify just in case :) We'll move a single broker to the new TLS certs, so in case something doesn't work your kafkatee instance should move to another broker.

Change 903245 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-jumbo1001's kafka broker to PKI certs

https://gerrit.wikimedia.org/r/903245

Ok, thanks for the heads up!

Change 903245 merged by Elukey:

[operations/puppet@production] Move kafka-jumbo1001's kafka broker to PKI certs

https://gerrit.wikimedia.org/r/903245

Mentioned in SAL (#wikimedia-operations) [2023-03-29T09:02:52Z] <elukey> move kafka on kafka-jumbo1001 to PKI TLS certs - T296064

Next steps:

  • Move the remaining nodes to PKI

Change 904455 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::jumbo::broker: upgrade all brokers to PKI

https://gerrit.wikimedia.org/r/904455

I've read https://www.golinuxcloud.com/troubleshooting-tls-failures-wireshark/ and found the following tshark filter to use on kafka-jumbo1001 to verify if a client is having trouble with the TLS handshake:

tshark -f "port 9093" -Y "ssl.record.content_type == 21"
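For context on the filter: content type 21 is the TLS alert record, which is what a peer sends when it rejects a handshake. The sketch below decodes a raw record header to show what the filter is matching. The function name and lookup tables are illustrative and list only a few alert codes; it assumes plaintext alerts as seen during a TLS 1.2 handshake. In TLS 1.3, alerts sent after the handshake keys are established are encrypted and appear on the wire as content type 23, so a filter on type 21 would not see them. Newer Wireshark releases also name this field `tls.record.content_type`, as far as I know.

```python
import struct

# Record-layer content types (TLS 1.2 numbering).
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}
ALERT_LEVELS = {1: "warning", 2: "fatal"}
# A small subset of alert descriptions relevant to certificate problems.
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    40: "handshake_failure",
    42: "bad_certificate",
    46: "certificate_unknown",
    48: "unknown_ca",
}


def describe_tls_record(record: bytes) -> str:
    """Describe a raw TLS record from its 5-byte header (type, version, length)."""
    if len(record) < 5:
        raise ValueError("a TLS record header is 5 bytes long")
    content_type, _major, _minor, length = struct.unpack("!BBBH", record[:5])
    # A plaintext alert body is exactly two bytes: level and description.
    if content_type == 21 and length == 2 and len(record) >= 7:
        level = ALERT_LEVELS.get(record[5], str(record[5]))
        desc = ALERT_DESCRIPTIONS.get(record[6], str(record[6]))
        return f"alert: {level} {desc}"
    return CONTENT_TYPES.get(content_type, f"unknown({content_type})")
```

A client that does not trust the new CA would typically answer the broker's certificate with a fatal `unknown_ca` or `bad_certificate` alert, which is the kind of packet the tshark filter is meant to surface.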

So far all good, I'll keep monitoring for a while before proceeding with https://gerrit.wikimedia.org/r/904455

Change 904455 merged by Elukey:

[operations/puppet@production] role::kafka::jumbo::broker: upgrade all brokers to PKI

https://gerrit.wikimedia.org/r/904455

Mentioned in SAL (#wikimedia-operations) [2023-03-31T09:02:18Z] <elukey> move kafka-jumbo1002's kafka broker cert to PKI - T296064

Status: all brokers are getting the new TLS certificates via Puppet. I'll keep restarting one broker at a time over the next few days so we can monitor clients etc.

Mentioned in SAL (#wikimedia-operations) [2023-03-31T09:54:46Z] <elukey> move kafka-jumbo1003's kafka broker cert to PKI - T296064

Mentioned in SAL (#wikimedia-operations) [2023-03-31T13:12:48Z] <elukey> move kafka-jumbo1004's kafka broker cert to PKI - T296064

Mentioned in SAL (#wikimedia-operations) [2023-04-03T06:52:43Z] <elukey> move kafka-jumbo1005's kafka broker cert to PKI - T296064

Mentioned in SAL (#wikimedia-operations) [2023-04-03T07:43:02Z] <elukey> move kafka-jumbo1006's kafka broker cert to PKI - T296064

Mentioned in SAL (#wikimedia-operations) [2023-04-03T08:03:24Z] <elukey> move kafka-jumbo1008's kafka broker cert to PKI - T296064

Mentioned in SAL (#wikimedia-operations) [2023-04-03T08:54:30Z] <elukey> move kafka-jumbo1009's kafka broker cert to PKI - T296064

Mentioned in SAL (#wikimedia-operations) [2023-04-03T09:19:44Z] <elukey> move kafka-jumbo1006's kafka broker cert to PKI - T296064

The cluster runs on PKI! \o/

Next steps:

  • Clean up old TLS certificates in puppet private before closing

Last steps:

  • clean up certs in puppet private + revoke them in the puppet CA
  • verify if any change is needed in deployment-prep

Re-added the Data Engineering tag to triage this task again (it requires manual work from the team, see above). It should be quick and easy; the cluster has already been ported to PKI and it works fine.

elukey claimed this task.

All done!