Move Kafka logging to the new intermediate PKI
Closed, Resolved, Public

Description

Hi folks!

The new Kafka intermediate PKI certificates have been deployed to the Kafka test cluster successfully, and these follow-ups were done:

  1. Added TLS expiry checks for all brokers
  2. Set the default expiry time for broker certs to 1y (since it seems that we cannot reload the kafka keystores dynamically, more info T299409)
  3. Tested clients, etc.

The new certificates have two main benefits:

  1. The possibility to enable ssl hostname verification (currently failing for clients due to how the cergen certs that we have for kafka are made, see the parent task)
  2. Slowly abandoning the Puppet CA in favor of the PKI CA
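
To illustrate the first benefit, here is a minimal Python sketch of a TLS client context with hostname verification enabled. It uses only the standard `ssl` module. The helper name `build_verifying_context` is hypothetical, and this is not how the Kafka clients in this task are actually configured; it only shows what "hostname verification" means on the client side.

```python
import ssl
from typing import Optional


def build_verifying_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Return a client-side TLS context that checks both the chain and the hostname.

    ca_bundle is an optional path to a PEM CA bundle; when omitted, the
    system default trust store is used.
    """
    ctx = ssl.create_default_context(cafile=ca_bundle)
    # create_default_context() already enables both settings; they are set
    # explicitly here to make the intent visible.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx
```

With `check_hostname` enabled, a handshake fails when the name the client connected to does not appear in the server certificate, which is the failure mode described above for the cergen certs.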

Given the amount of traffic on other clusters, like Jumbo and Main, the logging cluster seems to be a good candidate for the transition. If you like the idea:

  1. I'll update clients pushing data to Kafka logging to use the new ca bundle (for example, https://gerrit.wikimedia.org/r/c/operations/puppet/+/739463)
  2. I'll prepare the cluster for the transition. There is a procedure, which we have tested in Kafka test, to roll out the new changes one broker at a time without disrupting traffic (more details to come in the task, of course). The caveat is that all clients need to trust both the Puppet and PKI CAs to avoid validation troubles.
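
The caveat about trusting both CAs can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and in production the bundle comes from the wmf-certificates package rather than from a script like this. The sketch merges several PEM bundles into one and drops duplicate certificates.

```python
import re
from typing import Iterable, List

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def split_pem(text: str) -> List[str]:
    """Return every PEM certificate block found in text, in order."""
    return PEM_BLOCK.findall(text)


def combine_bundles(bundles: Iterable[str]) -> str:
    """Concatenate PEM bundles, keeping the first copy of each certificate."""
    seen = set()
    blocks = []
    for text in bundles:
        for block in split_pem(text):
            if block not in seen:
                seen.add(block)
                blocks.append(block)
    return "\n".join(blocks) + "\n"
```

A client that loads the combined bundle can validate a broker regardless of which CA issued its certificate, which is what allows brokers to be moved one at a time.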

Details

Repo               Branch      Lines +/-
operations/puppet  production  +0 -4
operations/puppet  production  +5 -10
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +1 -3
operations/puppet  production  +2 -0
operations/puppet  production  +2 -0
operations/puppet  production  +12 -3
operations/puppet  production  +6 -0
operations/puppet  production  +9 -3
operations/puppet  production  +42 -13
operations/puppet  production  +16 -9
operations/puppet  production  +2 -2
operations/puppet  production  +2 -2
operations/puppet  production  +8 -1
operations/puppet  production  +8 -0
operations/puppet  production  +20 -15
operations/puppet  production  +1 -1

Event Timeline

This is a good step forward. Thank you!

I realize deployment-prep may not be in scope for this project, but we have a vested interest in keeping beta-logs.wm.o working.

The WMCS deployment-prep->logging environment works, but with caveats. The current method of certificate validation is not ideal. The short version of the problem is:

  1. kafka must run in the deployment-prep environment for correct provisioning of certificates for the rsyslog->kafka ssl connection
  2. the librdkafka version in rsyslog is somewhat old and does not expose some ssl configuration options
  3. the truststore must be manually copied from deployment-prep to the logging environment to enable the logstash->kafka ssl connection

Is it possible to use this change for the benefit of deployment-prep->logging environments as well?

When I worked with John on T296089 we wanted to give a way to deploy bundles across realms, so it is in scope to migrate deployment-prep as well if it is a critical piece of your testing infra :)

So IIUC there are two separate WMCS projects, deployment-prep and logging, and currently you are:

  1. Sending data from rsyslog to kafka within the deployment-prep project
  2. Pulling data from kafka to logstash (deployment-prep -> logging)

If the above is true I definitely understand why copying the truststore is needed. There is a config option in puppet called profile::base::certificates::include_bundle_jks, that is meant to create a truststore containing the root CA certs of the PKI + Puppet masters set on a given WMCS project (in prod it returns the content of the wmf-certificates package that is the canonical source of truth). The copying part is probably not avoidable, but we can try to see if there is a workaround.

What do you have in mind? We can work together on moving the WMCS setup to PKI first (in deployment-prep there should be a test PKI infra), lemme know what you'd prefer to do.

Thanks!

Change 757661 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert

https://gerrit.wikimedia.org/r/757661

Change 757661 merged by Elukey:

[operations/puppet@production] mediawiki::logging::yaml_defs: use wmf-certificates' bundle as CA cert

https://gerrit.wikimedia.org/r/757661

Status update: kafka producers (rsyslog on regular nodes and mwdebug k8s) have been migrated to the new ca bundle.

The next step is to migrate the logstash kafka consumer to the new bundle.

@colewhite IIUC we should migrate all the logstash::input::kafka occurrences to a new jks bundle containing both the Root PKI and the Puppet CA certs. The bundle is available via profile::base::certificates, simply by turning on the hiera flag profile::base::certificates::include_bundle_jks. Is it something that we can do, or do you prefer to wait?

Change 763110 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] logstash::input::kafka: allow a custom truststore path

https://gerrit.wikimedia.org/r/763110

Change 763113 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::logstash::beta: move to profile::base::certificate's truststore

https://gerrit.wikimedia.org/r/763113

Change 763172 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::logstash::production: use base truststore

https://gerrit.wikimedia.org/r/763172

jbond triaged this task as Medium priority.Feb 16 2022, 4:56 PM
jbond subscribed.

Change 763110 merged by Elukey:

[operations/puppet@production] logstash::input::kafka: allow a custom truststore path

https://gerrit.wikimedia.org/r/763110

Very interesting use case in https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113, namely Beta/deployment-prep. We have two sets of VMs:

  • Kafka logging (cluster), in deployment-prep, currently using certs created by the related puppet CA.
  • Logstash on logging (project), that pulls from Kafka logging (cluster) using as truststore the Puppet CA public cert stored in the labs_private repo.

This is not easy since profile::base::certificates, afaics, will create a bundle in the logging project/hosts containing the logging Puppet CA cert, not the one used by the Kafka cluster (deployment-prep Puppet CA).
PKI is configured for deployment-prep, but in theory it should be possible to point the logging project hosts to it via profile::pki::client.
We'll not be able, IIUC, to deploy a bundle able to allow us to transition the deployment-prep kafka logging cluster transparently to PKI without causing issues in consumers (in this case, logstash on the logging project).

A possible way forward is to move the Kafka logging cluster in deployment-prep to PKI, stopping all the nodes using it, and instructing the logging project's logstash nodes to use the profile::base::certificate jks bundle. The jks will contain the wrong Puppet CA, but it will package (hopefully) the deployment-prep's PKI root cert that is what we need.

@jbond (if you have time) - is what I wrote above right? Or is there a better way?

but in theory it should be possible to point the logging project hosts to it via profile::pki::client.

that and adding the logging puppet public ca to the pki project, see https://wikitech.wikimedia.org/wiki/PKI/Cloud#Adding_a_new_project_to_pki-intermediate

We'll not be able, IIUC, to deploy a bundle able to allow us to transition the deployment-prep kafka logging cluster transparently to PKI without causing issues in consumers (in this case, logstash on the logging project).

The pki end point in cloud uses puppet:///modules/profile/pki/cloud/pki_api_ca.pem as the trust store for client requests. To get the logging project to work with pki we will need to add its puppet master ca certificate to this file as well. As such deployment-prep and logging could use this file as well IIUC

A possible way forward is to move the Kafka logging cluster in deployment-prep to PKI,

+1 to this

stopping all the nodes using it, and instructing the logging project's logstash nodes to use the profile::base::certificate jks bundle. The jks will contain the wrong Puppet CA, but it will package (hopefully) the deployment-prep's PKI root cert that is what we need.

@jbond (if you have time) - is what I wrote above right? Or is there a better way?

This all sounds like a good way forward to me but may need some caressing as we go :)

@colewhite lemme know what you prefer to test :)

Change 769711 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: add pki to logging env

https://gerrit.wikimedia.org/r/769711

Change 769711 merged by Cwhite:

[operations/puppet@production] hiera: add pki to logging env

https://gerrit.wikimedia.org/r/769711

Reverted due to puppet failures:

  1. I think the cloud puppetmaster doesn't have a cert at /etc/ssl/certs/%{lookup('profile::pki::client::root_ca_cn')}.pem ? (Expects file:///etc/ssl/certs/WMF_TEST_CA.pem)
  2. Class[Java]: expects a value for parameter 'java_packages' (file: /etc/puppet/modules/java/manifests/cacert.pp, line: 15, column: 5) (file: /etc/puppet/modules/sslcert/manifests/trusted_ca.pp, line: 63)
    1. This seems odd because we're configuring java_packages via profile::java.

@colewhite I had a chat with John and the only current supported way is to have a self-hosted puppet master in the cloud project, so I am wondering if this is something that you'd be willing to support/add in the logging project or if you'd prefer not. If you are ok to add a puppetmaster we could work together on it, afterwards we can configure it to work with PKI and restart our Beta/Logging upgrade plan :)

It's a bummer, but we knew it was a possibility going into it. I'd rather not run a puppetmaster if possible so we can fall back on the manual key installation for now.

Let's configure puppet in deployment-prep to write the new Logstash keystore somewhere and update applicable documentation.

A possible way forward is to modify https://gerrit.wikimedia.org/r/c/operations/puppet/+/763113 to avoid the profile::base::certificates profile, and modify the current logstash jks truststore files to include the Root PKI certificate. Do you know where the current truststore for the logstash logging instances is picked up from?

On the logstash-logging-01 node I see /etc/logstash/kafka_logging-beta.truststore.jks, which contains the puppet PKI certificate of deployment-prep, so if we find where it is defined we can definitely add another one. In logstash::input::kafka I see that the private secrets module seems to be used, but I am a little confused about how it works now for the logging project (the fake private repo contains fake secrets/truststores only for PCC, for example).

The jks keystore was generated on deployment-logstash03 (now deleted) and manually copied to logging-logstash-01.

deployment-kafka-logging01 seems a logical place to write the keystore.

Perfect, so deployment-kafka-logging01 simply needs to get profile::base::certificates::include_bundle_jks: true from hiera (not sure what the best place to set the option is nowadays, horizon prefix-puppet?) and then the truststore will be generated at the first puppet run. Then it could be copied over to logging-logstash-01 and we'd be done (profile::base::certificates creates the jks bundle using the changeit password).

@colewhite on logging-logstash-01.logging.eqiad1.wikimedia.cloud there is /etc/ssl/localcerts/wmf-java-cacerts, a jks that should contain the two Root CA certs that we need (password: changeit). In theory now the next step is to configure the logstash kafka client to use it, and if it works then we'll be able to just upgrade deployment-kafka-logging to the kafka PKI transparently. What do you think?
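
One way to confirm what a truststore contains is to list its entries with the JDK `keytool` command. The Python sketch below builds and runs that command; the helper names are hypothetical, and it assumes `keytool` is on the PATH of the host where it runs. The default password matches the changeit value mentioned above.

```python
import subprocess
from typing import List


def keytool_list_command(truststore: str, password: str = "changeit") -> List[str]:
    """Build the keytool invocation that lists the entries of a JKS truststore."""
    return ["keytool", "-list", "-keystore", truststore, "-storepass", password]


def list_truststore(truststore: str, password: str = "changeit") -> str:
    """Run keytool and return its output; raises CalledProcessError on failure."""
    result = subprocess.run(
        keytool_list_command(truststore, password),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `list_truststore("/etc/ssl/localcerts/wmf-java-cacerts")` would print the aliases stored in the bundle, which should include both root CAs if the bundle was generated as described.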

Change 771737 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] beta-logs: use new kafka truststore

https://gerrit.wikimedia.org/r/771737

What do you think?

The new truststore works. Let's have Logstash use it.

Change 771737 merged by Cwhite:

[operations/puppet@production] beta-logs: use new kafka truststore

https://gerrit.wikimedia.org/r/771737

Change 771816 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Fix root PKI CA CN for deployment-prep

https://gerrit.wikimedia.org/r/771816

Change 771816 abandoned by Elukey:

[operations/puppet@production] Fix root PKI CA CN for deployment-prep

Reason:

https://gerrit.wikimedia.org/r/771816

Change 771905 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::rsyslog: add new cabundle paths for omkafka

https://gerrit.wikimedia.org/r/771905

@colewhite I was able to move the deployment-prep's kafka logging host to PKI, the new TLS settings seem to work but lemme know if you see anything weird on the logstash side.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/771905 is needed to complete the rollout though, since I realized that not all omkafka configs have the new cabundle. The code review seems really no-impact since we already run omkafka with the new cabundle, but I am fairly ignorant about this bit of rsyslog config so let me know what you think.

In the meantime, I have rolled back kafka logging in deployment-prep to the old TLS certs via the per-host hiera key profile::kafka::broker::ssl_generate_certificates.

@colewhite I was able to move the deployment-prep's kafka logging host to PKI, the new TLS settings seem to work but lemme know if you see anything weird on the logstash side.

Looks like all logs were dropped during the transition time: https://beta-logs.wmcloud.org/goto/9b1a56afc7153dc2dbbe2d575252601d

Logstash says it can't find a valid certification path:

PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Change 771905 merged by Elukey:

[operations/puppet@production] profile::rsyslog: add new cabundle paths for omkafka

https://gerrit.wikimedia.org/r/771905

Change 763113 abandoned by Elukey:

[operations/puppet@production] profile::logstash::beta: move to profile::base::certificate's truststore

Reason:

We followed another road, see T300130

https://gerrit.wikimedia.org/r/763113

There were two time windows in which kafka logging wasn't available: the broker's TLS PKI cert was wrong and it was causing kafka connections to drop. After John's fix to the PKI Beta infra the broker finally got a valid cert, and then I reverted it to its original state. Let's try to roll it out again today when you are online, it should work without problems!

If the Beta experiment works, I think that we are ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/763172 :)

Change 763172 merged by Cwhite:

[operations/puppet@production] profile::logstash::production: use base truststore

https://gerrit.wikimedia.org/r/763172

If the Beta experiment works, I think that we are ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/763172 :)

Thanks to Cole, this is done!

Next steps:

  • Review the list of Kafka logging clients to see if we forgot anything.
  • Add settings to the kafka logging cluster to allow its transition to PKI (super users, etc.)
  • Move the first broker to PKI

Change 772788 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::logging: add PKI migration settings

https://gerrit.wikimedia.org/r/772788

Change 773285 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: Rsyslog omkafka configs use new ca bundle

https://gerrit.wikimedia.org/r/773285

Change 773285 merged by Cwhite:

[operations/puppet@production] profile: Rsyslog omkafka configs use new ca bundle

https://gerrit.wikimedia.org/r/773285

Change 772788 merged by Elukey:

[operations/puppet@production] role::kafka::logging: add PKI migration settings

https://gerrit.wikimedia.org/r/772788

@colewhite hi! There is no rush at the moment of course, but I am wondering which remaining clients need to be migrated before we can switch the brokers' TLS certs to PKI.

@colewhite hi! Periodic ping to see if we can move forward with this task. IIRC there were some clients to move to the new bundle, what's the status? Thanks :)

Ping again @colewhite to see if we can proceed or not during the next months :)

Followed up offline. @elukey and I are scheduling a time to complete this.

@colewhite do you know if there are remaining clients still not using the new TLS bundle? IIRC there were a couple that needed to be fixed before proceeding. Once we have all clients reviewed and updated, we can start with upgrading one broker and monitor for any anomalies (the current puppet settings allow a broker with the new PKI TLS cert and the other ones with the puppet based ones).

Clients:

  • rsyslog: /etc/ssl/certs/wmf-ca-certificates.crt
  • logstash collectors: /etc/ssl/localcerts/wmf-java-cacerts
  • kafkatee on centrallog: /etc/ssl/certs/wmf-ca-certificates.crt
  • !!! apifeatureusage collectors: /etc/logstash/kafka_logging-eqiad.truststore.jks

Might be all that is left is to migrate the apifeatureusage collectors.
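
The audit above can be expressed as a small Python check. The client names and paths are copied from the list; treating the two wmf-certificates based paths as the "new" bundles is an inference from the earlier comments in this task, and the function name is hypothetical.

```python
from typing import Dict, List, Set

# Bundle paths assumed (from earlier comments) to trust both the Puppet and PKI CAs.
NEW_BUNDLES: Set[str] = {
    "/etc/ssl/certs/wmf-ca-certificates.crt",
    "/etc/ssl/localcerts/wmf-java-cacerts",
}

# Truststore currently used by each Kafka logging client, as listed above.
CLIENT_TRUSTSTORES: Dict[str, str] = {
    "rsyslog": "/etc/ssl/certs/wmf-ca-certificates.crt",
    "logstash collectors": "/etc/ssl/localcerts/wmf-java-cacerts",
    "kafkatee on centrallog": "/etc/ssl/certs/wmf-ca-certificates.crt",
    "apifeatureusage collectors": "/etc/logstash/kafka_logging-eqiad.truststore.jks",
}


def clients_to_migrate(truststores: Dict[str, str], new_bundles: Set[str]) -> List[str]:
    """Return the clients whose truststore is not one of the new bundles."""
    return sorted(name for name, path in truststores.items() if path not in new_bundles)


print(clients_to_migrate(CLIENT_TRUSTSTORES, NEW_BUNDLES))
```

Under those assumptions the only client reported is the apifeatureusage collectors, matching the entry flagged with "!!!".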

Any I missed, @fgiunchedi @herron ?

Change 830684 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: use new kafka truststore

https://gerrit.wikimedia.org/r/830684

Change 830684 merged by Cwhite:

[operations/puppet@production] apifeatureusage: use new kafka truststore

https://gerrit.wikimedia.org/r/830684

apifeatureusage now using the new pki truststore and appears to be working.

Change 831096 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka on kafka-logging2001 to PKI TLS certificates

https://gerrit.wikimedia.org/r/831096

Change 831096 merged by Elukey:

[operations/puppet@production] Move kafka on kafka-logging2001 to PKI TLS certificates

https://gerrit.wikimedia.org/r/831096

kafka-logging2001 migrated to PKI, all good from what I can see in metrics!

Next steps:

  • wait a couple of days with the current config to see if anything comes up (rollback: revert https://gerrit.wikimedia.org/r/831096, run puppet on kafka-logging2001, restart kafka)
  • proceed with the other two codfw nodes
  • wait a little more (likely the Prague offsite)
  • move the eqiad cluster to PKI

@colewhite does it sound good?

Change 831588 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka on kafka-logging2002 to a PKI-based TLS cert

https://gerrit.wikimedia.org/r/831588

Change 831588 merged by Elukey:

[operations/puppet@production] Move kafka on kafka-logging2002 to a PKI-based TLS cert

https://gerrit.wikimedia.org/r/831588

Change 831831 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::logging: move kafka on all codfw nodes to PKI certificates

https://gerrit.wikimedia.org/r/831831

Change 831831 merged by Elukey:

[operations/puppet@production] role::kafka::logging: move kafka on all codfw nodes to PKI certificates

https://gerrit.wikimedia.org/r/831831

kafka logging codfw on PKI, all hosts moved!

Next step: kafka-logging-eqiad :)

Change 837621 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-logging2001 to PKI settings for TLS

https://gerrit.wikimedia.org/r/837621

Change 837621 merged by Elukey:

[operations/puppet@production] Move kafka-logging1001 to PKI settings for TLS

https://gerrit.wikimedia.org/r/837621

Change 838123 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-logging1002's Kafka TLS config to PKI

https://gerrit.wikimedia.org/r/838123

Change 838123 merged by Elukey:

[operations/puppet@production] Move kafka-logging1002's Kafka TLS config to PKI

https://gerrit.wikimedia.org/r/838123

Change 838643 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafka-logging1003 to the kafka PKI intermediate CA

https://gerrit.wikimedia.org/r/838643

Change 838650 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::kafka::logging: final clean up after migrating to PKI

https://gerrit.wikimedia.org/r/838650

Change 838643 merged by Elukey:

[operations/puppet@production] Move kafka-logging1003 to the kafka PKI intermediate CA

https://gerrit.wikimedia.org/r/838643

Change 838650 merged by Elukey:

[operations/puppet@production] role::kafka::logging: final clean up after migrating to PKI

https://gerrit.wikimedia.org/r/838650

Both clusters are running PKI, and today I have also run the following cleanup steps:

  1. removed the old puppet cert's CN from the kafka super.users config (https://gerrit.wikimedia.org/r/838650). The super.users are used to establish who a broker can trust (and in this case, if a new broker can be trusted to be part of the cluster).
  2. removed the old keystore (containing the puppet cert) from all the nodes
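
For context on the first step: Kafka's super.users setting is a semicolon-separated list of principals (semicolons are used because certificate DNs may contain commas). The Python sketch below removes one principal from such a value. The principal names in the usage example are hypothetical and are not the real CNs used on these clusters; the actual change was made in puppet, not with a script.

```python
def remove_super_user(super_users: str, principal: str) -> str:
    """Return the super.users value with the given principal removed.

    Entries are separated by ';'. Surrounding whitespace and empty
    entries are ignored.
    """
    entries = [entry.strip() for entry in super_users.split(";") if entry.strip()]
    return ";".join(entry for entry in entries if entry != principal)


# Hypothetical principals, for illustration only.
before = "User:CN=old-puppet-broker;User:CN=pki-broker"
after = remove_super_user(before, "User:CN=old-puppet-broker")
```

Once the old principal is gone, a broker presenting the old puppet certificate is no longer treated as a super user, which is why this step was done only after every broker had moved to PKI.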

Roll restart of all brokers, everything worked nicely.

The work is basically completed. The only remaining step is to clean up the old kafka logging puppet certificate from puppet private (and revoke it from the puppet CA).

@colewhite I'll leave the last step to you if it is ok, so you can decide when to pull the trigger (maybe we can wait some days so we have a roll back plan ready to go in case something looks weird).

Change 879886 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] apifeatureusage: use new kafka truststore

https://gerrit.wikimedia.org/r/879886

The old certificates are now cleaned up. We'll want to perform a kafka restart to gain confidence in our changes.

Roll restart of both clusters completed! Checked the certs' expiry dates:

2001
            Not After : Sep 12 07:55:00 2023 GMT
2002
            Not After : Sep 13 09:48:00 2023 GMT
2003
            Not After : Sep 14 06:31:00 2023 GMT
2004
            Not After : Jan 19 15:15:00 2024 GMT
2005
            Not After : Jan 19 15:17:00 2024 GMT

1001
            Not After : Oct  4 07:07:00 2023 GMT
1002
            Not After : Oct  5 06:24:00 2023 GMT
1003
            Not After : Oct  5 07:46:00 2023 GMT

The last two nodes added in codfw (200[4,5]) have a different expiry date, so in the future we may need to roll restart them separately from the rest. We have monitors in place to alert us though, so no problem :)
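
The expiry dates above can be compared with a short Python sketch. The dates are copied from the listing; the host keys are the short numbers used there, and the function names are hypothetical. Note that `%b` parsing assumes an English (C) locale.

```python
from datetime import datetime
from typing import Dict, List

# "Not After" values copied from the listing above.
NOT_AFTER: Dict[str, str] = {
    "2001": "Sep 12 07:55:00 2023 GMT",
    "2002": "Sep 13 09:48:00 2023 GMT",
    "2003": "Sep 14 06:31:00 2023 GMT",
    "2004": "Jan 19 15:15:00 2024 GMT",
    "2005": "Jan 19 15:17:00 2024 GMT",
    "1001": "Oct  4 07:07:00 2023 GMT",
    "1002": "Oct  5 06:24:00 2023 GMT",
    "1003": "Oct  5 07:46:00 2023 GMT",
}

# Date layout of the "Not After" field as printed in the listing.
NOT_AFTER_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_not_after(value: str) -> datetime:
    """Parse a 'Not After' value into a naive datetime (GMT)."""
    return datetime.strptime(value, NOT_AFTER_FORMAT)


def hosts_by_expiry(not_after: Dict[str, str]) -> List[str]:
    """Return hosts ordered from the earliest to the latest expiry."""
    return sorted(not_after, key=lambda host: parse_not_after(not_after[host]))


order = hosts_by_expiry(NOT_AFTER)
gap = parse_not_after(NOT_AFTER["2004"]) - parse_not_after(NOT_AFTER["2001"])
```

Here `order` starts with 2001 and ends with 2004 and 2005, and `gap.days` is 129, so the two newer codfw brokers expire roughly four months after the first one.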

From the SAL:

!log Clean old puppet certs kafka_logging-{eqiad,codfw}_broker from the Puppet CA and from Puppet private - T300130