
Fix TLS certificate location and expiry for Hadoop/Presto/etc. and add alarms on TLS cert expiry
Open, High, Public

Description

There are currently some problems with our (self-signed) TLS certs for Hadoop/Presto/etc.:

  • The creator of the certs (yours truly) didn't use cergen correctly in the Puppet private repo, since dedicated directories were created. The main problem is that people running cergen in the base directory of all the certs get warnings about some of them being absent. I tried to move them from their dedicated location to another one, but this also requires changing the name of the dedicated self-signed CA from root_ca to something-hadoop/presto-root_ca for example, which is not an idempotent change. The name of the CA in the cergen yaml file is used in the CA cert itself, so this might be problematic if we want to regenerate some files (tried and failed). We could simply create new certs, deploy them and deprecate the old ones.
  • The CA cert expires in October 2020, so we need to regenerate it with a longer validity than this. We could simply stop Yarn/Presto for a moment, regenerate all the certs, and deploy them. Important note: the puppet paths probably need to be changed as well, better to triple check.
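
Since the task also asks for alarms on cert expiry, here is a minimal sketch of what such a check could look like. The function name `check_cert_expiry` and the 30-day default are hypothetical, and the exit codes follow the Nagios/Icinga plugin convention (0 = OK, 2 = CRITICAL), which is an assumption about the monitoring side rather than something stated in this task.

```shell
# Hypothetical helper: report whether a PEM certificate expires within N days.
# Exit status: 0 = OK, 2 = CRITICAL (Nagios/Icinga plugin convention, assumed).
check_cert_expiry() {
    local cert="$1" days="${2:-30}"
    # -checkend takes seconds and exits non-zero if the certificate
    # expires within that window (or cannot be read).
    if openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
        echo "OK: $cert is valid for more than $days days"
        return 0
    fi
    echo "CRITICAL: $cert expires within $days days ($(openssl x509 -in "$cert" -noout -enddate 2>/dev/null))"
    return 2
}
```

It could be called as `check_cert_expiry /path/to/ca.crt.pem 60`, where the path is a placeholder for wherever the CA or host certificate lives.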

Details

Project            Branch      Lines +/-  Subject
operations/puppet  production  +0 -20
operations/puppet  production  +15 -66
operations/puppet  production  +33 -23
operations/puppet  production  +2 -2
operations/puppet  production  +1 -0
operations/puppet  production  +1 -0
operations/puppet  production  +20 -74
operations/puppet  production  +1 -1
operations/puppet  production  +16 -81
operations/puppet  production  +38 -19
operations/puppet  production  +7 -21
operations/puppet  production  +38 -9
operations/puppet  production  +18 -15
operations/puppet  production  +0 -32
operations/puppet  production  +24 -10
operations/puppet  production  +74 -28
operations/puppet  production  +1 -1
operations/puppet  production  +20 -0
operations/puppet  production  +8 -3
operations/puppet  production  +2 -0
operations/puppet  production  +10 -0
operations/puppet  production  +30 -0
operations/puppet  production  +0 -0
operations/puppet  production  +1 -0
operations/puppet  production  +22 -4
operations/puppet  production  +39 -1

Event Timeline

elukey created this task. May 29 2020, 6:34 AM
Milimetric assigned this task to elukey. Jun 4 2020, 4:11 PM
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Security Maturity and Data Privacy on the Analytics board.

I recently discovered that we have base::expose_puppet_certs in puppet. The class is aimed to copy the host's puppet TLS certificate and the PuppetCA certificate to a location on disk with certain permissions. Maybe we could have something similar for Java truststores/keystores as well?

@MoritzMuehlenhoff @jbond if you have time, I have an idea to discuss (I see that there is something moving for PKI so this could be relevant/not-needed/etc.). For Hadoop we need some TLS certificates that all the daemons use, to allow encrypted comms between workers etc. I need to refresh all certs before the beginning of October, since when I created them (via cergen) I forgot to set a longer expiry time for the root CA. While working on it, I realized that in theory I could re-use the puppet host certificates, since at the moment every worker just holds a TLS cert for its hostname (signed by a custom root CA created by cergen).

The caveat is that Java requires TLS certs and private keys to be wrapped in truststores/keystores. After seeing base::expose_puppet_certs I am wondering if it would be feasible to have a puppet class that creates the keystores/truststores on the fly if needed (with proper permissions etc.), to avoid the need for a custom solution via cergen. Does it sound crazy or doable? (even in light of the new PKI infra that is in progress).

Change 623361 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] sslcert::x509_to_pkcs12: add define for creating p12 files

https://gerrit.wikimedia.org/r/623361

Change 623362 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] base::puppet: add ability to create p12 puppet cert

https://gerrit.wikimedia.org/r/623362

Change 623363 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] puppet ssl p12: enable generation of puppet p12 cert on test cluster

https://gerrit.wikimedia.org/r/623363

jbond added a comment. Aug 31 2020, 3:26 PM

> I recently discovered that we have base::expose_puppet_certs in puppet. The class is aimed to copy the host's puppet TLS certificate and the PuppetCA certificate to a location on disk with certain permissions. Maybe we could have something similar for Java truststores/keystores as well?

I have created some patches to create a p12 file from the puppet public/private key pair. We could expand this to create a JKS; however, it appears that Java can read p12 files, so this may be enough? If so, I can update expose_puppet_certs to also expose the p12.
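
For readers unfamiliar with the conversion, this is roughly what bundling a PEM cert/key pair into a p12 file looks like with plain openssl. It is a sketch only, not the implementation of the sslcert::x509_to_pkcs12 define; the function name `x509_to_p12` and its argument order are made up for illustration.

```shell
# Hypothetical helper: bundle a PEM certificate and its private key into a
# PKCS#12 file that a JVM can load as a keystore.
x509_to_p12() {
    local cert="$1" key="$2" out="$3" password="$4"
    # The output contains the private key, so create it with a restrictive
    # umask. Note that "pass:" exposes the password in the process list;
    # openssl also accepts "env:" and "file:" sources.
    (
        umask 077
        openssl pkcs12 -export \
            -in "$cert" -inkey "$key" \
            -name "$(basename "$cert" .pem)" \
            -out "$out" -passout "pass:$password"
    )
}
```

The resulting file can be inspected with `openssl pkcs12 -in host.p12 -nokeys`, which prints the certificates it contains.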

> I see that there is something moving for PKI so this could be relevant/not-needed/etc.

Yes indeed, I'm looking at setting up a cfssl server to provide better automation infrastructure for PKI in general; however, I think it would be quite tight for us to commit to having something ready by October.

> the CA's cert expiry is October 2020, we need to regenerate it to something longer than this.

You should be able to extend the expiry by re-signing the CSR, similar to how I extended the puppet CA. This would mean you only need to send out the new public root CA and not regenerate all certificates.
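
A minimal sketch of that re-signing approach with plain openssl follows. It assumes a self-signed CA whose private key is still available; the function name, the 3650-day default and the extension set are illustrative assumptions, not the procedure used for the puppet CA. Existing leaf certificates keep validating only because the key and subject stay the same.

```shell
# Hypothetical helper: re-issue a self-signed CA certificate with the same
# key and subject but a later expiry.
renew_self_signed_ca() {
    local old_cert="$1" key="$2" new_cert="$3" days="${4:-3650}"
    local workdir rc=0
    workdir="$(mktemp -d)" || return 1
    # Extensions are not carried over by default, so restate the CA ones
    # (assumed set; adjust to match the original certificate).
    cat > "$workdir/ext.cnf" <<'EOF'
[v3_ca]
basicConstraints = critical,CA:TRUE
keyUsage = critical,keyCertSign,cRLSign
subjectKeyIdentifier = hash
EOF
    # Step 1: turn the existing certificate back into a CSR (keeps the subject).
    openssl x509 -x509toreq -in "$old_cert" -signkey "$key" \
        -out "$workdir/ca.csr" || rc=1
    # Step 2: self-sign that CSR again with the same key and a new validity.
    if [ "$rc" -eq 0 ]; then
        openssl x509 -req -in "$workdir/ca.csr" -signkey "$key" \
            -days "$days" -extfile "$workdir/ext.cnf" -extensions v3_ca \
            -out "$new_cert" || rc=1
    fi
    rm -rf "$workdir"
    return "$rc"
}
```

After renewal, `openssl verify -CAfile new_ca.pem host.pem` is a quick way to confirm that an existing host certificate still chains to the re-issued CA.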

The discussion of whether to re-use the puppet certs is something I think needs to happen on a more general level, although it's not worth having that discussion until we know what the new PKI setup looks like. That said, please ping me on IRC if you want to chat about this more.

elukey added a comment. Sep 1 2020, 6:41 AM

>> I recently discovered that we have base::expose_puppet_certs in puppet. The class is aimed to copy the host's puppet TLS certificate and the PuppetCA certificate to a location on disk with certain permissions. Maybe we could have something similar for Java truststores/keystores as well?

> I have created some patches to create a p12 file from the puppet public/private key pair. We could expand this to create a JKS; however, it appears that Java can read p12 files, so this may be enough? If so, I can update expose_puppet_certs to also expose the p12.

Yes it may be enough, I'll test it on the Hadoop test cluster. IIUC in the patch you "package" the private key into a p12 file, but I'd also need another one containing the puppetCA's certificate. Is it possible or do you prefer that this code lives elsewhere?

>> I see that there is something moving for PKI so this could be relevant/not-needed/etc.

> Yes indeed, I'm looking at setting up a cfssl server to provide better automation infrastructure for PKI in general; however, I think it would be quite tight for us to commit to having something ready by October.

>> the CA's cert expiry is October 2020, we need to regenerate it to something longer than this.

> You should be able to extend the expiry by re-signing the CSR, similar to how I extended the puppet CA. This would mean you only need to send out the new public root CA and not regenerate all certificates.

Thanks!

> The discussion of whether to re-use the puppet certs is something I think needs to happen on a more general level, although it's not worth having that discussion until we know what the new PKI setup looks like. That said, please ping me on IRC if you want to chat about this more.

jbond added a comment. Sep 1 2020, 9:54 AM

> Yes it may be enough, I'll test it on the Hadoop test cluster. IIUC in the patch you "package" the private key into a p12 file, but I'd also need another one containing the puppetCA's certificate. Is it possible or do you prefer that this code lives elsewhere?

Yes, I have updated the CR to include the CA cert.

Change 625624 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::java: add param to toggle puppet ca trust

https://gerrit.wikimedia.org/r/625624

Change 625625 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] role:idp_test: add the puppet CA to the java truststore

https://gerrit.wikimedia.org/r/625625

Change 625623 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] java: add define to update the java trust store

https://gerrit.wikimedia.org/r/625623

Change 625630 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] role:idp_test: add remove puppet CA from the java truststore

https://gerrit.wikimedia.org/r/625630

Change 625631 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::java: add the puppet CA cert to the java truststore by default

https://gerrit.wikimedia.org/r/625631

Change 625623 merged by Jbond:
[operations/puppet@production] java: add define to update the java trust store

https://gerrit.wikimedia.org/r/625623

Change 625624 merged by Jbond:
[operations/puppet@production] profile::java: add param to toggle puppet ca trust

https://gerrit.wikimedia.org/r/625624

Change 625625 merged by Jbond:
[operations/puppet@production] role:idp_test: add the puppet CA to the java truststore

https://gerrit.wikimedia.org/r/625625

Change 625630 merged by Jbond:
[operations/puppet@production] role:idp_test: add remove puppet CA from the java truststore

https://gerrit.wikimedia.org/r/625630

Change 623361 merged by Jbond:
[operations/puppet@production] sslcert::x509_to_pkcs12: add define for creating p12 files

https://gerrit.wikimedia.org/r/623361

Change 623362 merged by Jbond:
[operations/puppet@production] base::puppet: add ability to create p12 puppet cert

https://gerrit.wikimedia.org/r/623362

Change 623363 merged by Jbond:
[operations/puppet@production] puppet ssl p12: enable generation of puppet p12 cert on test cluster

https://gerrit.wikimedia.org/r/623363

Change 626125 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] java::cacert: -cacerts is not supported in java 8

https://gerrit.wikimedia.org/r/626125

Change 626125 merged by Jbond:
[operations/puppet@production] java::cacert: -cacerts is not supported in java 8

https://gerrit.wikimedia.org/r/626125

Change 626137 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] base::expose_puppet_certs: add ability to expose p12 cert

https://gerrit.wikimedia.org/r/626137

Change 626137 merged by Jbond:
[operations/puppet@production] base::expose_puppet_certs: add ability to expose p12 cert

https://gerrit.wikimedia.org/r/626137

Change 625631 merged by Jbond:
[operations/puppet@production] profile::java: add the puppet CA cert to the java truststore by default

https://gerrit.wikimedia.org/r/625631

> I recently discovered that we have base::expose_puppet_certs in puppet. The class is aimed to copy the host's puppet TLS certificate and the PuppetCA certificate to a location on disk with certain permissions. Maybe we could have something similar for Java truststores/keystores as well?

I have now added a few things to make this possible:

  1. The ability to add additional CAs to the Java trust store, which is now used to add the puppet CA to the trust store by default.
  2. A resource to convert an x509 cert pair to a p12 file, and the ability to convert the puppet certs to a p12 by setting base::puppet::export_p12: true.
  3. An update to base::expose_puppet_certs to support exposing the p12 certificate.

This means that if a daemon starts as root and you configure base::puppet::export_p12: true, you should be able to use the file in "${facts['puppet_config']['ssldir']}/private/${facts['fqdn']}.p12". If the daemon doesn't run as root, you can use the base::expose_puppet_certs resource with provide_p12 => true, e.g.

base::expose_puppet_certs { '/etc/service_dir':
    provide_p12 => true,
    user        => $service_user,
    group       => $service_group,
}

Forgot to say: the role analytics_test_cluster::coordinator is currently exporting the p12.

Change 628084 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Use local puppet host certificates for Hadoop Test's TLS encryption

https://gerrit.wikimedia.org/r/628084

Change 628084 merged by Elukey:
[operations/puppet@production] Use local puppet host certificates for Hadoop Test's TLS encryption

https://gerrit.wikimedia.org/r/628084

Change 628141 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: refactor puppet TLS cert deployment

https://gerrit.wikimedia.org/r/628141

Change 628141 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: refactor puppet TLS cert deployment

https://gerrit.wikimedia.org/r/628141

The new settings are working on the Testing cluster as far as I can see, really nice!

Procedure-wise, this is what I'd do:

  1. As a prerequisite, check that the puppet CA is listed correctly in the default truststore of the Hadoop hosts.
     elukey@analytics1031:~$ echo changeit | keytool -list -v -keystore /etc/ssl/certs/java/cacerts | grep puppet
     Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
     Enter keystore password:  Alias name: debian:puppet_internal_ca.pem
     Alias name: wmf:puppetca.pem
  2. Deploy the change to all the hosts.
  3. Roll restart the node managers (shufflers use TLS).
  4. Roll restart the journalnodes.

Steps 3 and 4 should be fine since the masters do trust the puppet CA (see step 1). Other daemons use TLS for their UIs (like datanodes, etc.) but it is probably not worth restarting them for this (waiting instead for the next round of restarts).
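
The prerequisite check in step 1 could be scripted so that it can run on every host before the rollout. The sketch below only parses a `keytool -list -v` listing from stdin; the function name `truststore_has_alias` is hypothetical.

```shell
# Hypothetical helper: succeed if the keytool listing read from stdin
# contains an alias matching the given pattern (case-insensitive).
truststore_has_alias() {
    grep -qi "Alias name: .*$1"
}
```

A possible invocation, using `-storepass` instead of piping the password, is `keytool -list -v -keystore /etc/ssl/certs/java/cacerts -storepass changeit | truststore_has_alias puppet`.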

Change 628781 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::base::puppet: remove export_p12 function

https://gerrit.wikimedia.org/r/628781

Change 628783 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] base::expose_puppet_certs: update p12 interface

https://gerrit.wikimedia.org/r/628783

Change 628787 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::hadoop::common: migrate to base_exspose_puppet_cert

https://gerrit.wikimedia.org/r/628787

Change 628782 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] sslcert::x509_to_p12: add owner/group parameters to manage the p12 file

https://gerrit.wikimedia.org/r/628782

Change 628781 merged by Jbond:
[operations/puppet@production] profile::base::puppet: remove export_p12 function

https://gerrit.wikimedia.org/r/628781

Change 628782 merged by Jbond:
[operations/puppet@production] sslcert::x509_to_p12: add owner/group parameters to manage the p12 file

https://gerrit.wikimedia.org/r/628782

Change 628783 merged by Jbond:
[operations/puppet@production] base::expose_puppet_certs: update p12 interface

https://gerrit.wikimedia.org/r/628783

Change 628787 merged by Jbond:
[operations/puppet@production] profile::hadoop::common: migrate to base_exspose_puppet_cert

https://gerrit.wikimedia.org/r/628787

Change 628829 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::presto::server: allow the usage of local puppet TLS certs

https://gerrit.wikimedia.org/r/628829

Change 628829 merged by Elukey:
[operations/puppet@production] profile::presto::server: allow the usage of local puppet TLS certs

https://gerrit.wikimedia.org/r/628829

Change 628850 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] presto: use puppet host TLS certificates by default

https://gerrit.wikimedia.org/r/628850

Change 628850 merged by Elukey:
[operations/puppet@production] presto: use puppet host TLS certificates by default

https://gerrit.wikimedia.org/r/628850

Presto does not seem to work with the new pkcs12 config, so I opened: https://github.com/prestodb/presto/issues/15207

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board. Sep 23 2020, 3:55 PM

Change 629663 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: use only TLS puppet certs

https://gerrit.wikimedia.org/r/629663

Change 629663 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: use only TLS puppet certs

https://gerrit.wikimedia.org/r/629663

Mentioned in SAL (#wikimedia-operations) [2020-09-24T13:22:53Z] <elukey> moved the hadoop cluster to puppet TLS certificates - T253957

Change 630078 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::hadoop::common: cleanup after new TLS settings

https://gerrit.wikimedia.org/r/630078

Change 630078 merged by Elukey:
[operations/puppet@production] profile::hadoop::common: cleanup after new TLS settings

https://gerrit.wikimedia.org/r/630078

I have removed all unused certificates/configs from the puppet private repository, all good.

The remaining step is to figure out why https://github.com/prestodb/presto/issues/15207 happens, and move Presto to puppet TLS certificates. Presto's self-signed CA expires in 2021, so we should be relatively good for the moment, but better to fix this use case sooner rather than later.

Change 635255 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::coordinator: force presto to use puppet tls certs

https://gerrit.wikimedia.org/r/635255

Change 635255 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::coordinator: force presto to use puppet tls certs

https://gerrit.wikimedia.org/r/635255

Change 635261 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::analytics_cluster::presto::server: force presto to use puppet TLS certs

https://gerrit.wikimedia.org/r/635261

Change 635261 merged by Elukey:
[operations/puppet@production] role::analytics_cluster::presto::server: force presto to use puppet TLS certs

https://gerrit.wikimedia.org/r/635261

Change 635267 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] presto: revert TLS settings

https://gerrit.wikimedia.org/r/635267

Change 635267 merged by Elukey:
[operations/puppet@production] presto: revert TLS settings

https://gerrit.wikimedia.org/r/635267

The main problem with Presto seems to be that the puppet CA is not picked up as a trusted source, so any TLS connection attempt between worker and coordinator fails. This is the debug output before (09:36) and after (09:59) the switch:

Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.460Z        INFO        main        stdout        adding as trusted cert:
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.460Z        INFO        main        stdout          Subject: CN=an-presto1001.eqiad.wmnet
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.460Z        INFO        main        stdout          Issuer:  ST=CA, C=US, CN=root_ca
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.460Z        INFO        main        stdout          Algorithm: EC; Serial number: 0x47529ed5a0041964714b33f459947c340cba8000
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.460Z        INFO        main        stdout          Valid from Mon Feb 10 08:31:43 UTC 2020 until Tue Feb 09 08:31:37 UTC 2021
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.460Z        INFO        main        stdout        adding as trusted cert:
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.461Z        INFO        main        stdout          Subject: ST=CA, C=US, CN=root_ca
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.461Z        INFO        main        stdout          Issuer:  ST=CA, C=US, CN=root_ca
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.461Z        INFO        main        stdout          Algorithm: RSA; Serial number: 0x323ae0e41931a06193a06da1202a16a2b13b16bb
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.461Z        INFO        main        stdout          Valid from Mon Feb 10 08:31:37 UTC 2020 until Tue Feb 09 08:31:37 UTC 2021
Oct 20 09:36:03 an-presto1001 presto-server[13270]: 2020-10-20T09:36:03.461Z        INFO        main        stdout        System property jdk.tls.client.cipherSuites is set to 'null'
Oct 20 09:59:58 an-presto1001 presto-server[45439]: 2020-10-20T09:59:58.878Z        INFO        main        stdout        adding as trusted cert:
Oct 20 09:59:58 an-presto1001 presto-server[45439]: 2020-10-20T09:59:58.878Z        INFO        main        stdout          Subject: CN=an-presto1001.eqiad.wmnet
Oct 20 09:59:58 an-presto1001 presto-server[45439]: 2020-10-20T09:59:58.878Z        INFO        main        stdout          Issuer:  CN=Puppet CA: palladium.eqiad.wmnet
Oct 20 09:59:58 an-presto1001 presto-server[45439]: 2020-10-20T09:59:58.878Z        INFO        main        stdout          Algorithm: RSA; Serial number: 0x14e4
Oct 20 09:59:58 an-presto1001 presto-server[45439]: 2020-10-20T09:59:58.878Z        INFO        main        stdout          Valid from Tue Sep 17 13:36:12 UTC 2019 until Mon Sep 16 13:36:12 UTC 2024

The keystore created via cergen contains the CA certificate inside it (the self-signed root_ca in the output above), whereas the pkcs12/p12 one contains only the host certificate and not the Puppet CA. In theory this shouldn't be a problem since the Puppet CA should be trusted by the JVM itself, but for some reason it is not picked up by Presto.
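
A quick way to confirm this difference without JVM debug logging is to count the certificates inside a PKCS#12 bundle. This is a sketch with a hypothetical function name; it only applies to p12 files, not to JKS keystores.

```shell
# Hypothetical helper: print how many certificates a PKCS#12 bundle carries.
# A bundle with only the host certificate prints 1; one that also embeds
# its CA prints 2 or more.
p12_cert_count() {
    local p12="$1" password="$2"
    openssl pkcs12 -in "$p12" -passin "pass:$password" -nokeys 2>/dev/null |
        grep -c -- '-----BEGIN CERTIFICATE-----'
}
```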

If I add:

-Djavax.net.ssl.trustStore=/etc/ssl/certs/java/cacerts
-Djavax.net.ssl.trustStorePassword=changeit

Then I see from the debug logging that the default certs are picked up, including the puppet ones. I think that the script that launches java for presto does not take into account the default truststore.
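
Presto reads its JVM options from a jvm.config file with one option per line, so one way to apply the two flags above is to append them there if they are missing. This is a sketch only: the function name is hypothetical, the config path in the usage note is a placeholder, and where the flags were actually set in production is not stated in this task.

```shell
# Hypothetical helper: append a JVM option to a jvm.config-style file
# (one option per line) unless that exact line is already present.
ensure_jvm_flag() {
    local config="$1" flag="$2"
    # -x matches whole lines, -F treats the flag as a literal string,
    # and -- stops option parsing because the flag starts with a dash.
    grep -qxF -- "$flag" "$config" 2>/dev/null || printf '%s\n' "$flag" >> "$config"
}
```

For example, `ensure_jvm_flag /path/to/presto/etc/jvm.config -Djavax.net.ssl.trustStore=/etc/ssl/certs/java/cacerts`, followed by the same call for the trustStorePassword flag and a Presto restart.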

Change 635289 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] presto: use puppet TLS certificates instead of the self signed ones

https://gerrit.wikimedia.org/r/635289

Change 635289 merged by Elukey:
[operations/puppet@production] presto: use puppet TLS certificates instead of the self signed ones

https://gerrit.wikimedia.org/r/635289

Change 635299 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] presto: remove unused code

https://gerrit.wikimedia.org/r/635299

Change 635299 merged by Elukey:
[operations/puppet@production] presto: remove unused code

https://gerrit.wikimedia.org/r/635299

Change 635495 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] superset: remove presto TLS config (not needed anymore)

https://gerrit.wikimedia.org/r/635495

Change 635495 merged by Elukey:
[operations/puppet@production] superset: remove presto TLS config (not needed anymore)

https://gerrit.wikimedia.org/r/635495

Documentation updated, finally the task is done!

elukey moved this task from In Progress to Done on the Analytics-Kanban board. Wed, Oct 21, 9:09 AM