
Unify WMF internal CA certs bundle generation
Closed, Resolved · Public

Description

There seem to be two different ways of retrieving, on all production hosts, the bundle containing the Puppet CA cert and the Root PKI cert:

  1. The wmf-certificates package uses update-ca-certificates at install time to generate /etc/ssl/certs/wmf-ca-certificates.crt.
  2. The profile::base::certificates class uses two crt files provided by wmf-certificates (/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt and /usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt) to generate /etc/ssl/localcerts/wmf_trusted_root_CAs.pem (basically a concat of the two files).
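For illustration, the second mechanism boils down to concatenating the two shipped certs into one PEM. A minimal, runnable sketch of that idea, using a temp dir and dummy certificate stand-ins (the real paths and cert contents are production-specific):

```shell
# Sketch of what profile::base::certificates effectively does: concatenate the
# two CA certs shipped by wmf-certificates into a single PEM bundle. Paths
# mirror the task description but are relocated under a temp dir, and the
# cert bodies are dummies, so this runs anywhere.
set -eu
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/share/ca-certificates/wikimedia" "$tmp/etc/ssl/localcerts"
printf -- '-----BEGIN CERTIFICATE-----\npuppet-ca\n-----END CERTIFICATE-----\n' \
  > "$tmp/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt"
printf -- '-----BEGIN CERTIFICATE-----\nroot-pki\n-----END CERTIFICATE-----\n' \
  > "$tmp/usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt"
# "Basically a concat of the two files":
cat "$tmp/usr/share/ca-certificates/wikimedia/Puppet_Internal_CA.crt" \
    "$tmp/usr/share/ca-certificates/wikimedia/Wikimedia_Internal_Root_CA.crt" \
  > "$tmp/etc/ssl/localcerts/wmf_trusted_root_CAs.pem"
grep -c 'BEGIN CERTIFICATE' "$tmp/etc/ssl/localcerts/wmf_trusted_root_CAs.pem"  # prints 2
```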

While working on moving Kafka clients to /etc/ssl/localcerts/wmf_trusted_root_CAs.pem, we realized that in the Kubernetes use case it is better to rely on /etc/ssl/certs/wmf-ca-certificates.crt rather than injecting the bundle to helmfile configs (to make it available to Helm).

The profile::base::certificates class also does other important things:

  1. Creates PKCS12 bundles if needed (using openssl to bundle /etc/ssl/localcerts/wmf_trusted_root_CAs.pem into a .p12 file)
  2. Creates Java Truststore bundles if needed. Caveat: the Java keytool command, used to generate the truststore, doesn't accept chained cert files; every certificate needs to be added with a separate call to keytool (otherwise it doesn't work).
  3. We need to support Cloud environments too (like deployment-prep), where test PKI instances are deployed. The profile::base::certificates class supports this use case; it is currently configured to use the right Puppet CA and PKI bundle where needed (only deployment-prep for the moment).
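To illustrate the keytool caveat in point 2: a chained PEM has to be split into individual certificates first, one keytool call per cert. The sketch below only splits a dummy chain (so it runs anywhere); the keytool/openssl invocations are shown as comments, and the aliases, passwords and output paths in them are placeholders, not the profile's actual values:

```shell
# Split a chained PEM into one file per certificate, the prerequisite for
# per-cert keytool imports. The chain here is a dummy stand-in.
set -eu
tmp=$(mktemp -d)
for name in puppet-ca root-pki; do
  printf -- '-----BEGIN CERTIFICATE-----\n%s\n-----END CERTIFICATE-----\n' "$name"
done > "$tmp/wmf_trusted_root_CAs.pem"
# One output file per certificate (-z drops the empty leading chunk).
csplit -s -z -f "$tmp/cert_" "$tmp/wmf_trusted_root_CAs.pem" \
  '/-----BEGIN CERTIFICATE-----/' '{*}'
ls "$tmp"/cert_* | wc -l   # prints 2
# Then, per certificate (placeholder alias/password/paths):
#   keytool -importcert -noprompt -alias "wmf_ca_$n" \
#     -keystore truststore.jks -storepass changeit -file "$tmp/cert_$n"
# The PKCS12 bundle, by contrast, takes the whole chain in one call:
#   openssl pkcs12 -export -nokeys -in "$tmp/wmf_trusted_root_CAs.pem" \
#     -out bundle.p12 -passout pass:changeit
```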

It would be nice to use /etc/ssl/certs/wmf-ca-certificates.crt as much as possible in production, to avoid diverging too much from Kube-land. At the same time, both ways of doing things are needed if we want to support Cloud environments.

Event Timeline

I am wondering what is best to do for use cases like:

Basically all librdkafka-based clients need a .pem with a chain of certificates to trust, but we also need to be able to make them work in cloud too. The wmf_trusted_root_CAs.pem seems like the better fit for the general use case, but I am open to suggestions :)

Like most things cloud-related, I think it's worth splitting this a bit further and saying that we have three scenarios to work with: production, cloud, and deployment-prep. It is worth noting that for the majority of cloud this is not an issue, as those projects do not have a working PKI set up. This mostly affects deployment-prep and is yet another consequence of not having a real staging environment; although I don't want to go down that route, it is worth mentioning briefly, if only to point out that PKI already has differences when it comes to deployment-prep. Specifically, we only have one intermediate configured in deployment-prep; we don't have intermediates for kafka, discovery or debmonitor. As such, there is other work needed to get things working properly in deployment-prep.

Further, it is my personal view that things not working in deployment-prep shouldn't block changes to production. This is a bit counterintuitive for something that is supposed to be a staging environment, but there are so many differences, and at any one point so many things broken, that it is not currently practical.

OK, that said, I think that:

  • For the majority of cloud we just need to make sure this is a no-op. AFAIK that is currently the case because:
    • we don't install wmf-certificates in cloud
    • profile::base::certificates::trusted_certs: [] which means no combined certificate is used
  • For deployment-prep I think using profile::base::certificates and the $trusted_certs array is fine
  • For production we could:
    • use profile::base::certificates and wmf-certificates
    • only use wmf-certificates and create a new variable pointing to the
Use both profile::base::certificates and wmf-certificates

The benefit of this is that the shared CA certificate will be in the same location regardless of whether we are in deployment-prep or production. It also means that we have very little to change. The downside of this approach is that we have a bit of duplication, i.e. we would have the same file in both /etc/ssl/localcerts/wmf_trusted_root_CAs.pem and /etc/ssl/certs/wmf-ca-certificates.crt

Only wmf-certificates in production

The benefit of this is that we have no duplication; however, the downside is that the shared CA bundle is in a different location in deployment-prep and cloud, so we would need to track the location in a new variable, e.g. profile::base::certificates::trusted_ca::path. This would mean that for changes like 739463 you would need to do the following

ssl.ca.location="<%= scope.lookupvar('profile::base::certificates::trusted_ca::path') %>"

and for 739806

hieradata/hosts/cp3050.yaml
profile::cache::kafka::certificate::ssl_ca_location: "%{lookup('profile::base::certificates::trusted_ca::path')}"

Third option i just thought of

While writing this it occurred to me that there is a third option: we could simply create a wmf-certificates-beta package which includes the CA bundle needed for deployment-prep. From the puppet side of things I think this is the simplest, and the more I think about it, the more I think it's the route that makes the most sense. I would expect that deployment-prep will, at some point, get its own k8s environment which mirrors production, and at that point we will have the same problem that led to the creation of the wmf-certificates package in the first place.

Adding some notes:

  1. A big use case for profile::base::certificates is to create a JKS truststore for Java, which requires an entry for every trusted certificate. The wmf-certificates package is currently creating the bundle as a hook at install time, IIUC, so we can't really do the same for JKS (since it requires openjdk dependencies that we don't have on all nodes).
  2. In the "only wmf-certificates" option, IIUC we should create profile::base::certificates::trusted_ca::path, but what value would it be for, say, deployment-prep? Do we have a bundle (Puppet CA + PKI certs) that we can use? This is why I was referring to the current logic as a possible way to go, since it is transparent to realms and environments (up to what we configure in hiera, of course).

In the "only wmf-certificates" option, IIUC we should create profile::base::certificates::trusted_ca::path, but what value would it be for, say, deployment-prep?

It would be /etc/ssl/localcerts/wmf_trusted_root_CAs.pem, i.e. we would still use profile::base::certificates::trusted_ca to create the bundle in deployment-prep.

Thanks a lot @jbond for all the info. I have other questions/doubts in mind; I think that we are close to finding a solution, but I feel that some things need to be discussed first.

  1. p12/jks bundles

The wmf-certificates package currently relies on an install hook that executes a bash script to concatenate the Puppet CA and Root PKI certs, only after checking that they are allowed in /etc/ca-certificates.conf. If we wanted to add the automatic generation of .p12 or .jks files, we'd need java/openssl dependencies available on the host on which we are installing the package, which is not feasible. One solution would be to create the p12/jks bundles at package build time and basically ship them as static files (as we originally thought in the other task). The main inconsistency, in my opinion, would be that for the main .pem bundle we rely on /etc/ca-certificates.conf, and for p12/jks only on what was built. A simplification would be to avoid the install check and create the pem bundle at build time as well, but there are probably some use cases that I don't have in mind that need the install check.
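As a rough, runnable sketch of the install-hook behaviour described above: only certs enabled in /etc/ca-certificates.conf make it into the bundle. Paths are relocated under a temp dir and the conf/cert contents are dummies; the real hook's details may well differ:

```shell
# Hypothetical sketch: concatenate only the certs that are listed
# (uncommented) in a ca-certificates.conf-style allow list.
set -eu
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/share/ca-certificates/wikimedia"
for name in Puppet_Internal_CA Wikimedia_Internal_Root_CA Not_Allowed_CA; do
  printf -- '-----BEGIN CERTIFICATE-----\n%s\n-----END CERTIFICATE-----\n' "$name" \
    > "$tmp/usr/share/ca-certificates/wikimedia/$name.crt"
done
# Only the first two are enabled in the conf.
cat > "$tmp/ca-certificates.conf" <<'EOF'
wikimedia/Puppet_Internal_CA.crt
wikimedia/Wikimedia_Internal_Root_CA.crt
EOF
: > "$tmp/wmf-ca-certificates.crt"
for crt in "$tmp"/usr/share/ca-certificates/wikimedia/*.crt; do
  rel="wikimedia/$(basename "$crt")"
  # Whole-line, fixed-string match against the allow list.
  if grep -qxF "$rel" "$tmp/ca-certificates.conf"; then
    cat "$crt" >> "$tmp/wmf-ca-certificates.crt"
  fi
done
grep -c 'BEGIN CERTIFICATE' "$tmp/wmf-ca-certificates.crt"   # prints 2
```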

  2. wmf-certificates vs profile::base::certificates inconsistencies

As described above, the wmf-certificates package checks in /etc/ca-certificates.conf that all the "trusted certs" are allowed, whereas profile::base::certificates does not (it just trusts a list of certs).

  3. The wmf-certificates-beta solution

If we had a way to generate multiple packages from the same debian source (IIRC there should be the possibility), we could maintain the same logic between the packages, varying the certs to ship. I'd be in favor of a solution like this one, since it would allow us to simplify profile::base::certificates a lot, not keeping a lot of code only for the beta use case, but we need to solve problem 1) first in my opinion.

Thoughts?

Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too, and all environments have their own puppet master CAs.

A simplification would be to avoid the install check and create the pem bundle at build time as well, but there are probably some use cases that I don't have in mind that need the install check.

The reason I chose to do it that way was more that it is how it's done for the ca-certificates.crt bundle as well. I also wanted the wmf-ca-certificates bundle to be able to contain other wmf certificates that might be installed on the system.

If we had a way to generate multiple packages from the same debian source (IIRC there should be the possibility), we could maintain the same logic between the packages, varying the certs to ship. I'd be in favor of a solution like this one, since it would allow us to simplify profile::base::certificates a lot, not keeping a lot of code only for the beta use case, but we need to solve problem 1) first in my opinion.

That would be absolutely possible.

I was thinking about this today: in the Pontoon world (which is basically, from this point of view, the same problem we have for deployment-prep, replicated multiple times) it is probably not feasible to have a dedicated wmf-certificates package for all use cases. I am more inclined to proceed in this way:

  1. In production, use the wmf-certificates bundle as much as possible.
  2. Find a way to create a single source of truth called profile::base::certificates::trusted_ca::path or similar, so that we'll have a single value in puppet that varies by realm.
  3. Keep generating certs via profile::base::certificates for non-prod environments (Pontoon(s), deployment-prep, cloud, etc.)

Caveat: it is still not clear what to do in production when a p12/jks bundle is needed, though.

Does it make sense?

Just some early notes; I'll follow up with more in a bit.

p12/jks bundles

In this method we would still do the jks/p12 generation in puppet

As described above, the wmf-certificates package checks in /etc/ca-certificates.conf that all the "trusted certs" are allowed, whereas profile::base::certificates does not (it just trusts a list of certs).

I don't see this as an issue; production should have more protections, and dropping the additional checks is not a big deal. (Edit: further to this, both lists are ultimately managed by us, so it's not really a drop in protection/security.)

The wmf-certificates-beta solution

We could just have a wmf branch which has the wmf root instead of the production root

Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too,

In relation to this, I want to say that IMO having changes work in deployment-prep is a nice-to-have and shouldn't block production changes; having changes work in Pontoon should be even less blocking. Currently we have spent orders of magnitude more time trying to get things working with the non-production use cases, which is sub-optimal. In order to put some type of staging environment (deployment-prep or some type of Pontoon environment) in the critical path of change control, the staging environment needs to be treated and supported as a first-class citizen and should mirror production as closely as possible, so we don't have these issues.

and all environments have their own puppet master CAs.

Again, although this is nice to have, it is IMO very much *out of scope*. There are only a very limited number of projects in cloud that even make use of the cloud PKI infrastructure, and of the ones that do use it, I think only Pontoon and deployment-prep use it to a level where they would need this support. Ultimately cloud (including Pontoon and deployment-prep) is not a supported environment, and if you (the project owner) are doing your own thing and diverging from what production does, then you have some responsibility to make sure your environment is updated to keep pace with these changes.

Another use case, brought up this morning, is Pontoon - we should try to keep consistency in there too,

In relation to this, I want to say that IMO having changes work in deployment-prep is a nice-to-have and shouldn't block production changes; having changes work in Pontoon should be even less blocking. Currently we have spent orders of magnitude more time trying to get things working with the non-production use cases, which is sub-optimal. In order to put some type of staging environment (deployment-prep or some type of Pontoon environment) in the critical path of change control, the staging environment needs to be treated and supported as a first-class citizen and should mirror production as closely as possible, so we don't have these issues.

I disagree with this, John: Pontoon was a big effort to allow reusable testing environments for production, and it is now fully supported; if we introduce changes that break its compatibility with the production puppet code, we harm the project's value (IMHO). For deployment-prep I don't have a strong opinion, but I had to work on it since a lot of people consider it the de-facto staging environment and rely on it for testing (Kafka for eventgate; coal/navtiming/etc. for Performance). In my case, not supporting deployment-prep would have meant being blocked in moving some kafka clients to the new bundle for example, blocking the whole migration.

and all environments have their own puppet master CAs.

Again, although this is nice to have, it is IMO very much *out of scope*. There are only a very limited number of projects in cloud that even make use of the cloud PKI infrastructure, and of the ones that do use it, I think only Pontoon and deployment-prep use it to a level where they would need this support. Ultimately cloud (including Pontoon and deployment-prep) is not a supported environment, and if you (the project owner) are doing your own thing and diverging from what production does, then you have some responsibility to make sure your environment is updated to keep pace with these changes.

I don't agree on this one either. The code that you wrote for profile::base::certificates works flawlessly between prod/cloud/deployment-prep/pontoon, and it is flexible enough to allow different PKI services to be created and used. The approach using wmf-certificates doesn't, since it supports only production, and others need to adapt if they need it. So I don't see the value of dropping something that works across realms in favor of something else that works better only for production/k8s; I'd like to make the two converge in something useful for everybody.

All the time that we are spending reaching an agreement will be rewarded when we move a ton of services to PKI without anybody noticing that we are doing it :)

Change 741867 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:base::certificates: update support for trusted CA

https://gerrit.wikimedia.org/r/741867

I'm not sure this is the place to have this discussion; perhaps we should fork it to another task?

I disagree with this, John: Pontoon was a big effort to allow reusable testing environments for production

Don't get me wrong, I definitely appreciate the work that has gone into Pontoon, and I make a best effort to ensure it keeps working.

and it is now fully supported

Is it? Honestly, I have not seen anything to that effect, and from my PoV it has similar issues to deployment-prep in that it continues to break due to subtle differences.

if we introduce changes that break its compatibility with the production puppet code, we harm the project's value (IMHO)

In which case we need to improve all our tooling (PCC, CI, code review) to ensure breaking changes don't hit Pontoon, and we are not close to that. As someone who makes a lot of puppet changes, I can tell you that I break Pontoon just as much as I break cloud and deployment-prep. I'm not saying I do this intentionally, but we don't have the same level of checking to ensure this compatibility, and until we do, I (and others) will continue to break these systems and that compatibility.

In my case, not supporting deployment-prep would have meant being blocked in moving some kafka clients to the new bundle for example, blocking the whole migration.

Again, I appreciate that deployment-prep is very important, and like Pontoon and cloud in general I always go out of my way to try and ensure compatibility across all environments.
However, unless the organisation names and invests in a proper staging environment, and ensures it has support and compatibility as a first-class citizen, we will all be eating our own tails. Ultimately I think this is the key point here. We don't have a proper, officially supported staging environment (unless I missed an announcement on Pontoon, and if I did, there is still a lot of work to actually make it supported). What we have now:

  • Pontoon, created and supported by Filippo (I believe largely in his own time)
    • When things break here they often get noticed in ~1-7 days, depending on where the breakage is
    • fixes are generally made by myself or Filippo
    • no CI considerations
    • no support in PCC
    • not considered during review (unless Filippo is a reviewer)
  • deployment-prep: I honestly have no idea who owns this
    • When things break here they often get noticed in 1 day to never, depending on where the breakage is
    • fixes are generally made by myself, a volunteer, or someone familiar with the affected code
    • Some CI checks which specifically test for the cloud case
    • Has support in PCC, but with some issues
    • considered during review depending on the reviewers (often not considered)
  • WIP dynamic environments
    • this is not really a staging environment, but I mention it here as it is yet another way users could do something similar to Pontoon/deployment-prep

Ultimately we now have two sort-of staging environments, both with their own subtle nuances, and neither fully supported in our change review process. My personal view is that we should start from a completely fresh page and build a staging environment using the lessons learnt from both deployment-prep and Pontoon, ideally integrating the dynamic environments work. Whatever we build should be a much closer match to our real production environment, so that we can spend our effort on actually writing and testing code for production instead of writing and testing workarounds to deal with environment subtleties.

I don't agree on this one either.

Well, this is simply not the case. The PKI solution has not been designed with multi-tenancy in mind; it has some hacks so that it can work with environments that are meant to replicate production, but it is in no way anywhere near being a multi-tenancy PKI solution for cloud projects in general. The cfssl module should work in cloud, and users should be able to spin up their own PKI server with their own config. However, the profiles are very much designed to work with our production environment, and they make that assumption in some of the code paths. As such, if someone wants to use the profile code in their own project, then it is up to them to make sure they are aware of and implement such assumptions, and also keep track of changes. It is not up to me (as the maintainer of the PKI code) to be aware of every project that might have implemented the PKI profile, validate how they have implemented it, and ensure I remain compatible with their assumptions (along with my own).

This is, in my opinion, a critical point. In an ideal world, core modules (all but role and profile) should be flexible and should work in the general case; however, profiles by definition are specific to an application or environment, and roles even more so. While I think profiles should be flexible and we should be able to toggle all the various bits via hiera, ultimately I think it's reasonable to assume they are only supported in production (and some officially supported staging environments); they should not be expected to work in the general case.

The code that you wrote for profile::base::certificates works flawlessly between prod/cloud/deployment-prep/pontoon, and it is flexible enough to allow different PKI services to be created and used.

Kind words; however, that is with a few hacks, which work well for things trying to replicate production, not for the general case.

So I don't see the value of dropping something that works across realms

The key point here is that it doesn't work across all realms: it was designed with the production use case in mind and works well in production. It has had some hacks added to it (specifically the profile running in cloud) to allow deployment-prep and Pontoon to use the production use case; however, I'll say again, it is very far from a general-purpose multi-tenancy PKI solution.

in favor of something else that works better only for production/k8s,

Ultimately, because the production network is where my role's responsibilities lie.

This, along with most of my response, probably comes across a bit harsher than I really feel, but I'm trying to clarify that this is not supported by the organisation and management structures. I would not get an OKR approved to create a general-purpose multi-tenancy PKI solution for cloud, largely because I'm not sure the business case for such a thing has been proven, but also very much because I'm part of the Foundation's SRE team and not WMCS (I have no idea if WMCS would approve this either).

I'd like to make the two converge in something useful for everybody.

So with all that said, I should say that I honestly do make a lot of effort to try and get code working in all environments. IMO all code written here should work in production and cloud (regardless of deployment-prep, Pontoon, etc.) and be generally usable by the wider public. However, that is not how our code is written, and the view on that point varies massively within SRE and the organisation. Further, getting anywhere close to that point is very far away; there are still many production- and WMF-specific things hardcoded into our code base, so although it is a noble goal, it is a long road, and I think easing the issues for now is about the best we can do in terms of prioritisation.

In conclusion:

  • I think that we should have an officially supported staging environment which works with all the tooling that production uses, including things like PCC and k8s support
  • Pontoon is great, but whether it has been given the official stamp or not, it is going to suffer from the exact same issues as deployment-prep unless we get it considered in the change process and treated as a first-class citizen
  • Production profiles and roles should not be expected to work in $random cloud project

Change 741917 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] P:cache::kafka::Webrequest: use cert defined in P:certificates

https://gerrit.wikimedia.org/r/741917

I am wondering what is best to do for use cases like:

Basically all librdkafka-based clients need a .pem with a chain of certificates to trust, but we also need to be able to make them work in cloud too. The wmf_trusted_root_CAs.pem seems like the better fit for the general use case, but I am open to suggestions :)

I have created a new change, and rebased or re-issued the above changes, which tries to take a best-of-both-worlds approach. See the CR starting here

One thing that seems worth highlighting is that at some point I assume we will also need to have the jks/p12 truststore on k8s machines. As such we may ultimately end up creating a java-wmf-certificates package. However, I think the current PS should work with prod, deployment-prep, Pontoon, and likely other cloud environments (assuming they configure hiera correctly).

Change 741867 merged by Jbond:

[operations/puppet@production] P:base::certificates: update support for trusted CA

https://gerrit.wikimedia.org/r/741867

Change 741917 merged by Jbond:

[operations/puppet@production] P:cache::kafka::Webrequest: use cert defined in P:certificates

https://gerrit.wikimedia.org/r/741917

Change 742672 had a related patch set uploaded (by Elukey; author: Elukey):

[analytics/refinery@master] gobblin: move to the new canonical bundle jks location

https://gerrit.wikimedia.org/r/742672

Change 742673 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move kafkatee instances to the new CA bundle location

https://gerrit.wikimedia.org/r/742673

Change 742674 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move coal, navtiming and statsv to the new canonical CA bundle path

https://gerrit.wikimedia.org/r/742674

Very weird result in deployment-prep:

elukey@deployment-webperf11:~$ sudo rm /etc/ssl/localcerts/WMF_TEST_CA.pem

elukey@deployment-webperf11:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deployment-webperf11.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(872008fa68) root - deployment-prep: install php 7.4 on a mw appserver'
Notice: /Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]/ensure: defined content as '{md5}6374ff663c61cc49e3a51a66efe4a5da' (corrective)
Info: /Stage[main]/Sslcert::Trusted_ca/File[/etc/ssl/localcerts/WMF_TEST_CA.pem]: Scheduling refresh of Exec[generate trusted_ca]
Notice: /Stage[main]/Sslcert::Trusted_ca/Exec[generate trusted_ca]: Triggered 'refresh' from 1 event
Notice: The LDAP client stack for this host is: classic/sudoldap
Notice: /Stage[main]/Profile::Ldap::Client::Labs/Notify[LDAP client stack]/message: defined 'message' as 'The LDAP client stack for this host is: classic/sudoldap'
Notice: Applied catalog in 9.67 seconds

elukey@deployment-webperf11:~$ ls -l /etc/ssl/certs/wmf-ca-certificates.crt
-rw------- 1 root root 3188 Nov 30 10:14 /etc/ssl/certs/wmf-ca-certificates.crt
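The 0600 mode on the generated bundle above is what made it unreadable to non-root clients. A small runnable illustration of the shape of the fix (a file created with a restrictive umask, then made world-readable; the exact mechanism in the puppet change that followed may differ):

```shell
# Simulate a bundle generated with a restrictive umask (as a root Exec might),
# then make it world-readable; CA certs are public material, so 0444 is safe.
# Paths are dummies under a temp dir; this is an illustration, not the actual
# puppet change.
set -eu
tmp=$(mktemp -d)
( umask 077; printf 'dummy bundle\n' > "$tmp/wmf-ca-certificates.crt" )
stat -c '%a' "$tmp/wmf-ca-certificates.crt"   # 600: unreadable to services
chmod 0444 "$tmp/wmf-ca-certificates.crt"
stat -c '%a' "$tmp/wmf-ca-certificates.crt"   # 444: readable by group/others
```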

Change 742690 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] sslcert::trusted_ca: ensure cert bundle readability for group/others

https://gerrit.wikimedia.org/r/742690

Change 742690 merged by Elukey:

[operations/puppet@production] sslcert::trusted_ca: ensure cert bundle readability for group/others

https://gerrit.wikimedia.org/r/742690

Change 742724 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Add helper functions to retrieve CA bundle paths/passwords

https://gerrit.wikimedia.org/r/742724

Change 742725 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::kafka::broker: use new get ca bundle path helpers

https://gerrit.wikimedia.org/r/742725

Change 742672 merged by Elukey:

[analytics/refinery@master] gobblin: move to the new canonical bundle jks location

https://gerrit.wikimedia.org/r/742672

Change 742724 merged by Elukey:

[operations/puppet@production] Add helper functions to retrieve CA bundle paths/passwords

https://gerrit.wikimedia.org/r/742724

Change 742673 merged by Elukey:

[operations/puppet@production] Move kafkatee instances to the new CA bundle location

https://gerrit.wikimedia.org/r/742673

elukey closed this task as Resolved. Edited Nov 30 2021, 4:05 PM.
elukey claimed this task.

Summary:

The profile::base::certificates code is now able to work transparently for Pontoon/Deployment-Prep/Production:

  • in production, the profile just uses what is provided by the wmf-certificates package, without adding any extra file.
  • in deployment-prep, the profile creates the bundle using the local PKI + Puppet Root CA certs (without relying on the wmf-certificates package, which doesn't make sense there).
  • in Pontoon, the bundle just contains the Puppet CA certificates.

The good thing is that we can now use the following helpers in the puppet code to retrieve the .crt/.jks bundles:

  • profile::base::certificates::get_trusted_ca_path()
  • profile::base::certificates::get_trusted_ca_jks_path()
  • profile::base::certificates::get_trusted_ca_jks_password()

In kubernetes-land we'll rely on the wmf-certificates package.

This way our code should run and be testable on Pontoon and deployment-prep without too much swearing or too many moments of internal sadness.

Change 742674 merged by Elukey:

[operations/puppet@production] Move coal, navtiming and statsv to the new canonical CA bundle path

https://gerrit.wikimedia.org/r/742674

Change 742725 merged by Elukey:

[operations/puppet@production] profile::kafka::broker: use new get ca bundle path helpers

https://gerrit.wikimedia.org/r/742725