Phase out cergen for Search Platform services
Closed, Resolved, Public

Description

cergen is our legacy tooling to manage and generate TLS certificates (https://wikitech.wikimedia.org/wiki/Cergen). It has been replaced by an installation of cfssl (https://wikitech.wikimedia.org/wiki/PKI), which the majority of services already use.

Our cergen installation is co-hosted on one of the Puppet 5 master frontends (puppetmaster1001), which runs Buster. cergen is based on legacy libraries: it uses networkx v1, which is incompatible with current networkx releases (networkx 2 was released in 2017). Even when the puppetmasters were moved to Buster, this needed a hack to build a co-installable legacy package in a component (T235405).

Instead of forward-porting it yet again to the new installation, we'll use the Puppet 5 -> Puppet 7 migration to also phase out cergen and only use cfssl.

Most of those certs are used by Envoy, and our Puppet integration makes the switch relatively straightforward: set the profile::tlsproxy::envoy::ssl_provider Hiera flag to "cfssl" and specify the SNI names via profile::tlsproxy::envoy::cfssl_options/hosts.

Some examples for this can be found at
https://github.com/wikimedia/operations-puppet/commit/66fbddeac3a4b2dfa1d8e19a49cc649dcb745f18
https://github.com/wikimedia/operations-puppet/commit/a00d0441b4509e736d8abd6ff63f25224e306239

For use cases outside of Envoy, the profile::pki::get_cert define provides a convenient method to request certificates. An example of how the gradual migration was implemented for the Ganeti RAPI endpoint can be found at https://github.com/wikimedia/operations-puppet/commit/98350d2dff51bb9bf57263fe50f409374892ae1d

There are currently 5 certificate YAML specs defined in /srv/private/modules/secret/secrets/certificates/certificate.manifests.d which need to be moved to PKI/cfssl. Some services have likely been ported already, with only the YAML spec file and the legacy certs left behind; fixing those might be as simple as removing the legacy cert material.

nginx based
envoy based

Event Timeline

Gehel triaged this task as Medium priority. Mar 20 2024, 9:00 AM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.

I am starting by looking at the relforge cluster. I see that the certificates are served by nginx and are still issued by the Puppet CA.

btullis@relforge1003:/etc/nginx$ openssl x509 -in /etc/ssl/localcerts/relforge.svc.eqiad.wmnet.chained.crt -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 7899 (0x1edb)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = Puppet CA: palladium.eqiad.wmnet
        Validity
            Not Before: Mar 18 02:55:32 2021 GMT
            Not After : Mar 18 02:55:32 2026 GMT
        Subject: CN = relforge.svc.eqiad.wmnet

I'll check to see if there is any code ready to deploy cfssl based certificates for nginx.
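The same check can be scripted when several hosts need to be audited. The following is a minimal sketch, not existing tooling: it parses the text dump produced by `openssl x509 -noout -text` and reports the issuer, the subject and the certificate lifetime. The function name `summarize_cert` and the rule "the issuer contains `Puppet CA`" are assumptions made for illustration; the sample text is the excerpt quoted above.

```python
import re
import ssl

# Excerpt of the `openssl x509 -noout -text` output quoted above.
SAMPLE = """\
        Issuer: CN = Puppet CA: palladium.eqiad.wmnet
        Validity
            Not Before: Mar 18 02:55:32 2021 GMT
            Not After : Mar 18 02:55:32 2026 GMT
        Subject: CN = relforge.svc.eqiad.wmnet
"""


def summarize_cert(text):
    """Extract issuer, subject and lifetime from openssl's text dump."""

    def field(label):
        # Match "<label>: value" or "<label> : value" at the start of a line.
        match = re.search(rf"^\s*{label}\s*:\s*(.+?)\s*$", text, re.MULTILINE)
        if match is None:
            raise ValueError(f"field {label!r} not found")
        return match.group(1)

    # ssl.cert_time_to_seconds() parses the "Mar 18 02:55:32 2021 GMT" format.
    not_before = ssl.cert_time_to_seconds(field("Not Before"))
    not_after = ssl.cert_time_to_seconds(field("Not After"))
    issuer = field("Issuer")
    return {
        "issuer": issuer,
        "subject": field("Subject"),
        "lifetime_days": (not_after - not_before) // 86400,
        # Assumption: legacy certs are recognisable by "Puppet CA" in the issuer.
        "legacy_puppet_ca": "Puppet CA" in issuer,
    }


if __name__ == "__main__":
    for key, value in summarize_cert(SAMPLE).items():
        print(f"{key}: {value}")
```

In practice the text would come from running openssl on each host rather than from an embedded string; that part is left out to keep the sketch self-contained.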

John added support for using cfssl as the provider used by profile::elasticsearch::cirrus::ssl_provider two years ago, but it's not yet used.

Change #1023426 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch relforge certificates from cergen to pki

https://gerrit.wikimedia.org/r/1023426

Change #1023440 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Replace tabs with 4 spaces in tlsproxy nginx.conf

https://gerrit.wikimedia.org/r/1023440

I have a whitespace-only change in the nginx configuration for tlsproxy here: https://gerrit.wikimedia.org/r/1023440
It looks safe to me, but since it touches all the maps servers and every elasticsearch::cirrus server, I think that I had better get a review from @hnowlan and either @bking or @RKemper.

I have tried out the new cfssl support in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1023426 but unless I'm mistaken I think that we will need to modify profile::elasticsearch::cirrus a little. The server names and aliases aren't being passed through to the cfssl based certificate and I'm not sure that the resulting nginx config will be correct.

Change #1023469 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add server aliases to the cirrus/cfssl proxy config

https://gerrit.wikimedia.org/r/1023469

Change #1023813 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch elasticsearch::cirrus tlsproxy to pki

https://gerrit.wikimedia.org/r/1023813

Change #1023815 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch the wcqs tlsproxy to use pki

https://gerrit.wikimedia.org/r/1023815

Change #1023819 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch wdqs::public tlsproxy from cergen to pki

https://gerrit.wikimedia.org/r/1023819

Change #1023825 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Switch wdqs::internal tlsproxy from cergen to pki

https://gerrit.wikimedia.org/r/1023825

BTullis updated the task description.

Change #1023440 merged by Bking:

[operations/puppet@production] Replace tabs with 4 spaces in tlsproxy nginx.conf

https://gerrit.wikimedia.org/r/1023440

Change #1023815 merged by Btullis:

[operations/puppet@production] Switch the wcqs tlsproxy to use pki

https://gerrit.wikimedia.org/r/1023815

Change #1023825 merged by Btullis:

[operations/puppet@production] Switch wdqs::internal tlsproxy from cergen to pki

https://gerrit.wikimedia.org/r/1023825

Change #1023819 merged by Btullis:

[operations/puppet@production] Switch wdqs::public tlsproxy from cergen to pki

https://gerrit.wikimedia.org/r/1023819

Change #1023469 merged by Btullis:

[operations/puppet@production] Add server aliases to the cirrus/cfssl proxy config

https://gerrit.wikimedia.org/r/1023469

Change #1023426 merged by Bking:

[operations/puppet@production] Switch relforge certificates from cergen to pki

https://gerrit.wikimedia.org/r/1023426

We rolled out the change to relforge. It works, but the Icinga certificate expiry checks triggered, because they fire on the default certificate expiry of the discovery intermediate.

I think it's best to fix this before we roll it out to the main elasticsearch::cirrus clusters.
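To illustrate why an expiry check tuned for long-lived certificates fires on short-lived ones, here is a rough sketch. It is not the Icinga check and not necessarily how the alerting was eventually adjusted. The 28-day lifetime, the 30/15-day fixed thresholds and the fractional thresholds are all hypothetical values picked for the example, as are both function names.

```python
from datetime import datetime, timedelta, timezone


def fixed_threshold_status(not_after, now, warn_days=30, crit_days=15):
    """Classify by absolute days remaining (hypothetical thresholds)."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    if remaining <= timedelta(days=crit_days):
        return "CRITICAL"
    if remaining <= timedelta(days=warn_days):
        return "WARNING"
    return "OK"


def expiry_status(not_before, not_after, now,
                  warn_fraction=0.25, crit_fraction=0.1):
    """Classify by the fraction of the certificate lifetime that remains."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    # Dividing two timedeltas yields a float between 0 and 1 here.
    fraction_left = remaining / (not_after - not_before)
    if fraction_left <= crit_fraction:
        return "CRITICAL"
    if fraction_left <= warn_fraction:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Assumption: a freshly issued certificate with a 28-day lifetime.
    issued = datetime(2024, 4, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=28)
    print("fixed thresholds:", fixed_threshold_status(expires, issued))
    print("relative thresholds:", expiry_status(issued, expires, issued))
```

With fixed thresholds the certificate is already in a warning state on the day it is issued, whereas thresholds relative to the lifetime only alert once automatic renewal appears to have stalled.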

Change #1024420 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove obsolete certs for wdqs/wcqs

https://gerrit.wikimedia.org/r/1024420

Change #1024421 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Remove obsolete dummy certs

https://gerrit.wikimedia.org/r/1024421

Change #1024481 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: Configure alerts for short-lived certs

https://gerrit.wikimedia.org/r/1024481

Change #1024421 merged by Muehlenhoff:

[labs/private@master] Remove obsolete dummy certs

https://gerrit.wikimedia.org/r/1024421

Change #1024420 merged by Muehlenhoff:

[operations/puppet@production] Remove obsolete certs for wdqs/wcqs

https://gerrit.wikimedia.org/r/1024420

Change #1025775 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove obsolete certificate

https://gerrit.wikimedia.org/r/1025775

Change #1025777 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Remove obsolete stub cert

https://gerrit.wikimedia.org/r/1025777

Change #1023813 merged by Bking:

[operations/puppet@production] Switch elasticsearch::cirrus tlsproxy to pki

https://gerrit.wikimedia.org/r/1023813

Change #1024481 merged by Bking:

[operations/puppet@production] elasticsearch: Configure alerts for short-lived certs

https://gerrit.wikimedia.org/r/1024481

As of yesterday, the production Elastic clusters are using CFSSL, which means we've accomplished our migration off of Cergen. Thanks to @BTullis and everyone else who helped with this.

Change #1025777 merged by Muehlenhoff:

[labs/private@master] Remove obsolete stub cert

https://gerrit.wikimedia.org/r/1025777

Change #1025775 merged by Muehlenhoff:

[operations/puppet@production] Remove obsolete certificate

https://gerrit.wikimedia.org/r/1025775

Change #1026438 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove obsolete cert

https://gerrit.wikimedia.org/r/1026438

Change #1026439 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[labs/private@master] Remove obsolete dummy cert

https://gerrit.wikimedia.org/r/1026439

Change #1026803 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] elasticsearch: Remove support for sslcert SSL provider

https://gerrit.wikimedia.org/r/1026803

Change #1026439 merged by Muehlenhoff:

[labs/private@master] Remove obsolete dummy cert

https://gerrit.wikimedia.org/r/1026439

Change #1026438 merged by Muehlenhoff:

[operations/puppet@production] Remove obsolete cert

https://gerrit.wikimedia.org/r/1026438

Change #1026803 merged by Muehlenhoff:

[operations/puppet@production] elasticsearch: Remove support for sslcert SSL provider

https://gerrit.wikimedia.org/r/1026803

Change #1029121 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] elasticsearch::tlsproxy: Stop passing certs to tlsproxy::localssl

https://gerrit.wikimedia.org/r/1029121

Change #1029121 merged by Muehlenhoff:

[operations/puppet@production] elasticsearch::tlsproxy: Stop passing certs to tlsproxy::localssl

https://gerrit.wikimedia.org/r/1029121