Phase out cergen for Fundraising services
Closed, ResolvedPublic

Description

cergen is our legacy tooling to manage/generate TLS certificates (https://wikitech.wikimedia.org/wiki/Cergen). It has been replaced by an installation of cfssl (https://wikitech.wikimedia.org/wiki/PKI), which the majority of services now use.

Fundraising uses a client certificate generated with cergen for its kafkatee instance, which consumes from the kafka-jumbo cluster. Historically, Fundraising Tech Ops has generated the certificate in production and imported it into the fundraising puppet-private repository.

If we can continue to manually generate a certificate with cfssl, that will be fine for our purposes.

Event Timeline

Hi Jeff!

After a chat with Moritz we agreed that the simplest solution would be to create the cert via puppet in production on some host (we need to figure out which one) and then you can copy the necessary files over to Fundraising when needed.

Caveat: the TLS certs issued by the Kafka intermediate CA last for 180 days, so puppet will create a new cert twice a year.

The code to generate the client cert should be the same (or very similar) to what we do for varnishkafka in profile::cache::kafka::certificate.

As for the host to which the keys are exported, the cumin hosts seem like the best choice.

Sounds fine to me. I looked at the puppet code and if I understand correctly, cfssl::cert will automatically generate a new certificate 10 (default) days before expiration. Hopefully we can figure out how to make puppet send a notification when the new cert is available, so we can fetch and deploy it in frack.
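
For reference, the renewal timing described here can be worked out with a few lines of arithmetic. This is only a sketch: the 180-day lifetime and 10-day renewal window are the values quoted in this thread, while the issue date and the function names are made up for illustration.

```python
from datetime import date, timedelta

# Values quoted in this thread; adjust if the CA or cfssl::cert settings differ.
CERT_LIFETIME = timedelta(days=180)  # lifetime of certs from the Kafka intermediate CA
RENEW_BEFORE = timedelta(days=10)    # cfssl::cert default renewal window, per the comment above


def renewal_date(issued: date) -> date:
    """Date on which a replacement certificate would be expected."""
    return issued + CERT_LIFETIME - RENEW_BEFORE


def renewals_per_year() -> float:
    """Approximate number of renewals in a 365-day year."""
    return 365 / (CERT_LIFETIME - RENEW_BEFORE).days


# Hypothetical issue date, purely for illustration.
print(renewal_date(date(2024, 1, 1)))
print(round(renewals_per_year(), 2))
```

So a new cert would show up roughly every 170 days, which is slightly more often than twice a year.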

One option would be to add a daily systemd::timer::job in Puppet that runs a short script. The script would check the remaining validity of the frack cert and, if it is less than ten days, send an email notification (via the $send_mail and $send_mail_to parameters) to some frtech mail alias.
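
A rough sketch of what such a check script could look like is below. It is an illustration, not the script that was deployed: the function names are invented, the ten-day threshold is the one mentioned above, and how a warning turns into an email is left to the timer job's configuration. It shells out to `openssl x509 -enddate -noout` and parses the `notAfter=` line with the standard library.

```python
import ssl
import subprocess
import time
from typing import Optional

WARN_DAYS = 10  # threshold suggested in this thread


def read_not_after(cert_path: str) -> str:
    """Return the notAfter timestamp of a PEM certificate via the openssl CLI."""
    out = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    ).stdout
    # openssl prints e.g. "notAfter=Jan  5 09:34:43 2018 GMT"
    return out.strip().split("=", 1)[1]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry, given a notAfter string in openssl's text format."""
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expiry - now) / 86400


def needs_notice(not_after: str, now: Optional[float] = None,
                 warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires in fewer than warn_days days."""
    return days_remaining(not_after, now) < warn_days


# Example wiring (the file name is hypothetical; only the directory
# /etc/fr-tech-kafka-client appears in this task):
#   not_after = read_not_after("/etc/fr-tech-kafka-client/<cert>.pem")
#   if needs_notice(not_after):
#       raise SystemExit(f"frack Kafka cert expires at {not_after}")
```

Keeping the date arithmetic separate from the openssl call makes the threshold logic easy to test without a real certificate.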

For reviewing and merging patches, simply add me as a reviewer; I'm happy to help/unblock what is needed.

Change #1030018 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add a class to Cumin hosts which generates a Kafka certificate for frtech

https://gerrit.wikimedia.org/r/1030018

Change #1030018 merged by Muehlenhoff:

[operations/puppet@production] Add a class to Cumin hosts which generates a Kafka certificate for frtech

https://gerrit.wikimedia.org/r/1030018

Change #1030917 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] profile::frtech::kafka_certificate: Fix owner

https://gerrit.wikimedia.org/r/1030917

Change #1030917 merged by Muehlenhoff:

[operations/puppet@production] profile::frtech::kafka_certificate: Fix owner

https://gerrit.wikimedia.org/r/1030917

@Dwisehaupt @Jgreen The kafka cert issued by the PKI is now getting deployed to /etc/fr-tech-kafka-client on cumin1002/cumin2002. Could you please sync it to fr-tech and test/deploy instead of the old cergen-issued cert? When this has been confirmed to work fine, we can add a systemd timer which sends a notification if the key renewal is forthcoming.

@MoritzMuehlenhoff We've switched over to the new cert.

Jgreen claimed this task.

Hmmmm. I'm seeing this occasionally:

Kafka error (-195): ssl://kafka-jumbo1010.eqiad.wmnet:9093/1010: Connect to ipv4#10.64.130.10:9093 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)

Only from kafka-jumbo1010 and kafka-jumbo1011. Kafkatee is receiving content, although I'm not sure yet whether from these two brokers.

This isn't limited to the two hosts I mentioned above. It seems like at any given time one of the kafka-jumbo brokers is refusing connections.

Looks like coincidental rolling kafka restarts.
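
For anyone who wants to confirm this kind of pattern, a plain TCP probe is enough to see which brokers are refusing connections at a given moment. The sketch below is an assumption-laden illustration: the function names are invented, and it only tests the TCP connect (no TLS handshake, no Kafka protocol), which is the stage at which the error above was reported.

```python
import socket
from typing import Dict, Iterable, Tuple


def probe_broker(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the result as ok / refused / error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # Matches the "Connection refused" seen while a broker is restarting.
        return "refused"
    except OSError as exc:
        # Timeouts, DNS failures, unreachable networks, etc.
        return f"error: {exc}"


def probe_all(brokers: Iterable[Tuple[str, int]]) -> Dict[str, str]:
    """Probe each (host, port) pair and return a status per broker."""
    return {f"{host}:{port}": probe_broker(host, port) for host, port in brokers}


# Example, using the broker named in the error above:
#   print(probe_all([("kafka-jumbo1010.eqiad.wmnet", 9093)]))
```

Running this in a loop during a rolling restart should show the "refused" status moving from one broker to the next.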

@Jgreen: Just to double-check, is the certificate expiry tracked via monitoring internal to fr-tech (or some manual equivalent like a gcal entry)? I just want to make sure nothing else is needed before I clean out the old cergen definitions for the previously used Kafka cert.

Change #1037075 had a related patch set uploaded (by Jgreen; author: Jgreen):

[operations/puppet@production] Add an icinga/nsca collector for Fundraising kafka client cert expire check.

https://gerrit.wikimedia.org/r/1037075

I added a local nagios/icinga check; once the above-linked commit is merged we should be good to go.

Change #1037075 merged by Filippo Giunchedi:

[operations/puppet@production] Add an icinga/nsca collector for Fundraising kafka client cert expire check.

https://gerrit.wikimedia.org/r/1037075

With the patch merged, can you please double-check that the cert monitoring is now working fine for you? Then, as the last cleanup step, I'd go ahead and remove the legacy cert from cergen.

@MoritzMuehlenhoff cert monitoring looks good, icinga is reporting the check correctly.

Ack, I've just removed the cergen certs formerly used by the Kafka fundraising client.