kafka-main certificates expiring on 2024-04-04
Closed, Resolved | Public

Description

From https://alerts.wikimedia.org/?q=alertname%3DKafka%20broker%20TLS%20certificate%20validity

kafka-main broker certificates apparently expire in less than 2 weeks. We'll need to renew them before then; otherwise we risk a big outage, as jobs will presumably not be submittable.

Event Timeline

akosiaris moved this task from Incoming 🐫 to Doing 😎 on the serviceops board.
akosiaris added a subscriber: brouberol.

Adding @brouberol, as they probably have far more experience refreshing Kafka certificates than anyone in serviceops.

Let me have a look at how these certificates are generated. I'm thinking we should renew them and trigger a rolling-restart of the cluster.

Edit: here's the runbook

brouberol@kafka-main2001:~$ echo y | openssl s_client -connect $(hostname -f):9093  | openssl x509 -issuer -nout
x509: Unrecognized flag nout
x509: Use -help for summary.
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = kafka
verify return:1
depth=0 CN = kafka-main2001.codfw.wmnet
verify return:1
DONE

The runbook mentions

If the CA mentioned is:

  • the Puppet one, then you'll need to follow Cergen#Update_a_certificate and deploy the new certificate to all nodes.
  • the Kafka PKI Intermediate one, then in theory a new certificate should be issued few days before the expiry and puppet should replace the Kafka keystore automatically (under /etc/kafka/ssl).

@akosiaris do you happen to know which one it is in that case? It's not obvious to me. Thanks!

I'd tend to say Kafka PKI Intermediate due to depth=1 CN=kafka but a confirmation would be perfect.
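
For reference, the x509 error in the output above comes from a typo: the flag is -noout, not -nout. Below is a minimal sketch of the corrected check. show_issuer and issuer_cn are hypothetical helper names, not something from the runbook. Note that s_client prints the depth= lines on stderr, so parsing them requires 2>&1.

```shell
# Print issuer and expiry date of the certificate a broker is serving.
# $1 is host:port, e.g. "$(hostname -f):9093" as used in this thread.
show_issuer() {
    echo | openssl s_client -connect "$1" 2>/dev/null \
        | openssl x509 -noout -issuer -enddate
}

# Hypothetical helper: read s_client verification lines on stdin and print
# the CN of the depth=1 certificate (the intermediate that issued the leaf).
issuer_cn() {
    sed -n 's/^depth=1 .*CN *= *\(.*\)$/\1/p'
}
```

Something like `echo | openssl s_client -connect "$(hostname -f):9093" 2>&1 | issuer_cn` should then print only the intermediate's CN, which makes it easier to tell the Puppet CA from the Kafka PKI Intermediate at a glance.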

I can't see anything in /srv/private on puppetmaster1001, where the cergen certs are, and the cergen phase-out task, T357750, doesn't list kafka-main, so I am pretty sure it's the Kafka PKI Intermediate. To add credence to that:

kafka-main1001:/etc/kafka/ssl$ openssl x509 -in kafka__kafka-main1001_eqiad_wmnet_kafka_11.chained.pem -startdate -enddate
notBefore=Mar  6 14:09:00 2024 GMT
notAfter=Mar  6 14:09:00 2025 GMT

So we apparently have the certs already issued and we only lack the rolling restart?
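
To make that comparison concrete, here is a small sketch that turns a notAfter value into days remaining. It assumes GNU date (for -d), and days_until is a hypothetical name.

```shell
# Print the whole days left until a certificate's notAfter date.
# $1: notAfter string as printed by `openssl x509 -enddate`
# $2: optional "now" in epoch seconds (defaults to the current time)
days_until() {
    end=$(date -u -d "$1" +%s) || return 1
    now=${2:-$(date -u +%s)}
    echo $(( (end - now) / 86400 ))
}
```

Feeding it the notAfter of the file under /etc/kafka/ssl and then the notAfter of the certificate served on port 9093 should give two different numbers if the broker has not yet picked up the renewed keystore, which would point to a restart rather than a reissue.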

Thanks!

So, since I've never done this before (that I remember), please double-check me on this. Is it enough to run:

sudo cookbook sre.kafka.roll-restart-reboot-brokers \
    --alias kafka-main \
    --reason 'certificate reissue' \
    restart_daemons

Luca migrated kafka/main to the PKI in https://phabricator.wikimedia.org/T319372 and left a comment on the task in that regard:

What is it going to change after the move to PKI ?
The only annoyance is that every 6 months we'll need to run the kafka roll restart cookbook to pick up the new TLS certificates, since the PKI ones last 6 months for the moment. This is due to the current version of Kafka that doesn't allow hot reload of keystores: https://wikitech.wikimedia.org/wiki/Kafka/Administration#Renew_TLS_certificate

That looks good. As a side note, I've recently changed that runbook so that we only move on to the next broker once the one that was just restarted is back to full health, which should be safer.

akosiaris claimed this task.

Alerts gone, I'll resolve this.

As a note to anyone seeing this in the future: the valid arguments to --alias above are kafka-main-eqiad, kafka-test-eqiad and kafka-main-codfw.
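
Combining that note with the command earlier in the thread, here is a sketch that only prints the cookbook invocation for a valid alias and does not run it. roll_restart_cmd is a hypothetical helper, and the accepted aliases are just the three named above.

```shell
# Print (not run) the roll-restart command for a given cluster alias.
roll_restart_cmd() {
    case "$1" in
        kafka-main-eqiad|kafka-main-codfw|kafka-test-eqiad) ;;
        *) echo "unknown alias: $1" >&2; return 1 ;;
    esac
    printf "sudo cookbook sre.kafka.roll-restart-reboot-brokers --alias %s --reason 'certificate reissue' restart_daemons\n" "$1"
}
```

For example, `roll_restart_cmd kafka-main-codfw` prints the command for the codfw cluster, while a bare `kafka-main` is rejected.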

FWIW I just went through a similar triage and broker restart process in T358870

It wasn't obvious at first that all that was needed was a rolling restart, because the expiry warning actually fired before puppet auto-renewed the certs on disk. I patched that so broker certs auto-renew 30d before expiry instead of ~10d, but it's still not very intuitive.

Won't reopen this task for it, but I'm very open to ideas/conversations on improving these alerts or the automation underneath them!