Rotate discovery intermediate certificate (expires 2026-05-03)
Open, High, Public

Description

I was chasing down certificate issues related to T419289, when I noticed that the dse-k8s issuer's intermediate certificate expires in early May:

gnutls-cli --print-cert opensearch-semantic-search.svc.eqiad.wmnet:30443  | grep -i expire
- subject `CN=opensearch-semantic-search.discovery.wmnet', issuer `CN=discovery,OU=SRE Foundations,O=Wikimedia Foundation\, Inc,L=San Francisco,C=US', serial 0x5672410c6fc4f152d8f03362685dddaf0b60997c, RSA key 2048 bits, signed using ECDSA-SHA512, activated `2026-03-13 06:20:00 UTC', expires `2026-04-10 06:20:00 UTC', pin-sha256="IXvDF4K9qvKm/oQEH191dDBC+Wav2jSNZGSKAV2ARLU="
- subject `CN=discovery,OU=SRE Foundations,O=Wikimedia Foundation\, Inc,L=San Francisco,C=US', issuer `CN=Wikimedia_Internal_Root_CA,OU=Cloud Services,O=Wikimedia Foundation\, Inc,L=San Francisco,ST=California,C=US', serial 0x715331115b69e7112b0e3c7f8c89ce15c51a4639, EC/ECDSA key 528 bits, signed using ECDSA-SHA512, activated `2021-05-04 13:54:00 UTC', expires `2026-05-03 13:54:00 UTC', pin-sha256="PbgfDlEHVB4Zw0a42zNqqnEQbcYF9jYp/dbT4eSdOb8="
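A minimal sketch of how the expiry dates in output like the above could be extracted and turned into a days-remaining figure. It assumes the `subject` and `expires` field layout shown in the two lines above; the names `parse_expiries`, `days_left` and `SAMPLE` are illustrative, and `SAMPLE` is a trimmed copy of those lines, not a full capture.

```python
import re
from datetime import datetime, timezone

# Matches the subject and expiry fields of a `gnutls-cli --print-cert` summary line.
CERT_LINE = re.compile(r"subject `(?P<subject>[^']*)'.*?expires `(?P<expires>[^']*)'")

# Trimmed copy of the two summary lines above (issuer, serial and pin omitted).
SAMPLE = "\n".join([
    r"- subject `CN=opensearch-semantic-search.discovery.wmnet', "
    r"activated `2026-03-13 06:20:00 UTC', expires `2026-04-10 06:20:00 UTC'",
    r"- subject `CN=discovery,OU=SRE Foundations,O=Wikimedia Foundation\, Inc,"
    r"L=San Francisco,C=US', activated `2021-05-04 13:54:00 UTC', "
    r"expires `2026-05-03 13:54:00 UTC'",
])


def parse_expiries(output):
    """Return (subject, expiry as UTC datetime) for each certificate summary line."""
    certs = []
    for line in output.splitlines():
        match = CERT_LINE.search(line)
        if match is None:
            continue  # not a certificate summary line
        expires = datetime.strptime(match["expires"], "%Y-%m-%d %H:%M:%S UTC")
        certs.append((match["subject"], expires.replace(tzinfo=timezone.utc)))
    return certs


def days_left(expires, now):
    """Whole days between `now` and `expires` (negative once expired)."""
    return (expires - now).days


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for subject, expires in parse_expiries(SAMPLE):
        print(f"{days_left(expires, now):>5} days  {subject}")
```

Sorting the result by expiry makes the certificate that needs attention first, leaf or intermediate, easy to spot.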

Creating this ticket to:

  • Find and read relevant wikitech docs
  • Consult with IF/Service Ops if necessary
  • Rotate intermediate certificate before 2026-05-03 13:54:00 UTC
  • Change the alerting for certificate expiry to create a task

Details

Related Changes in Gerrit:
Repo                          Branch      Lines +/-
operations/puppet             production  +1 -1
operations/puppet             production  +0 -56
operations/puppet             production  +1 -0
operations/puppet             production  +3 -0
operations/puppet             production  +1 -0
operations/puppet             production  +1 -0
operations/puppet             production  +2 -0
operations/puppet             production  +2 -1
operations/puppet             production  +1 -0
operations/puppet             production  +2 -1
operations/puppet             production  +1 -0
operations/puppet             production  +2 -0
operations/puppet             production  +2 -2
operations/puppet             production  +1 -0
operations/puppet             production  +1 -0
operations/puppet             production  +1 -0
operations/puppet             production  +2 -2
operations/puppet             production  +1 -1
operations/puppet             production  +2 -0
operations/puppet             production  +1 -3
operations/deployment-charts  master      +3 -6
operations/puppet             production  +11 -0
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +26 -24
operations/puppet             production  +1 -1
operations/puppet             production  +2 -2
operations/puppet             production  +1 -0
operations/puppet             production  +2 -0
operations/puppet             production  +2 -2
operations/puppet             production  +6 -0
operations/puppet             production  +1 -0
operations/puppet             production  +3 -0
operations/puppet             production  +3 -0
operations/deployment-charts  master      +3 -0
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/puppet             production  +12 -12
labs/private                  master      +28 -0
operations/puppet             production  +55 -0
operations/puppet             production  +22 -0
operations/puppet             production  +1 -0
operations/deployment-charts  master      +7 -1
operations/puppet             production  +2 -2
operations/puppet             production  +12 -12
operations/puppet             production  +13 -0
operations/puppet             production  +12 -12
Event Timeline

@elukey was nice enough to let me help out today. Here's what we found on wdqs2025 after updating its hiera values to use the new intermediate:

/etc/cfssl/csr/discovery__query-experimental_eqiad_wmnet_server.csr is removed but /etc/cfssl/csr/discovery2026__query-experimental_eqiad_wmnet_server.csr doesn't exist yet when cfssl gencert runs. I'm guessing this will be fixed early next week, but for now I'm pretty sure that just running Puppet twice would be good enough to refresh the certificate. I'll try another test host and post an update.

To keep archives happy - this should be fixed with https://gerrit.wikimedia.org/r/1277175

On a related topic, do you think it would be useful to add the intermediates to the wmf-certificates deb pkg? Currently it only contains the root CA.

Yes, this is by design, to keep the clients as light as possible. Services serve a chained certificate that contains both the leaf and the intermediate, so clients only need the Root CA deployed.
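To illustrate the point about chained certificates, here is a minimal sketch that splits a chained PEM file into its leaf and intermediate blocks. It assumes the usual TLS ordering (leaf first, then intermediates); the name `describe_chain` is illustrative and not part of any existing tooling.

```python
import re

# One PEM-encoded certificate, including its BEGIN/END markers.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.DOTALL
)


def describe_chain(chained_pem):
    """Split a chained PEM into its leaf (first block) and the remaining blocks."""
    blocks = PEM_BLOCK.findall(chained_pem)
    return {
        "leaf": blocks[0] if blocks else None,
        "intermediates": blocks[1:],
    }
```

A chained file served by a service would be expected to yield one leaf and at least one intermediate. A client trust store that follows the design described above holds only the root.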

Promised update: I added a few more hosts for testing. It turns out that running Puppet twice isn't enough on its own, and Envoy as a service proxy requires more config than Envoy as a TLS terminator. More details in P91477.

@elukey wdqs2026 and/or wdqs2027 have not been touched other than applying the puppet patches above. If you want to use them for testing next week, feel free. Just be sure to depool them.

Thanks for the tests! The bug should be fixed, I'll check those hosts today!

Change #1277438 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] envoyproxy: trigger the envoy's config re-creation if deleted

https://gerrit.wikimedia.org/r/1277438

Change #1275960 abandoned by Elukey:

[operations/puppet@production] Move netbox, debmonitor and presto to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1275960

Change #1277449 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] Move netbox and presto to the new PKI intermediate

https://gerrit.wikimedia.org/r/1277449

Change #1277452 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::tlsproxy::envoy: add condition to cfss base options

https://gerrit.wikimedia.org/r/1277452

Change #1277458 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] ganeti: Make the cfssl label configurable via Hiera

https://gerrit.wikimedia.org/r/1277458

Change #1277458 merged by Muehlenhoff:

[operations/puppet@production] ganeti: Make the cfssl label configurable via Hiera

https://gerrit.wikimedia.org/r/1277458

Change #1277484 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Move ganeti-test to the 2026 PKI discovery intermediate

https://gerrit.wikimedia.org/r/1277484

Change #1277506 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::mediabackup: move to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1277506

Change #1277507 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::puppetdb: move to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1277507

Change #1277508 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::opensearch::cirrus::server: move to a new pki intermediate

https://gerrit.wikimedia.org/r/1277508

Change #1277509 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::hcaptcha: move to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1277509

Change #1277510 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::dragonfly: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277510

Change #1277511 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::docker_registry: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277511

Change #1277512 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::cache::purge: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277512

Change #1277513 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] profile::etcd::tlsproxy: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277513

Change #1277484 merged by Muehlenhoff:

[operations/puppet@production] Move ganeti-test to the 2026 PKI discovery intermediate

https://gerrit.wikimedia.org/r/1277484

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: testvm2002.codfw.wmnet

  • testvm2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox
    • Removed from DebMonitor
    • Removed from Puppet server and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw_test to Netbox

Change #1277438 merged by Elukey:

[operations/puppet@production] envoyproxy: trigger the envoy's config re-creation if deleted

https://gerrit.wikimedia.org/r/1277438

Change #1277452 merged by Elukey:

[operations/puppet@production] profile::tlsproxy::envoy: add condition to cfss base options

https://gerrit.wikimedia.org/r/1277452

Change #1277449 merged by Elukey:

[operations/puppet@production] Move netbox and presto to the new PKI intermediate

https://gerrit.wikimedia.org/r/1277449

Change #1277507 merged by Elukey:

[operations/puppet@production] profile::puppetdb: move to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1277507

Change #1275956 abandoned by Elukey:

[operations/puppet@production] profile::pki::get_cert: add lookup() to the label argument

https://gerrit.wikimedia.org/r/1275956

@bking the scholar internal wdqs hosts should be fixed now, lemme know if you want to test more!

Change #1277512 merged by Elukey:

[operations/puppet@production] profile::cache::purge: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277512

Change #1277510 merged by Elukey:

[operations/puppet@production] profile::dragonfly: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277510

Change #1277511 merged by Elukey:

[operations/puppet@production] profile::docker_registry: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277511

Change #1277513 merged by Elukey:

[operations/puppet@production] profile::etcd::tlsproxy: move to the new pki intermediate

https://gerrit.wikimedia.org/r/1277513

Change #1277622 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] admin_ng: Move all clusters to the pki discovery2026 intermediate

https://gerrit.wikimedia.org/r/1277622

Change #1277713 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: enable new discovery intermediate certificate

https://gerrit.wikimedia.org/r/1277713

Change #1277713 merged by Bking:

[operations/puppet@production] wdqs: enable new discovery intermediate certificate

https://gerrit.wikimedia.org/r/1277713

@elukey thanks for getting those done! I just merged the above patch to migrate all of WDQS and I'm still having to do the one-off steps I described in P91477 to get envoy to actually serve the new certificate on the production WDQS hosts.

Looking at the history in wdqs2026 it looks like you had to do some one-off steps as well. Is that expected? I (perhaps foolishly) assumed that the Puppet code was working when you said they were done.

Let me know if it is fixed and I just missed something, or if I will still need to do some one-off steps. I'm totally fine either way.

Change #1277622 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Move all clusters to the pki discovery2026 intermediate

https://gerrit.wikimedia.org/r/1277622

Mentioned in SAL (#wikimedia-operations) [2026-04-28T07:24:11Z] <jayme> switching cfss-issuer instances on all clusters to use discovery2026 - T420993

Change #1278256 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch remaining Ganeti clusters to discovery2026 intermediate

https://gerrit.wikimedia.org/r/1278256

Mentioned in SAL (#wikimedia-operations) [2026-04-28T07:55:50Z] <jayme> started renewal of certificates on codfw kubernetes clusters - T420993

Change #1278256 merged by Muehlenhoff:

[operations/puppet@production] Switch remaining Ganeti clusters to discovery2026 intermediate

https://gerrit.wikimedia.org/r/1278256

Mentioned in SAL (#wikimedia-operations) [2026-04-28T08:42:22Z] <jayme> started renewal of certificates on eqiad kubernetes clusters - T420993

Mentioned in SAL (#wikimedia-operations) [2026-04-28T08:44:41Z] <moritzm> migrate Ganeti clusters to the new discovery2026 intermediate, starting for the edges T420993

Mentioned in SAL (#wikimedia-operations) [2026-04-28T09:21:30Z] <moritzm> migrate eqiad/codfw Ganeti clusters to the new discovery2026 intermediate T420993

The renewal of all cert-manager managed certificates (in all k8s clusters) has been completed. All certificates are now issued by discovery2026.

Change #1278391 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] debmonitor: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1278391

Change #1278391 merged by Muehlenhoff:

[operations/puppet@production] debmonitor: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1278391

Change #1278426 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] apt/staging: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1278426

Change #1277508 merged by Bking:

[operations/puppet@production] profile::opensearch::cirrus::server: move to a new pki intermediate

https://gerrit.wikimedia.org/r/1277508

Change #1277509 merged by Ssingh:

[operations/puppet@production] profile::hcaptcha: move to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1277509

Change #1278426 merged by Muehlenhoff:

[operations/puppet@production] apt/staging: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1278426

Change #1278491 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] puppetboard: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1278491

Change #1278491 merged by Muehlenhoff:

[operations/puppet@production] puppetboard: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1278491

Change #1277506 merged by Jcrespo:

[operations/puppet@production] profile::mediabackup: move to the discovery2026 pki intermediate

https://gerrit.wikimedia.org/r/1277506

Change #1278499 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::crm: move the pki intermediate to discovery2026

https://gerrit.wikimedia.org/r/1278499

I thought my Puppet code had a bug where a refresh was not enough to reload the TLS configuration, but that wasn't the issue: the automatic refresh from Puppet was enough. The problem I had above was due to long-running client requests, not the server. Nevertheless, I did a full restart of the service for testing and then verified that every open port serves the discovery2026 cert (checked with openssl). All good on my side.

To keep archives happy - I rolled back https://gerrit.wikimedia.org/r/c/operations/puppet/+/1278480 and from Moritz's tests, everything seems to work fine. The other change, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1277452/, was the only one needed.

Brian is going to force an Envoy config rebuild on every wdqs host, since Puppet cannot fix them. Example: cumin -m async 'wdqs1026*' 'depool' '/usr/local/sbin/build-envoy-config -c /etc/envoy' 'systemctl restart envoyproxy' 'pool'
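For repeating that rebuild across several host selections, a small sketch that assembles the same cumin invocation with safe quoting. The steps and flags are copied from the example above; the function name `cumin_rebuild_command` is hypothetical.

```python
import shlex

# Steps copied from the example above, in the order they are given there.
REBUILD_STEPS = [
    "depool",
    "/usr/local/sbin/build-envoy-config -c /etc/envoy",
    "systemctl restart envoyproxy",
    "pool",
]


def cumin_rebuild_command(host_query):
    """Return a shell-quoted cumin command line for the given host selection."""
    return shlex.join(["cumin", "-m", "async", host_query, *REBUILD_STEPS])


if __name__ == "__main__":
    print(cumin_rebuild_command("wdqs1026*"))
```

The sketch only prints the command; running it is left to the operator so that each batch can be reviewed first.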

Moritz created subtasks for all teams that manage tlsproxy instances, to force the new intermediate to be picked up.

Change #1278499 merged by Elukey:

[operations/puppet@production] role::crm: move the pki intermediate to discovery2026

https://gerrit.wikimedia.org/r/1278499

Change #1278527 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: switch to discovery2026 for envoy

https://gerrit.wikimedia.org/r/1278527

Change #1278546 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] titan: switch to discovery2026 for envoy

https://gerrit.wikimedia.org/r/1278546

Change #1278561 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wdqs: remove references to defunct role wdqs::internal

https://gerrit.wikimedia.org/r/1278561

Change #1278546 merged by Herron:

[operations/puppet@production] titan: switch to discovery2026 for envoy

https://gerrit.wikimedia.org/r/1278546

Change #1278527 merged by Herron:

[operations/puppet@production] prometheus: switch to discovery2026 for envoy

https://gerrit.wikimedia.org/r/1278527

Change #1278595 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus::pop: switch to discovery2026

https://gerrit.wikimedia.org/r/1278595

Change #1278595 merged by Herron:

[operations/puppet@production] prometheus::pop: switch to discovery2026

https://gerrit.wikimedia.org/r/1278595

Change #1278610 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] wcqs: Migrate to new discovery intermediate

https://gerrit.wikimedia.org/r/1278610

Change #1279268 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] etherpad: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279268

Change #1279268 merged by Jelto:

[operations/puppet@production] etherpad: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279268

Change #1279273 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] aphlict: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279273

Change #1279273 merged by Jelto:

[operations/puppet@production] aphlict: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279273

Change #1279274 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] phabricator: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279274

Change #1279274 merged by Jelto:

[operations/puppet@production] phabricator: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279274

Change #1279282 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] peopleweb: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279282

Change #1279282 merged by Jelto:

[operations/puppet@production] peopleweb: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279282

Change #1279287 had a related patch set uploaded (by Jelto; author: Jelto):

[operations/puppet@production] doc: Switch to discovery2026 intermediate for Envoy

https://gerrit.wikimedia.org/r/1279287

Change #1279340 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] tlsproxy::envoy: Bump default now that services have moved

https://gerrit.wikimedia.org/r/1279340

Change #1278610 merged by Bking:

[operations/puppet@production] wcqs: Migrate to new discovery intermediate

https://gerrit.wikimedia.org/r/1278610

I ran a fleet-wide grep tls_cert -A 4 /etc/envoy/envoy.yaml via Cumin on C:profile::envoy, and only the new discovery2026 certs and a handful of ACME certs came up.
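A rough sketch of how output from a check like this could be tallied by intermediate label. It assumes certificate file names follow the `<label>__<name>` pattern seen in the CSR paths quoted earlier in this task (for example `discovery2026__...`); the real paths in envoy.yaml may differ, and `count_labels` is an illustrative name.

```python
import re
from collections import Counter

# Assumed naming convention: a path component of the form "<label>__<rest>".
LABEL = re.compile(r"/(?P<label>[A-Za-z0-9]+)__")


def count_labels(grep_output):
    """Count lines per PKI label; lines without a labelled path are ignored."""
    counts = Counter()
    for line in grep_output.splitlines():
        match = LABEL.search(line)
        if match:
            counts[match["label"]] += 1
    return counts


def stale_lines(grep_output, old_label="discovery"):
    """Number of lines that still reference the old intermediate's label."""
    return count_labels(grep_output)[old_label]
```

A non-zero `stale_lines` result would point at hosts that still need an Envoy config rebuild.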

Change #1279347 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::crm: update postfix's cfssl pki intermediate

https://gerrit.wikimedia.org/r/1279347

Change #1279347 merged by Elukey:

[operations/puppet@production] role::crm: update postfix's cfssl pki intermediate

https://gerrit.wikimedia.org/r/1279347

To keep everyone up to date: this morning Moritz asked me to look at why debmonitor-client was failing on the two PKI hosts (pki1001.eqiad.wmnet, pki2002.codfw.wmnet) with:

SSLError(SSLError(1, '[SSL: SSLV3_ALERT_CERTIFICATE_EXPIRED] sslv3 alert certificate expired (_ssl.c:2636)')): /hosts/pki1001.eqiad.wmnet/update

After debugging it for a bit I discovered that /etc/debmonitor/ssl/debmonitor__pki1001_eqiad_wmnet.chained.pem contained a valid leaf certificate and a correct, valid chained intermediate. The problem was that the leaf's signer (X509v3 Authority Key Identifier) did not match the intermediate's identifier (X509v3 Subject Key Identifier); the leaf was most likely signed by the old intermediate.
As this happened only on the two PKI hosts, it's possible that the way certificates are generated on the PKI hosts themselves, either by design or because of the rollout procedure, caused this issue.
I solved the problem by moving the .pem and .chained.pem files aside and running puppet on the two hosts. They now have a leaf with the proper signer ID, and the debmonitor client is working fine.
I've left the frankenstein certs on disk with a .wrong extension in case anyone wants to dig further and fully understand the chain of events that generated the wrong "package".

This could also qualify as a bug in cfssl (or in the way we use it), since it generated a chained cert whose leaf is not signed by the intermediate included in the chain.
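The mismatch described above can be checked mechanically. The sketch below compares the leaf's Authority Key Identifier with the intermediate's Subject Key Identifier, working on the text printed by `openssl x509 -noout -text` for each certificate. It assumes that output layout (extension header followed by a colon-separated hex id, optionally prefixed with `keyid:`); the function names are illustrative.

```python
import re


def key_id(cert_text, kind):
    """Return the hex id printed under 'X509v3 <kind> Key Identifier', or None.

    `kind` is "Subject" or "Authority". An optional "keyid:" prefix is dropped,
    since some OpenSSL versions print it for the authority key identifier.
    """
    pattern = re.compile(
        r"X509v3 " + re.escape(kind) + r" Key Identifier:\s*(?:keyid:)?([0-9A-Fa-f:]+)"
    )
    match = pattern.search(cert_text)
    return match.group(1).upper() if match else None


def leaf_matches_intermediate(leaf_text, intermediate_text):
    """True when the leaf's AKI equals the intermediate's SKI."""
    aki = key_id(leaf_text, "Authority")
    ski = key_id(intermediate_text, "Subject")
    return aki is not None and aki == ski
```

Running a check like this against every block of a .chained.pem file after a rotation would flag a leaf that was signed by a different intermediate than the one chained with it.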

Change #1282350 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::pki: remove the 'discovery' intermediate's config

https://gerrit.wikimedia.org/r/1282350

Updated the documentation: https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_an_existing_intermediate

Next steps: