
Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep
Closed, Resolved (Public)

Description

Common information

  • summary: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep
  • alertname: PuppetAgentFailure
  • instance: deployment-cache-upload08
  • job: node
  • project: deployment-prep
  • severity: warning

Firing alerts


  • summary: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep
  • alertname: PuppetAgentFailure
  • instance: deployment-cache-upload08
  • job: node
  • project: deployment-prep
  • severity: warning

Event Timeline

bd808 triaged this task as High priority. Mar 5 2026, 11:01 PM
bd808 moved this task from To Triage to Puppet errors on the Beta-Cluster-Infrastructure board.
bd808 subscribed.
bd808@deployment-cache-upload08.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(704df6ff1c) gitpuppet - MediaWiki: Only proxy existing .php files, otherwise return nice 404'
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[renew certificate - discovery__purged]/returns: 2026/03/05 22:50:44 [INFO] Using client auth with mutual-tls-cert: /etc/cfssl/mutual_tls_client_cert.pem and mutual-tls-key: /var/lib/puppet/ssl/private_keys/deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud.pem
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[renew certificate - discovery__purged]/returns: 2026/03/05 22:50:44 [INFO] Using trusted CA from tls-remote-ca: /etc/ssl/localcerts/pki_api_CA.pem
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[renew certificate - discovery__purged]/returns: {"code":7400,"message":"failed POST to https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign: Post \"https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign\": x509: certificate has expired or is not yet valid: current time 2026-03-05T22:50:44Z is after 2026-03-02T11:44:30Z"}
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[renew certificate - discovery__purged]/returns: Failed to parse input: unexpected end of JSON input
Error: '/usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /etc/cfssl/mutual_tls_client_cert.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud.pem -label discovery  /etc/purged/ssl/discovery__purged.csr | /usr/bin/cfssljson -bare /etc/purged/ssl/discovery__purged
' returned 1 instead of one of [0]
Error: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[renew certificate - discovery__purged]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /etc/cfssl/mutual_tls_client_cert.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud.pem -label discovery  /etc/purged/ssl/discovery__purged.csr | /usr/bin/cfssljson -bare /etc/purged/ssl/discovery__purged
' returned 1 instead of one of [0] (corrective)
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[create chained cert /etc/purged/ssl/discovery__purged.chain.pem]: Dependency Exec[renew certificate - discovery__purged] has failures: true
Warning: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[create chained cert /etc/purged/ssl/discovery__purged.chain.pem]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/File[/etc/purged/ssl/discovery__purged.chained.pem]: Skipping because of failed dependencies
Warning: /Stage[main]/Purged/Systemd::Service[purged]/Service[purged]: Skipping because of failed dependencies
Notice: Applied catalog in 12.44 seconds

The problem buried in there is:

Notice: /Stage[main]/Main/Cfssl::Cert[discovery__purged]/Exec[renew certificate - discovery__purged]/returns: {"code":7400,"message":"failed POST to https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign: Post \"https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign\": x509: certificate has expired or is not yet valid: current time 2026-03-05T22:50:44Z is after 2026-03-02T11:44:30Z"}
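
To make the arithmetic in that message explicit, here is a minimal Python sketch that pulls the two timestamps out of the x509 error text and reports how long the certificate had been expired at the time of the Puppet run. The function name `expired_for` and the regular expression are illustrative assumptions that match only the message shape seen in this log; they are not part of cfssl or Puppet.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# x509 part of the error text copied from the Puppet run above.
MESSAGE = (
    "x509: certificate has expired or is not yet valid: "
    "current time 2026-03-05T22:50:44Z is after 2026-03-02T11:44:30Z"
)

# Matches the "current time <T1> is after <T2>" shape seen in this log.
EXPIRED_RE = re.compile(
    r"current time (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z "
    r"is after (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z"
)


def expired_for(message: str) -> Optional[timedelta]:
    """Return how long the cert had been expired, or None if no match."""
    match = EXPIRED_RE.search(message)
    if match is None:
        return None
    fmt = "%Y-%m-%dT%H:%M:%S"
    # Both timestamps are UTC ("Z"), so naive datetimes compare safely.
    current_time = datetime.strptime(match.group(1), fmt)
    not_after = datetime.strptime(match.group(2), fmt)
    return current_time - not_after


print(expired_for(MESSAGE))  # 3 days, 11:06:14
```

So the certificate that failed validation had already been expired for about three and a half days when this run happened.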

I can't find a Phab task for this, but there are 2 alerts that have been firing for more than a month about the Puppet CA certificate on pki-pm.pki.eqiad1.wikimedia.cloud expiring. That cert has now apparently expired and needs to be replaced. There are a lot of alerts for that project: https://prometheus-alerts.wmcloud.org/?q=%40state%3Dactive&q=project%3Dpki

The PKI admins are: @elukey, @JMeybohm, @Volans, @jbond, and @Ladsgroup.

Very weird:

elukey@pki-intermediate:~$ openssl s_client pki-intermediate.pki.eqiad1.wikimedia.cloud:443 | openssl x509 -text -noout
[..]
        Validity
            Not Before: Feb 10 08:09:05 2026 GMT
            Not After : Feb 10 08:09:05 2031 GMT

So the validity of the server certificate itself is not tied to the validity of the CA's certificate.

Ok found the problem:

elukey@pki-intermediate:~$ openssl s_client -connect pki-intermediate.pki.eqiad1.wikimedia.cloud:443 -showcerts
[..]
Certificate chain
 0 s:CN=pki-intermediate.pki.eqiad1.wikimedia.cloud
   i:CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
   a:PKEY: RSA, 4096 (bit); sigalg: sha256WithRSAEncryption
   v:NotBefore: Feb 10 08:09:05 2026 GMT; NotAfter: Feb 10 08:09:05 2031 GMT
[..]
 1 s:CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
   i:CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
   a:PKEY: RSA, 4096 (bit); sigalg: sha256WithRSAEncryption
   v:NotBefore: Mar  2 11:44:30 2021 GMT; NotAfter: Mar  2 11:44:30 2026 GMT
[..]
Server certificate
subject=CN=pki-intermediate.pki.eqiad1.wikimedia.cloud
issuer=CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud

The certificate authority that signed the cert for pki-intermediate has expired. In theory the procedure to follow is https://wikitech.wikimedia.org/wiki/PKI/CA_Operations#Renewing_a_new_intermediate
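
The same check can be done mechanically on the `-showcerts` output. The following is a rough Python sketch, not an existing tool: it parses the `s:` and `v:` lines of the chain shown above and lists the entries whose NotAfter is earlier than a given time. The helper names (`parse_openssl_date`, `expired_entries`) are made up for illustration, and the parsing assumes the exact line layout printed by this openssl version.

```python
import re
from datetime import datetime

# Chain lines copied from the s_client output above (key/sigalg lines omitted).
CHAIN_OUTPUT = """\
 0 s:CN=pki-intermediate.pki.eqiad1.wikimedia.cloud
   i:CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
   v:NotBefore: Feb 10 08:09:05 2026 GMT; NotAfter: Feb 10 08:09:05 2031 GMT
 1 s:CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
   i:CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
   v:NotBefore: Mar  2 11:44:30 2021 GMT; NotAfter: Mar  2 11:44:30 2026 GMT
"""

SUBJECT_RE = re.compile(r"^\s*(\d+) s:(.+)$")
VALIDITY_RE = re.compile(r"^\s*v:NotBefore: (.+?); NotAfter: (.+)$")


def parse_openssl_date(text):
    """Parse a date such as 'Mar  2 11:44:30 2026 GMT' (always GMT here)."""
    # Collapse the padding openssl uses for single-digit days ("Mar  2").
    return datetime.strptime(" ".join(text.split()), "%b %d %H:%M:%S %Y GMT")


def expired_entries(output, now):
    """Return (depth, subject, not_after) for every chain entry expired at `now`."""
    expired = []
    subject = None
    for line in output.splitlines():
        match = SUBJECT_RE.match(line)
        if match:
            subject = (int(match.group(1)), match.group(2))
            continue
        match = VALIDITY_RE.match(line)
        if match and subject is not None:
            not_after = parse_openssl_date(match.group(2))
            if now > not_after:
                expired.append((subject[0], subject[1], not_after))
    return expired


# Time of the failing Puppet run, taken from the cfssl error message.
for depth, subject, not_after in expired_entries(
    CHAIN_OUTPUT, datetime(2026, 3, 5, 22, 50, 44)
):
    print(depth, subject, "expired on", not_after)
```

With the time of the failing run as `now`, only entry 1 (the self-signed Puppet CA) is reported, which matches the reading of the chain above.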

Ok the issue is:

root@pki-puppetserver-1:/etc/puppet/puppetserver/ca# ls -l
total 36
-rw-r--r-- 1 puppet puppet 1365 Feb 11 08:06 ca_crl.pem
-rw-r--r-- 1 puppet puppet 2033 Mar  3  2021 ca_crt.pem
-rw-r----- 1 puppet puppet 3243 Mar  3  2021 ca_key.pem
-rw-r--r-- 1 puppet puppet  800 Mar  3  2021 ca_pub.pem

The puppetserver on pki-puppetserver-1.pki.eqiad1.wikimedia.cloud has been configured with the old CA used by pki-pm.pki.eqiad1.wikimedia.cloud (the old puppet master) to allow puppet to run on the other pki VMs without regenerating all certs. I see the new CA cert under /var/lib/puppet/ssl/certs/, but I can only find one ca_key.pem and it seems to be the old one.

@Andrew Hi! I think that you migrated this project when upgrading it from puppet 5 to puppet 7, is my understanding above correct? If so, do you recall if the new CA key is saved somewhere?

If not, what should I do? I have never regenerated the Puppet CA's cert; should I clean up the old ones and run puppetserver ca setup? Plus, of course, regenerate and sign all the certs in use on the project's VMs.

@elukey I doubt that I'll be of much help here. If the key isn't present on the puppetserver itself then it is likely lost -- I also don't really understand what's going on with the cross-project certs between pki and deployment-prep.

In almost all cases like this I would just wipe out and regenerate the client certs.

I also don't really understand what's going on with the cross-project certs between pki and deployment-prep.

There are a number of services for Beta Cluster where SRE teams have felt that they would be able to provide better support by leveraging other Cloud VPS projects. The ELK stack being in the logging project is one example. I am relatively certain this is one of those cases where the CFSSL PKI stuff is leveraged across Cloud VPS project boundaries.

@Andrew I asked because, from the logs, you created the puppetserver VM while keeping the old certs in place, so I was wondering whether that was part of a specific workflow that you follow on these occasions. I'll try to regenerate the puppetserver CA next week :)

I also don't really understand what's going on with the cross-project certs between pki and deployment-prep.

There are a number of services for Beta Cluster where SRE teams have felt that they would be able to provide better support by leveraging other Cloud VPS projects. The ELK stack being in the logging project is one example. I am relatively certain this is one of those cases where the CFSSL PKI stuff is leveraged across Cloud VPS project boundaries.

Exactly yes, this project is used by multiple projects because it was convenient.

Change #1249224 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] cloud: update the PKI's api trusted CA certificate

https://gerrit.wikimedia.org/r/1249224

Change #1249224 merged by Elukey:

[operations/puppet@production] cloud: update the PKI's api trusted CA certificate

https://gerrit.wikimedia.org/r/1249224

elukey claimed this task.
elukey@deployment-cache-upload08:~$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for deployment-cache-upload08.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(4b65c54ac6) gitpuppet - MediaWiki: Only proxy existing .php files, otherwise return nice 404'
Notice: Applied catalog in 13.66 seconds