beta-scap-sync-world failure — SSL peer certificate or SSH remote key was not OK
Closed, ResolvedPublic

Description

Fatal error: Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki-staging/php-master/includes/config/EtcdConfig.php:229

https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/106736/console

Event Timeline

TheresNoTime renamed this task from beta-scap-sync-world failure to beta-scap-sync-world failure — SSL peer certificate or SSH remote key was not OK. Jun 8 2023, 6:42 PM
TheresNoTime updated the task description.
samtar@deployment-mediawiki12:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud
Info: Applying configuration version '(115bc91d449) root - [LOCAL HACK] scap: foreachwikiindblist: always filter for all-labs'
Notice: /Stage[main]/Profile::Beta::Mediawiki_packages/Package[lilypond/buster-backports]/ensure: created (corrective)
Notice: /Stage[main]/Profile::Beta::Mediawiki_packages/Package[lilypond-data/buster-backports]/ensure: created (corrective)
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[renew certificate - discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/returns: 2023/06/08 19:01:11 [INFO] Using client auth with mutual-tls-cert: /var/lib/puppet/ssl/certs/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem and mutual-tls-key: /var/lib/puppet/ssl/private_keys/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[renew certificate - discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/returns: 2023/06/08 19:01:11 [INFO] Using trusted CA from tls-remote-ca: /etc/ssl/localcerts/pki_api_CA.pem
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[renew certificate - discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/returns: {"code":7400,"message":"failed POST to https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign: Post \"https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign\": remote error: tls: expired certificate"}
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[renew certificate - discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/returns: Failed to parse input: unexpected end of JSON input
Error: '/usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /var/lib/puppet/ssl/certs/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -label discovery -profile server /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server.csr | /usr/bin/cfssljson -bare /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server
' returned 1 instead of one of [0]
Error: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[renew certificate - discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /var/lib/puppet/ssl/certs/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -label discovery -profile server /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server.csr | /usr/bin/cfssljson -bare /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server
' returned 1 instead of one of [0] (corrective)
Notice: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[create chained cert /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server.chain.pem]: Dependency Exec[renew certificate - discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server] has failures: true
Warning: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/Exec[create chained cert /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server.chain.pem]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Cfssl::Cert[discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server]/File[/etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server.chained.pem]: Skipping because of failed dependencies
Warning: /Stage[main]/Envoyproxy/Systemd::Service[envoyproxy.service]/Service[envoyproxy.service]: Skipping because of failed dependencies
Info: Stage[main]: Unscheduling all events on Stage[main]
Notice: Applied catalog in 13.91 seconds

I have re-disabled https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ to avoid noise while this problem is being investigated.

It does look like that, but there are no merge conflicts on deployment-puppetmaster04 this time (unfortunately)

Trying to re-run cfssl manually shows an expired cert...somewhere

root@deployment-mediawiki12:~# GODEBUG=x509ignoreCN=0 /usr/bin/cfssl sign -config /etc/cfssl/client-cfssl.conf -tls-remote-ca /etc/ssl/localcerts/pki_api_CA.pem -mutual-tls-client-cert /var/lib/puppet/ssl/certs/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -mutual-tls-client-key /var/lib/puppet/ssl/private_keys/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -label discovery -profile server /etc/envoy/ssl/discovery__appservers_svc_deployment-prep_eqiad1_wikimedia_cloud_server.csr
2023/06/08 19:27:21 [INFO] Using client auth with mutual-tls-cert: /var/lib/puppet/ssl/certs/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem and mutual-tls-key: /var/lib/puppet/ssl/private_keys/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem
2023/06/08 19:27:21 [INFO] Using trusted CA from tls-remote-ca: /etc/ssl/localcerts/pki_api_CA.pem
{"code":7400,"message":"failed POST to https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign: Post \"https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443/api/v1/cfssl/authsign\": remote error: tls: expired certificate"}

And the cert from pki-intermediate.pki.eqiad1.wikimedia.cloud looks ok:

root@deployment-mediawiki12:~# openssl s_client -connect pki-intermediate.pki.eqiad1.wikimedia.cloud:443 -cert /var/lib/puppet/ssl/certs/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem -key /var/lib/puppet/ssl/private_keys/deployment-mediawiki12.deployment-prep.eqiad1.wikimedia.cloud.pem 2> /dev/null | openssl x509 -noout -dates
notBefore=Mar  4 10:32:10 2021 GMT
notAfter=Mar  4 10:32:10 2026 GMT

And the date on the host is: Thu 08 Jun 2023 07:45:23 PM UTC so...
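
The date comparison above can also be left to openssl itself. A minimal sketch, assuming only that `openssl` is on the PATH; `cert_status` is a hypothetical helper name, not something from the hosts in this task:

```shell
# cert_status FILE
# Print "valid", "expired" or "unreadable" for the first certificate in FILE,
# judged against the local clock of the host running the check.
cert_status() {
    cs_file="$1"
    # First make sure the file parses as a certificate at all.
    if ! openssl x509 -noout -in "$cs_file" >/dev/null 2>&1; then
        echo "unreadable"
        return 2
    fi
    # -checkend 0 exits 0 if the certificate is still valid right now,
    # and non-zero once its notAfter date has passed.
    if openssl x509 -noout -checkend 0 -in "$cs_file" >/dev/null 2>&1; then
        echo "valid"
    else
        echo "expired"
        return 1
    fi
}
```

For example, `cert_status /etc/ssl/localcerts/pki_api_CA.pem`. Note that this only covers certificates readable on the local host; it says nothing about what the remote side has in its trust store.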

I wonder what the time is on host pki-intermediate.pki.eqiad1.wikimedia.cloud.

Adding another datapoint:

root@deployment-etcd02:~# hostname -f
deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud

# Check the dates on the client certificate that will be used when
# connecting to pki-intermediate.pki.eqiad1.wikimedia.cloud
root@deployment-etcd02:~# openssl x509 -noout -dates -in /var/lib/puppet/ssl/certs/deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud.pem 
notBefore=Mar  4 13:39:30 2021 GMT
notAfter=Mar  4 13:39:30 2026 GMT
# Good

# Check the dates on the CA cert that we'll use to verify pki-intermediate.pki.eqiad1.wikimedia.cloud
root@deployment-etcd02:~# openssl x509 -in /etc/ssl/localcerts/pki_api_CA.pem -dates -noout
notBefore=Mar  2 11:44:30 2021 GMT
notAfter=Mar  2 11:44:30 2026 GMT
# Good

root@deployment-etcd02:~# curl --cert /var/lib/puppet/ssl/certs/deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud.pem --key /var/lib/puppet/ssl/private_keys/deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud.pem --cacert /etc/ssl/localcerts/pki_api_CA.pem -v https://pki-intermediate.pki.eqiad1.wikimedia.cloud:443
...
*   Trying 172.16.5.134...
* Connected to pki-intermediate.pki.eqiad1.wikimedia.cloud (172.16.5.134) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/localcerts/pki_api_CA.pem
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_CHACHA20_POLY1305_SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=pki-intermediate.pki.eqiad1.wikimedia.cloud
*  start date: Mar  4 10:32:10 2021 GMT
*  expire date: Mar  4 10:32:10 2026 GMT
*  common name: pki-intermediate.pki.eqiad1.wikimedia.cloud (matched)
*  issuer: CN=Puppet CA: pki-pm.pki.eqiad1.wikimedia.cloud
*  SSL certificate verify ok.
> GET / HTTP/1.1
> Host: pki-intermediate.pki.eqiad1.wikimedia.cloud
> User-Agent: curl/7.64.0
> Accept: */*
>
* TLSv1.3 (IN), TLS alert, certificate expired (557):
* OpenSSL SSL_read: error:14094415:SSL routines:ssl3_read_bytes:sslv3 alert certificate expired, errno 0
* Closing connection 0
curl: (56) OpenSSL SSL_read: error:14094415:SSL routines:ssl3_read_bytes:sslv3 alert certificate expired, errno 0

The dates on the cert supplied by pki-intermediate.pki.eqiad1.wikimedia.cloud are good.

The connection fails with a code indicating that the client certificate has expired, which makes me wonder what date pki-intermediate.pki.eqiad1.wikimedia.cloud thinks it is.

$ ssh root@pki-intermediate.pki.eqiad1.wikimedia.cloud
$ date
Thu 08 Jun 2023 09:38:53 PM UTC
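
To rule out clock skew without eyeballing two `date` outputs taken at different times, the comparison could be scripted. A rough sketch; `skew_seconds` is a hypothetical helper, and the usage line assumes the same root SSH access used above and a `date` that supports `+%s`:

```shell
# skew_seconds LOCAL_EPOCH REMOTE_EPOCH
# Print the absolute difference between two Unix timestamps, in seconds.
skew_seconds() {
    ss_diff=$(( $1 - $2 ))
    if [ "$ss_diff" -lt 0 ]; then
        ss_diff=$(( 0 - ss_diff ))
    fi
    echo "$ss_diff"
}

# Usage (not run here; needs SSH access to the PKI host):
#   skew_seconds "$(date -u +%s)" \
#       "$(ssh root@pki-intermediate.pki.eqiad1.wikimedia.cloud date -u +%s)"
```

A result of a few seconds would mostly reflect SSH round-trip time; a certificate that looks expired only because of the clock would need a skew of months or years.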

Change 928703 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: disable all beta cluster jobs

https://gerrit.wikimedia.org/r/928703

Change 928703 merged by jenkins-bot:

[integration/config@master] jjb: disable all beta cluster jobs

https://gerrit.wikimedia.org/r/928703

hashar triaged this task as Unbreak Now! priority. Jun 9 2023, 6:34 AM
hashar subscribed.

I have disabled all three Jenkins jobs since the beta-update-databases-eqiad one was alarming as well this morning.

Since deployment-prep / Beta-Cluster-Infrastructure is not being updated and our developers rely on it to gauge the quality of the MediaWiki deployment, I am marking this task as a blocker for next week's train (T337527). As such it is now an Unbreak Now!.

Tagging in some PKI Cloud admins for some help: @elukey @JMeybohm

Adding the few others listed as admins of the PKI service :)

And finally (sidenote: why did this take so long to fail..?) https://en.wikipedia.beta.wmflabs.org is down with the same (expected) error

Reiterating my comment on IRC:
I'm taking wild guesses now — https://wikitech.wikimedia.org/wiki/PKI/Cloud#Authorising_puppet_agents_for_a_specific_project to me suggests that the output of cat $(sudo facter -p puppet_config.localcacert) on say deployment-mediawiki12 should be present in https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/files/pki/cloud/client_auth_CA.pem, but it is not.

Also maybe worth noting from the results of openssl s_client -connect pki-intermediate.pki.eqiad1.wikimedia.cloud:443 -prexit is

CN = Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs

The current puppetmaster is deployment-puppetmaster04
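
The guess above (that the agents' current puppet CA is missing from the client_auth_CA.pem bundle trusted by the PKI service) could be checked by comparing fingerprints. A minimal sketch, assuming a local copy of the bundle and `openssl`, `awk` and `mktemp` on the PATH; `ca_in_bundle` is a hypothetical helper name:

```shell
# ca_in_bundle CA_CERT BUNDLE
# Exit 0 if the SHA-256 fingerprint of CA_CERT matches any certificate in the
# PEM bundle BUNDLE, 1 if none matches, 2 if CA_CERT cannot be read.
ca_in_bundle() {
    cib_ca="$1"
    cib_bundle="$2"
    cib_want=$(openssl x509 -noout -fingerprint -sha256 -in "$cib_ca" 2>/dev/null) || return 2
    cib_tmp=$(mktemp -d) || return 2
    # openssl x509 only reads the first certificate in a file, so split the
    # bundle into one file per certificate before fingerprinting each.
    awk -v dir="$cib_tmp" '
        /-----BEGIN CERTIFICATE-----/ { n++ }
        n > 0 { print > (dir "/cert-" n ".pem") }
    ' "$cib_bundle"
    cib_found=1
    for cib_f in "$cib_tmp"/cert-*.pem; do
        [ -e "$cib_f" ] || continue
        cib_have=$(openssl x509 -noout -fingerprint -sha256 -in "$cib_f" 2>/dev/null)
        if [ "$cib_have" = "$cib_want" ]; then
            cib_found=0
            break
        fi
    done
    rm -rf "$cib_tmp"
    return "$cib_found"
}
```

On an agent this might be run as `ca_in_bundle "$(sudo facter -p puppet_config.localcacert)" client_auth_CA.pem && echo present || echo missing`, where `client_auth_CA.pem` is a copy fetched from the operations-puppet repository.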

Change 928851 had a related patch set uploaded (by Samtar; author: Samtar):

[operations/puppet@production] cloud pki: add (new) add deployment-prep agents as authorised clients

https://gerrit.wikimedia.org/r/928851

(100% just a guess based off of this guide)

Change 928856 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] deployment-prep: add new puppet ca public cert

https://gerrit.wikimedia.org/r/928856

Change 928856 merged by Jbond:

[operations/puppet@production] deployment-prep: add new puppet ca public cert

https://gerrit.wikimedia.org/r/928856

Change 928851 abandoned by Samtar:

[operations/puppet@production] cloud pki: add (new) add deployment-prep agents as authorised clients

Reason:

Superseded by If211a36e7c9ee61d1c673d34fcee90ff7ac6dce6

https://gerrit.wikimedia.org/r/928851

Hi all, the puppet CA certificate for deployment-prep expired a few weeks back (T335689). This would have been failing since then. I have now added the new certificate to the PKI services, so things should be working again.

puppet now runs successfully on deployment-mediawiki12, but https://en.wikipedia.beta.wmflabs.org still shows Uncaught ConfigException: Failed to load configuration from etcd: (curl error: 60) SSL peer certificate or SSH remote key was not OK in /srv/mediawiki/php-master/includes/config/EtcdConfig.php:229 — anything else needing to be done first?

The puppet certs are often reused for client auth in other areas, e.g. puppet::expose_agent_certs. It's possible that some other code has the old CA certificate configured as its trust store; a wild guess could be /etc/conftool/ssl? Or do a quick and dirty find / -name ca.pem -exec openssl x509 -in {} -noout -dates -subject \; (I think the puppet-managed ones are normally named ca.pem; no idea how many false positives this may produce).
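
The quick and dirty scan suggested above could look roughly like this. It is only a sketch: `scan_ca_certs` is a hypothetical name, and the default file name `ca.pem` is the guess from the comment above, not a confirmed convention:

```shell
# scan_ca_certs [DIR] [NAME]
# For every regular file called NAME under DIR, print its path followed by the
# validity dates and subject of the first certificate it contains.
scan_ca_certs() {
    sc_dir="${1:-/}"
    sc_name="${2:-ca.pem}"
    # -exec needs the terminating \; and errors from unreadable directories or
    # non-certificate files are discarded to keep the output short.
    find "$sc_dir" -type f -name "$sc_name" \
        -exec sh -c 'printf "%s\n" "$1"; openssl x509 -in "$1" -noout -dates -subject' sh {} \; \
        2>/dev/null
}
```

For example, `scan_ca_certs /etc` limits the search to /etc. Any path whose notAfter date is in the past is a candidate for a stale trust store.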

Everything seems to be recovering now (I didn't change anything post-comment) 🤷‍♀️

TheresNoTime lowered the priority of this task from Unbreak Now! to Needs Triage. Jun 9 2023, 3:27 PM

No longer blocking the train, dropping priority — leaving open for follow-up work(?)

Change 928617 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[integration/config@master] Revert "jjb: disable all beta cluster jobs"

https://gerrit.wikimedia.org/r/928617

Change 928617 merged by jenkins-bot:

[integration/config@master] Revert "jjb: disable all beta cluster jobs"

https://gerrit.wikimedia.org/r/928617

bd808 assigned this task to jbond.