Investigate PKI errors
Closed, Resolved (Public)

Description

After migrating the PKI infrastructure to Puppet 7, we started to see issues with certificate renewal. The main error seen was:

2023/10/31 10:10:37 [INFO] generate received request
2023/10/31 10:10:37 [INFO] received CSR
2023/10/31 10:10:37 [INFO] generating key: ecdsa-256
2023/10/31 10:10:37 [INFO] encoded CSR
2023/10/31 10:10:37 [INFO] Using client auth with mutual-tls-cert: /etc/cfssl/mutual_tls_client_cert.pem and mutual-tls-key: /var/lib/puppet/ssl/private_keys/pki2002.codfw.wmnet.pem
2023/10/31 10:10:37 [INFO] Using trusted CA from tls-remote-ca: /etc/ssl/certs/wmf-ca-certificates.crt
{"code":7400,"message":"failed POST to https://pki.discovery.wmnet:443/api/v1/cfssl/authsign: Post \"https://pki.discovery.wmnet:443/api/v1/cfssl/authsign\": x509: issuer name does not match subject from issuing certificate"}

which is caused by the strict processing in Go and the lax SSL implementation in Puppet. Correction: it seems it's actually Go that is in the wrong here.
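When triaging failures like the one above, it can help to pull the underlying TLS cause out of the cfssl JSON error. The following is a minimal sketch, assuming the error arrives as a single JSON object with `code` and `message` keys, as in the output above. `classify_cfssl_error` is a hypothetical helper name and is not part of cfssl.

```python
import json

# Error line copied from the cfssl output above (raw string keeps the \" escapes).
RAW_ERROR = r'{"code":7400,"message":"failed POST to https://pki.discovery.wmnet:443/api/v1/cfssl/authsign: Post \"https://pki.discovery.wmnet:443/api/v1/cfssl/authsign\": x509: issuer name does not match subject from issuing certificate"}'


def classify_cfssl_error(line: str) -> dict:
    """Hypothetical helper: split a cfssl JSON error into code and cause."""
    err = json.loads(line)
    message = err["message"]
    # Go prefixes certificate verification errors with "x509: "; everything
    # from that marker onwards is treated as the underlying cause.
    idx = message.find("x509: ")
    return {
        "code": err["code"],
        "tls_error": idx != -1,
        "cause": message[idx:] if idx != -1 else message,
    }


result = classify_cfssl_error(RAW_ERROR)
print(result["code"], result["tls_error"], result["cause"])
```

Grouping failures by `cause` makes it easier to tell a certificate chain problem apart from a connectivity problem when many clients fail at once.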

It was also noticed that the OCSP refresh process was failing with:

ERROR:root:debmonitor issue with SQL query: (2003, "Can't connect  to MySQL server on 'm1-master.eqiad.wmnet' ([SSL:CERTIFICATE_VERIFY_FAILED] certificate veri>

which was fixed by updating the CA trust bundle.

To fix the issue I have now depooled pki2002, which is still using Puppet 7 so we can debug, and rolled back pki1001 to Puppet 5.

The first issues seem to have occurred at 00:05:33 in codfw and at 00:23:53 in eqiad.

Event Timeline

The last successful sign in eqiad was at:

Oct 30 19:15:28 pki1001 multirootca[2965215]: 2023/10/30 19:15:28 [INFO] signed certificate with serial number 21618685619994382036873124855151197376736024896
Oct 30 19:15:28 pki1001 multirootca[2965215]: 2023/10/30 19:15:28 [INFO] signature: requester=127.0.0.1:52186, label=debmonitor, profile=, serialno=21618685619994382036873124855151197376736024896

and in codfw at 2023-10-30T23:04:02:

Oct 30 23:28:00 pki2002 multirootca[2684169]: 2023/10/30 23:28:00 [INFO] signed certificate with serial number 658557859282894506935032635260855593420734825359
Oct 30 23:28:00 pki2002 multirootca[2684169]: 2023/10/30 23:28:00 [INFO] signature: requester=127.0.0.1:45780,label=discovery,profile=k8s_staging,serialno=658557859282894506935032635260855593420734825359
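To find the last successful signature per host, the `signature:` lines can be parsed into fields. The following is a minimal sketch, assuming the line layout seen in the two excerpts above (note that one uses `, ` and the other `,` between fields). `parse_signature` and the constant names are hypothetical.

```python
import re
from datetime import datetime

# Sample lines copied from the journal excerpts above.
EQIAD_LINE = "Oct 30 19:15:28 pki1001 multirootca[2965215]: 2023/10/30 19:15:28 [INFO] signature: requester=127.0.0.1:52186, label=debmonitor, profile=, serialno=21618685619994382036873124855151197376736024896"
CODFW_LINE = "Oct 30 23:28:00 pki2002 multirootca[2684169]: 2023/10/30 23:28:00 [INFO] signature: requester=127.0.0.1:45780,label=discovery,profile=k8s_staging,serialno=658557859282894506935032635260855593420734825359"

SIGNATURE_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] signature: (?P<fields>.*)$"
)


def parse_signature(line: str):
    """Hypothetical helper: return the signature fields as a dict, or None."""
    match = SIGNATURE_RE.search(line)
    if match is None:
        return None
    fields = {}
    # Fields are key=value pairs separated by commas, with optional spaces.
    for item in match.group("fields").split(","):
        key, _, value = item.strip().partition("=")
        fields[key] = value
    fields["time"] = datetime.strptime(match.group("ts"), "%Y/%m/%d %H:%M:%S")
    return fields


for line in (EQIAD_LINE, CODFW_LINE):
    sig = parse_signature(line)
    print(sig["time"], sig["label"], sig["profile"] or "(none)")
```

Taking the maximum `time` over all parsed lines for a host gives that host's last successful signature.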

The PKI systems were migrated at 17:40.

It seems Apache reloads at 00:00 every night, and I believe this is what caused the issue. The PKI certificates were rotated to Puppet 7 at 17:40; however, Apache didn't restart and start using the new certificate on the front end until 00:00.
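The timestamps in this task can be checked against that hypothesis. The following is a minimal sketch: the times come from the task, while the calendar dates are assumptions (migration on 2023-10-30, first failures just after midnight on 2023-10-31). The codfw last-success time is taken from the log line, and the function name is hypothetical.

```python
from datetime import datetime

# Times from the task; the dates are assumptions, see the note above.
migration = datetime(2023, 10, 30, 17, 40)
apache_reload = datetime(2023, 10, 31, 0, 0)

last_success = {
    "eqiad": datetime(2023, 10, 30, 19, 15, 28),
    # Taken from the log line; the task text gives 23:04:02, which is
    # also before midnight and leads to the same conclusion.
    "codfw": datetime(2023, 10, 30, 23, 28, 0),
}
first_failure = {
    "codfw": datetime(2023, 10, 31, 0, 5, 33),
    "eqiad": datetime(2023, 10, 31, 0, 23, 53),
}


def consistent_with_reload(site: str) -> bool:
    """True if signing still worked after the migration and only failed
    once the nightly Apache reload had happened."""
    return migration < last_success[site] < apache_reload <= first_failure[site]


for site in ("eqiad", "codfw"):
    delay = first_failure[site] - apache_reload
    print(site, consistent_with_reload(site), "first failure after reload:", delay)
```

Both sites signed certificates successfully after the 17:40 migration and only failed after midnight, which fits the reload explanation better than the migration itself being the trigger.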

jbond changed the task status from Open to In Progress. Oct 31 2023, 10:34 AM
jbond triaged this task as Medium priority.

Change 970338 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pki::multirootca: Add puppet_rsa to multirootca

https://gerrit.wikimedia.org/r/970338

Change 970338 merged by Jbond:

[operations/puppet@production] pki::multirootca: Add puppet_rsa to multirootca

https://gerrit.wikimedia.org/r/970338

Change 970339 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pki::multirootca: Add parameter so pki can generate its certs

https://gerrit.wikimedia.org/r/970339

Change 970339 merged by Jbond:

[operations/puppet@production] pki::multirootca: Add parameter so pki can generate its certs

https://gerrit.wikimedia.org/r/970339

Change 970369 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] cfssl::ocsp: use client mtls certs if present

https://gerrit.wikimedia.org/r/970369

Change 970369 merged by Jbond:

[operations/puppet@production] cfssl::ocsp: use client mtls certs if present

https://gerrit.wikimedia.org/r/970369

jbond claimed this task.

This is fixed now