Investigate PKI errors
After migrating the PKI infrastructure to puppet7 we started to see issues with certificate renewal. The main error seen was:

2023/10/31 10:10:37 [INFO] generate received request
2023/10/31 10:10:37 [INFO] received CSR
2023/10/31 10:10:37 [INFO] generating key: ecdsa-256
2023/10/31 10:10:37 [INFO] encoded CSR
2023/10/31 10:10:37 [INFO] Using client auth with mutual-tls-cert: /etc/cfssl/mutual_tls_client_cert.pem and mutual-tls-key: /var/lib/puppet/ssl/private_keys/pki2002.codfw.wmnet.pem
2023/10/31 10:10:37 [INFO] Using trusted CA from tls-remote-ca: /etc/ssl/certs/wmf-ca-certificates.crt
{"code":7400,"message":"failed POST to https://pki.discovery.wmnet:443/api/v1/cfssl/authsign: Post \"https://pki.discovery.wmnet:443/api/v1/cfssl/authsign\": x509: issuer name does not match subject from issuing certificate"}

which is caused by the strict processing in Go and the lax SSL implementation in Puppet. Correction: it seems it is actually Go that is in the wrong here.
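For context on the error text: Go's crypto/x509 reports "issuer name does not match subject from issuing certificate" when the issuer field of a certificate is not byte-for-byte identical to the subject field of the certificate it chains to. The sketch below is illustrative only. The CA name and the string-type difference are hypothetical, and this task does not establish which difference triggered the error here. It just shows how two names that read the same can still differ in their DER bytes.

```python
def tlv(tag: int, payload: bytes) -> bytes:
    """Encode one DER tag-length-value item (short-form length only)."""
    if len(payload) >= 128:
        raise ValueError("sketch only handles payloads under 128 bytes")
    return bytes([tag, len(payload)]) + payload


# DER encoding of the commonName attribute type, OID 2.5.4.3.
CN_OID = bytes([0x06, 0x03, 0x55, 0x04, 0x03])

# ASN.1 universal tags for two string types allowed in a name.
PRINTABLE_STRING = 0x13
UTF8_STRING = 0x0C


def encode_name(common_name: str, string_tag: int) -> bytes:
    """Build a minimal X.501 Name holding a single CN attribute."""
    attribute = tlv(0x30, CN_OID + tlv(string_tag, common_name.encode("ascii")))
    rdn = tlv(0x31, attribute)   # SET OF AttributeTypeAndValue
    return tlv(0x30, rdn)        # SEQUENCE OF RelativeDistinguishedName


def names_match_bytewise(issuer: bytes, subject: bytes) -> bool:
    """Raw byte comparison, with no normalisation of string types or case."""
    return issuer == subject


# Hypothetical example: same visible CN, different string encoding.
issuer_in_leaf = encode_name("Example CA", PRINTABLE_STRING)
subject_in_ca = encode_name("Example CA", UTF8_STRING)

print(names_match_bytewise(issuer_in_leaf, subject_in_ca))
```

A verifier that normalises names before comparing would accept this pair, while a byte comparison would not.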

It was also noticed that the OCSP refresh process was failing with:

ERROR:root:debmonitor issue with SQL query: (2003, "Can't connect  to MySQL server on 'm1-master.eqiad.wmnet' ([SSL:CERTIFICATE_VERIFY_FAILED] certificate veri>

which was fixed by updating the CA trust bundle.

To fix the issue I have now depooled pki2002, which is still using puppet7 so we can debug, and rolled back pki1001 to puppet5.

The first issues in codfw seem to have occurred at 00:05:33 and in eqiad at 00:23:53.

The last successful sign in eqiad was at:

Oct 30 19:15:28 pki1001 multirootca[2965215]: 2023/10/30 19:15:28 [INFO] signed certificate with serial number 21618685619994382036873124855151197376736024896
Oct 30 19:15:28 pki1001 multirootca[2965215]: 2023/10/30 19:15:28 [INFO] signature: requester=, label=debmonitor, profile=, serialno=21618685619994382036873124855151197376736024896

In codfw it was at 2023-10-30T23:04:02:

Oct 30 23:28:00 pki2002 multirootca[2684169]: 2023/10/30 23:28:00 [INFO] signed certificate with serial number 658557859282894506935032635260855593420734825359
Oct 30 23:28:00 pki2002 multirootca[2684169]: 2023/10/30 23:28:00 [INFO] signature: requester=,label=discovery,profile=k8s_staging,serialno=658557859282894506935032635260855593420734825359
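The last successful signature per host can be pulled out of log lines like the ones above with a small script. This is a minimal sketch that assumes the line layout shown here (with or without spaces after the commas). The function name `last_signature_per_host` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the multirootca "signature:" lines quoted above. The spacing after
# the commas differs between the two hosts, so it is treated as optional.
SIGNATURE_RE = re.compile(
    r"(?P<host>\S+) multirootca\[\d+\]: "
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[INFO\] signature: "
    r"requester=(?P<requester>[^,]*),\s*label=(?P<label>[^,]*),\s*"
    r"profile=(?P<profile>[^,]*),\s*serialno=(?P<serial>\d+)"
)


def last_signature_per_host(lines):
    """Return {host: entry} holding the most recent signature seen per host."""
    latest = {}
    for line in lines:
        match = SIGNATURE_RE.search(line)
        if match is None:
            continue  # not a signature line
        entry = match.groupdict()
        entry["time"] = datetime.strptime(entry.pop("ts"), "%Y/%m/%d %H:%M:%S")
        host = entry["host"]
        if host not in latest or entry["time"] > latest[host]["time"]:
            latest[host] = entry
    return latest


sample = [
    "Oct 30 19:15:28 pki1001 multirootca[2965215]: 2023/10/30 19:15:28 [INFO] signature: requester=, label=debmonitor, profile=, serialno=21618685619994382036873124855151197376736024896",
    "Oct 30 23:28:00 pki2002 multirootca[2684169]: 2023/10/30 23:28:00 [INFO] signature: requester=,label=discovery,profile=k8s_staging,serialno=658557859282894506935032635260855593420734825359",
]

for host, entry in sorted(last_signature_per_host(sample).items()):
    print(host, entry["time"].isoformat(), entry["label"])
```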

The PKI systems were migrated at 17:40.

It seems Apache reloads at 00:00 every night; I believe this is what caused the issue. The PKI certificates were rotated to puppet7 at 17:40, however Apache didn't restart and start using the new certificate on the front end until 00:00.
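As a quick sanity check of that theory, the timestamps quoted above can be lined up against the reload time. This is a rough sketch. The calendar dates attached to 17:40 and to the first failures are assumptions (2023-10-30 and 2023-10-31 respectively), and the last-success times are taken from the log lines.

```python
from datetime import datetime

# Assumed dates: migration on 2023-10-30, nightly Apache reload at 00:00 on 2023-10-31.
MIGRATED = datetime(2023, 10, 30, 17, 40)
APACHE_RELOAD = datetime(2023, 10, 31, 0, 0)

SITES = {
    "eqiad": {
        "last_success": datetime(2023, 10, 30, 19, 15, 28),
        "first_failure": datetime(2023, 10, 31, 0, 23, 53),
    },
    "codfw": {
        "last_success": datetime(2023, 10, 30, 23, 28, 0),
        "first_failure": datetime(2023, 10, 31, 0, 5, 33),
    },
}


def consistent_with_reload(last_success, first_failure):
    """True if signing still worked after the migration and only broke after the reload."""
    return MIGRATED < last_success < APACHE_RELOAD <= first_failure


for site, times in SITES.items():
    delay = times["first_failure"] - APACHE_RELOAD
    print(site, consistent_with_reload(**times), "first failure", delay, "after reload")
```

Both sites kept signing for hours after the migration and only failed after midnight, which fits the reload explanation but does not prove it.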

jbond changed the task status from Open to In Progress. Oct 31 2023, 10:34 AM
jbond triaged this task as Medium priority.

Change 970338 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pki::multirootca: Add puppet_rsa to multirootca

Change 970338 merged by Jbond:

[operations/puppet@production] pki::multirootca: Add puppet_rsa to multirootca

Change 970339 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] pki::multirootca: Add parameter so pki can generate its certs

Change 970339 merged by Jbond:

[operations/puppet@production] pki::multirootca: Add parameter so pki can generate its certs

Change 970369 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] cfssl::ocsp: use client mtls certs if present

Change 970369 merged by Jbond:

[operations/puppet@production] cfssl::ocsp: use client mtls certs if present

jbond claimed this task.

This is fixed now