
acme-chief ldap certs required chained (with intermediate CA) versions suddenly
Closed, Resolved · Public

Description

This morning LDAP TLS connections started failing. The failure coincided with new certificates being created for our LDAP servers.

This appears to be the issue:

(old cert)

cat /etc/acmecerts/ldap/cae12c858fa6417d8d999bfaef1c25ec/rsa-2048.crt | openssl x509 -text | grep CN
        Issuer: C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
        Subject: CN = ldap-labs.eqiad.wikimedia.org

(new cert)

cat /etc/acmecerts/ldap/b547061e1e5343eaa1adfcb7de0d6ea7/rsa-2048.crt | openssl x509 -text | grep CN
        Issuer: C = US, O = Let's Encrypt, CN = R3
        Subject: CN = ldap-labs.eqiad.wikimedia.org

I have temporarily hacked the old certs back in place and disabled puppet on the following hosts:

seaborgium.wikimedia.org
serpens.wikimedia.org
ldap-replica100[1-2].wikimedia.org
ldap-replica200[3-4].wikimedia.org

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2021-01-03T15:30:18Z] <andrewbogott> disabling puppet fleet-wide to avert potential disaster from acme-chief cert rotation T271063

Note that during this outage, I was also unable to log in to the icinga web UI, and users reported gerrit issues. So there were several systems that failed to recognize the new certs in addition to slapd and ldap clients.

Some more semi-random info:

Change 653871 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] openldap: use chained certificate for slapd service

https://gerrit.wikimedia.org/r/653871

Change 653871 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openldap: use chained certificate for slapd service

https://gerrit.wikimedia.org/r/653871

Bstorm renamed this task from "acme-chief just generated invalid ldap certs" to "acme-chief ldap certs required chained (with intermediate CA) versions suddenly". · Jan 3 2021, 4:18 PM

acme-chief generated a valid certificate; the main difference between the current one and the previous one is the intermediate CA that issued it:

root@acmechief1001:/var/lib/acme-chief/certs/ldap# openssl x509 -dates -issuer -noout -in cae12c858fa6417d8d999bfaef1c25ec/rsa-2048.crt
notBefore=Nov  4 13:00:48 2020 GMT
notAfter=Feb  2 13:00:48 2021 GMT
issuer=C = US, O = Let's Encrypt, CN = Let's Encrypt Authority X3
root@acmechief1001:/var/lib/acme-chief/certs/ldap# openssl x509 -dates -issuer -noout -in live/rsa-2048.crt
notBefore=Jan  3 13:00:28 2021 GMT
notAfter=Apr  3 13:00:28 2021 GMT
issuer=C = US, O = Let's Encrypt, CN = R3

For some reason slapd wasn't configured to send the intermediate CA certificate along with the server certificate, and that began to cause certificate validation failures once the cert was issued by the new intermediate CA (C = US, O = Let's Encrypt, CN = R3).
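For anyone who wants to see the failure mode in isolation, here is a rough sketch that reproduces it with throwaway CAs rather than the real Let's Encrypt chain. Everything in it is made up for illustration: the "Demo Root" and "Demo Intermediate" names, the `ldap.example.test` subject, the file names and the `new_csr` helper. It assumes a reasonably recent `openssl` on the PATH. The client trusts only the root, so a leaf presented on its own should not verify, while the same leaf presented together with its intermediate should.

```shell
#!/bin/sh
# Sketch only: throwaway CAs standing in for root -> intermediate -> server cert.
set -eu
work=$(mktemp -d)
cd "$work"

# Extension file marking a certificate as a CA (used for root and intermediate).
printf 'basicConstraints=critical,CA:TRUE\n' > ca.ext

# Hypothetical helper: create an unencrypted key and a CSR for the given CN.
new_csr() {
    openssl req -newkey rsa:2048 -nodes -keyout "$1.key" -out "$1.csr" \
        -subj "/CN=$2" 2>/dev/null
}

# Self-signed root: the only certificate the "client" trusts.
new_csr root "Demo Root"
openssl x509 -req -in root.csr -signkey root.key -extfile ca.ext \
    -days 2 -out root.crt 2>/dev/null

# Intermediate CA, signed by the root.
new_csr int "Demo Intermediate"
openssl x509 -req -in int.csr -CA root.crt -CAkey root.key -CAcreateserial \
    -extfile ca.ext -days 2 -out int.crt 2>/dev/null

# Server (leaf) certificate, signed by the intermediate.
new_csr leaf "ldap.example.test"
openssl x509 -req -in leaf.csr -CA int.crt -CAkey int.key -CAcreateserial \
    -days 2 -out leaf.crt 2>/dev/null

# Case 1: the server sends only the leaf; the client has no way to reach the root.
if openssl verify -CAfile root.crt leaf.crt >/dev/null 2>&1; then
    leaf_only=ok
else
    leaf_only=fail
fi

# Case 2: the server also supplies the intermediate (passed here as -untrusted).
if openssl verify -CAfile root.crt -untrusted int.crt leaf.crt >/dev/null 2>&1; then
    chained=ok
else
    chained=fail
fi

# A "chained" file is just the leaf followed by the intermediate(s).
cat leaf.crt int.crt > chained.crt
chain_len=$(grep -c 'BEGIN CERTIFICATE' chained.crt)

echo "leaf only:            $leaf_only"
echo "leaf + intermediate:  $chained"
echo "certs in chained.crt: $chain_len"
```

This only models the validation step. It says nothing about why clients coped with the old X3-issued certs, and it is not the change that was merged in 653871.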

Bstorm triaged this task as Medium priority. · Jan 4 2021, 10:43 PM
Bstorm subscribed.

I think this is resolved now. I'll leave it open for a little while for people to disagree.

taavi assigned this task to Andrew.

Nobody has disagreed. Boldly closing.