
thanos internal TLS failure after puppet 7 update
Closed, Resolved, Public

Description

Since around 2023-11-16, several units have been failing on thanos-fe1001 (swift_dispersion_stats.service, swift_dispersion_stats_lowlatency.service, swift_ring_manager.service). What they have in common is that they all end up calling swift-dispersion-report under the hood. In practice, none of the thanos front-ends can currently run that command.

It fails because of a TLS error when trying to open an internal client connection:

urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)>

A bit of stracery told me this happened after attempting to connect to 10.2.2.54:443, i.e. thanos-swift.svc.eqiad.wmnet.

I can reproduce this on the CLI:

mvernon@thanos-fe1001:~$ openssl s_client -connect 10.2.2.54:443 -showcerts </dev/null 2>/dev/null
CONNECTED(00000003)
---
Certificate chain
 0 s:CN = thanos-fe-combined.discovery.wmnet
   i:CN = Puppet CA: palladium.eqiad.wmnet
-----BEGIN CERTIFICATE-----
MIIExDCCAqygAwIBAgICIwwwDQYJKoZIhvcNAQELBQAwKzEpMCcGA1UEAwwgUHVw
cGV0IENBOiBwYWxsYWRpdW0uZXFpYWQud21uZXQwHhcNMjIwMTMxMTI0MDE2WhcN
MjcwMTMxMTI0MDE2WjAtMSswKQYDVQQDDCJ0aGFub3MtZmUtY29tYmluZWQuZGlz
Y292ZXJ5LndtbmV0MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEc995xll27AS/
0Pp3yr3LBNDcpeGjNVnKj6d0C4ECKUF5B/G2SspiHmmSht2+hvlqjXN3gDTlBPUT
D7H5oDViA6OCAbkwggG1MDcGCWCGSAGG+EIBDQQqDChQdXBwZXQgUnVieS9PcGVu
U1NMIEludGVybmFsIENlcnRpZmljYXRlMIH5BgNVHREEgfEwge6CHHRoYW5vcy1x
dWVyeS5zdmMuZXFpYWQud21uZXSCHHRoYW5vcy1zd2lmdC5kaXNjb3Zlcnkud21u
ZXSCFHRoYW5vcy53aWtpbWVkaWEub3Jnghx0aGFub3Mtc3dpZnQuc3ZjLmNvZGZ3
LndtbmV0ghx0aGFub3MtcXVlcnkuZGlzY292ZXJ5LndtbmV0ghx0aGFub3MtcXVl
cnkuc3ZjLmNvZGZ3LndtbmV0ghx0aGFub3Mtc3dpZnQuc3ZjLmVxaWFkLndtbmV0
giJ0aGFub3MtZmUtY29tYmluZWQuZGlzY292ZXJ5LndtbmV0MAwGA1UdEwEB/wQC
MAAwHQYDVR0OBBYEFL6uRHRZCeIdIwLxbTLm30QtEnG+MB8GA1UdIwQYMBaAFFnk
hjB+Aq8NAKZ07Zr2DheubK66MA4GA1UdDwEB/wQEAwIFoDAgBgNVHSUBAf8EFjAU
BggrBgEFBQcDAQYIKwYBBQUHAwIwDQYJKoZIhvcNAQELBQADggIBADuEHZUy3fhw
J2kYuJY3Rz59EpErd2ePna9fjwfCO2uc2yUDM+yYvYRfMCU6efyWNwHn6PIeszjd
Ax1kRTERTLtepieRj8l3kB3QOFU2wU1H0XldElUZ0UnoRCDEAb3dT9jUHh85LuFi
wZEDo9EUd52Vza9kuPNV3tl/syGV3Dr6NLqQQ3buqsjJSp+p9VHyorjkzWkshMWj
xdT4fZ0EZJ8m50SjKCQT2mzQU8i0gwNEGI0PyfW6od06gvKfnmHfJoXSoWqwXpLj
PlA1FxH816dUiB2jZ3lq0paJL3gtm6IWO1K+8rH2QR4rFl24/PaDntXZ4tOborkC
Fa6fIn/+R/PosPMcglLSN3TVehfgwg8fqb7KgtHIl8Y5KE6MgXYjmlgr2RfhH6wG
Q+nPFtJCNIOwaBqz+htwSV8J6ejzgoDOCEXgwf5nQL1RZFjs/eb3vvzOf/7BK5f+
PfD9SQeOByfnsPu2qWzs+5pdujMxhdSWwwZTAEVdeiSMxPmcaLh+hCw/mUR9Fuvj
McbKx+pbLnuw9AMaAOS1gnFoG6IYTkOqRpRzXx++ywqSDWpiDh/QNjzWjk3a150h
bI+D8jrBQgyX1c9VxJ9fYBL8DRalkd1U74dk6B+Xubi89vshQ1rn5ucVVk4utqIP
Dr8lf5rcHbSBwUnMEHAkV2VOaT7jia3D
-----END CERTIFICATE-----
---
Server certificate
subject=CN = thanos-fe-combined.discovery.wmnet

issuer=CN = Puppet CA: palladium.eqiad.wmnet

---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1528 bytes and written 363 bytes
Verification error: unable to verify the first certificate
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 21 (unable to verify the first certificate)
---

If I run the same command on cumin1001 (puppet 5), it works:

mvernon@cumin1001:~$ openssl s_client -connect 10.2.2.54:443 -showcerts </dev/null 2>/dev/null
CONNECTED(00000003)
---
Certificate chain
 0 s:CN = thanos-fe-combined.discovery.wmnet
   i:CN = Puppet CA: palladium.eqiad.wmnet
-----BEGIN CERTIFICATE-----
MIIExDCCAqygAwIBAgICIwwwDQYJKoZIhvcNAQELBQAwKzEpMCcGA1UEAwwgUHVw
cGV0IENBOiBwYWxsYWRpdW0uZXFpYWQud21uZXQwHhcNMjIwMTMxMTI0MDE2WhcN
MjcwMTMxMTI0MDE2WjAtMSswKQYDVQQDDCJ0aGFub3MtZmUtY29tYmluZWQuZGlz
Y292ZXJ5LndtbmV0MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEc995xll27AS/
0Pp3yr3LBNDcpeGjNVnKj6d0C4ECKUF5B/G2SspiHmmSht2+hvlqjXN3gDTlBPUT
D7H5oDViA6OCAbkwggG1MDcGCWCGSAGG+EIBDQQqDChQdXBwZXQgUnVieS9PcGVu
U1NMIEludGVybmFsIENlcnRpZmljYXRlMIH5BgNVHREEgfEwge6CHHRoYW5vcy1x
dWVyeS5zdmMuZXFpYWQud21uZXSCHHRoYW5vcy1zd2lmdC5kaXNjb3Zlcnkud21u
ZXSCFHRoYW5vcy53aWtpbWVkaWEub3Jnghx0aGFub3Mtc3dpZnQuc3ZjLmNvZGZ3
LndtbmV0ghx0aGFub3MtcXVlcnkuZGlzY292ZXJ5LndtbmV0ghx0aGFub3MtcXVl
cnkuc3ZjLmNvZGZ3LndtbmV0ghx0aGFub3Mtc3dpZnQuc3ZjLmVxaWFkLndtbmV0
giJ0aGFub3MtZmUtY29tYmluZWQuZGlzY292ZXJ5LndtbmV0MAwGA1UdEwEB/wQC
MAAwHQYDVR0OBBYEFL6uRHRZCeIdIwLxbTLm30QtEnG+MB8GA1UdIwQYMBaAFFnk
hjB+Aq8NAKZ07Zr2DheubK66MA4GA1UdDwEB/wQEAwIFoDAgBgNVHSUBAf8EFjAU
BggrBgEFBQcDAQYIKwYBBQUHAwIwDQYJKoZIhvcNAQELBQADggIBADuEHZUy3fhw
J2kYuJY3Rz59EpErd2ePna9fjwfCO2uc2yUDM+yYvYRfMCU6efyWNwHn6PIeszjd
Ax1kRTERTLtepieRj8l3kB3QOFU2wU1H0XldElUZ0UnoRCDEAb3dT9jUHh85LuFi
wZEDo9EUd52Vza9kuPNV3tl/syGV3Dr6NLqQQ3buqsjJSp+p9VHyorjkzWkshMWj
xdT4fZ0EZJ8m50SjKCQT2mzQU8i0gwNEGI0PyfW6od06gvKfnmHfJoXSoWqwXpLj
PlA1FxH816dUiB2jZ3lq0paJL3gtm6IWO1K+8rH2QR4rFl24/PaDntXZ4tOborkC
Fa6fIn/+R/PosPMcglLSN3TVehfgwg8fqb7KgtHIl8Y5KE6MgXYjmlgr2RfhH6wG
Q+nPFtJCNIOwaBqz+htwSV8J6ejzgoDOCEXgwf5nQL1RZFjs/eb3vvzOf/7BK5f+
PfD9SQeOByfnsPu2qWzs+5pdujMxhdSWwwZTAEVdeiSMxPmcaLh+hCw/mUR9Fuvj
McbKx+pbLnuw9AMaAOS1gnFoG6IYTkOqRpRzXx++ywqSDWpiDh/QNjzWjk3a150h
bI+D8jrBQgyX1c9VxJ9fYBL8DRalkd1U74dk6B+Xubi89vshQ1rn5ucVVk4utqIP
Dr8lf5rcHbSBwUnMEHAkV2VOaT7jia3D
-----END CERTIFICATE-----
---
Server certificate
subject=CN = thanos-fe-combined.discovery.wmnet

issuer=CN = Puppet CA: palladium.eqiad.wmnet

---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1529 bytes and written 363 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 256 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

If I run the same command on cumin2002 (a puppet 7 host), I see the same error as on the thanos frontends; that combined with the timing leads me to conclude this is a puppet 7 issue...
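For comparing hosts without reading through the full s_client output, the same check can be sketched in Python, which is also the stack that raises the urllib error above. This is only a minimal sketch: the helper names make_context and check_tls are made up for illustration, and it assumes Python 3.7 or later for ssl.SSLCertVerificationError.

```python
import socket
import ssl


def make_context(cafile=None, capath=None):
    """Build a verifying client context; with no arguments, use the system defaults."""
    return ssl.create_default_context(cafile=cafile, capath=capath)


def check_tls(address, server_name, context, port=443, timeout=5.0):
    """Return (True, None) if the handshake verifies, else (False, reason)."""
    try:
        with socket.create_connection((address, port), timeout=timeout) as sock:
            # server_name is used for SNI and for hostname verification.
            with context.wrap_socket(sock, server_hostname=server_name):
                return True, None
    except ssl.SSLCertVerificationError as exc:
        # e.g. "unable to get local issuer certificate"
        return False, exc.verify_message
    except OSError as exc:
        # Connection refused, timeout, or any other TLS/socket error.
        return False, str(exc)
```

Running check_tls("10.2.2.54", "thanos-swift.svc.eqiad.wmnet", make_context()) on a puppet 5 host and on a puppet 7 host should then show whether the default trust store is what differs between them.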

Event Timeline

(priority set to high as we do use the swift-dispersion-stats to check for cluster health)

[it was suggested I add jbond to this task]

@MatthewVernon this is almost certainly something using the Puppet CA directly instead of using /etc/ssl/certs/wmf-ca-certificates.crt. I need to investigate a bit more why openssl is failing. Specifically,

why does this work:

openssl s_client -connect 10.2.2.54:443 -showcerts -CAfile /etc/ssl/certs/ca-certificates.crt </dev/null 2>/dev/null

but this does not:

openssl s_client -connect 10.2.2.54:443 -showcerts -CApath /etc/ssl/certs </dev/null 2>/dev/null

I think this is likely because the palladium certificate does not exist as a separate file in the latter directory on Puppet 7 hosts. I can look into this further tomorrow.
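For context on why a separate file matters: with -CApath, OpenSSL finds an issuer through entries named after the certificate's subject hash (for example 1a2b3c4d.0), normally symlinks created by `openssl rehash` or c_rehash, whereas -CAfile reads one bundle. A rough sketch for listing what those entries in a directory resolve to, and spotting any whose target is gone, could look like the following. The helper names are made up for illustration, and this only inspects file names; it does not compute certificate hashes or confirm the cause of this particular failure.

```python
import os
import re

# OpenSSL looks up issuers in a -CApath directory by subject-hash file name,
# e.g. "1a2b3c4d.0" (usually a symlink pointing at the real PEM file).
HASHED_NAME = re.compile(r"^[0-9a-f]{8}\.\d+$")


def hashed_ca_entries(capath):
    """Map each hash-named entry in capath to the real file it resolves to."""
    entries = {}
    for name in sorted(os.listdir(capath)):
        if HASHED_NAME.match(name):
            entries[name] = os.path.realpath(os.path.join(capath, name))
    return entries


def dangling_entries(capath):
    """Return hash-named entries whose target no longer exists."""
    return [
        name
        for name, target in hashed_ca_entries(capath).items()
        if not os.path.exists(target)
    ]
```

Comparing hashed_ca_entries("/etc/ssl/certs") between a puppet 5 host and a puppet 7 host would show whether a hash entry for the palladium CA is present on one and missing on the other.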

Change 975869 had a related patch set uploaded (by Jbond; author: jbond):

[operations/debs/wmf-certificates@main] Puppet_Internal_CA.pem: rename to Puppet5_Internal_CA.pem

https://gerrit.wikimedia.org/r/975869

Change 975869 merged by Jbond:

[operations/debs/wmf-certificates@main] Puppet_Internal_CA.pem: rename to Puppet5_Internal_CA.pem

https://gerrit.wikimedia.org/r/975869

I have rolled out a new wmf-certificates package, which I believe has fixed this error. All swift services on thanos-fe1001 are now started. Tentatively closing, but please reopen if I missed something.

@jbond thanks, that CR has fixed the sad services (and the openssl runes now work too).