syslog tls clients failing to connect to centrallog2002 post puppet7 migration
Closed, Resolved, Public

Description

centrallog2002 was moved to puppet7 and therefore received new certificates; end hosts are now failing to validate the new cert, for example:

Nov 14 09:33:26 mw2392 rsyslogd[48013]: not permitted to talk to peer, certificate invalid: signer not found [v8.1901.0]
Nov 14 09:33:26 mw2392 rsyslogd[48013]: invalid cert info: peer provided 3 certificate(s). Certificate 1 info: certificate valid from Sun Nov 12 12:37:08 2023 to Sat Nov 11 12:37:08 2028; Certificate public key: RSA; DN: CN=centrallog2002.codfw.wmnet; Issuer DN: C=US,L=San Francisco,O=Wikimedia Foundation\, Inc,OU=SRE Foundations,CN=puppet_rsa; SAN:DNSname: centrallog2002.codfw.wmnet;  [v8.1901.0]

Event Timeline

Some additional information

  • puppet7 agents can talk to both centrallog1002 and centrallog2002, meaning that the puppet7 agents trust both CAs
  • OpenSSL using the same certs from a puppet5 agent also seems to work
  • I performed a test removing CN = Wikimedia_Internal_Root_CA from the chain file on centrallog2002 and had the same issue
$ sudo openssl s_client -verify_return_error -connect 10.192.16.35:6514 -cert /etc/rsyslog/ssl/cert.pem -key /etc/rsyslog/ssl/server.key -CAfile /etc/ssl/certs/wmf-ca-certificates.crt
CONNECTED(00000003)
Can't use SSL_get_servername
depth=2 C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
verify return:1
depth=1 C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
verify return:1
depth=0 CN = centrallog2002.codfw.wmnet
verify return:1
---
Certificate chain
 0 s:CN = centrallog2002.codfw.wmnet
   i:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
 1 s:C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
 2 s:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
   i:C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIGtzCCBJ+gAwIBAgICAYUwDQYJKoZIhvcNAQELBQAweDELMAkGA1UEBhMCVVMx
FjAUBgNVBAcMDVNhbiBGcmFuY2lzY28xIjAgBgNVBAoMGVdpa2ltZWRpYSBGb3Vu
ZGF0aW9uLCBJbmMxGDAWBgNVBAsMD1NSRSBGb3VuZGF0aW9uczETMBEGA1UEAwwK
cHVwcGV0X3JzYTAeFw0yMzExMTIxMjM3MDhaFw0yODExMTExMjM3MDhaMCUxIzAh
BgNVBAMMGmNlbnRyYWxsb2cyMDAyLmNvZGZ3LndtbmV0MIICIjANBgkqhkiG9w0B
AQEFAAOCAg8AMIICCgKCAgEAtBuUHZ74xeW0V/mxz+a1UPUAtm09NzyOL+doGMqw
9JLDcv1DNGomEto8pcOUA/Zeem4QrBF4x2vhVTmjoTmuH3z2dDmcOQpwVfAHhRYr
t9NRZ3VhKsMfhcS1+SASAYgSTxGw4ErzW9pRR6A9z/oGt/znG2NKx5WOB4cLtyM+
iwsUzsQnYKI1dbs7Hdk74+a+pQEMQX7oSfdTTuWyu99r6/ICa/knMDu7JXjsnaMr
MQ1qApFwkVfZWVoSirwE6qZx2inghcRwR49p0/LTBfslSM/jWI0FDiJHosiAdyjQ
gea5cZxnE1pIPwvqQJ1YxlcyVNndbgLt7Gle4zzD/bodAFSSJi/hhem6eNhHY0/f
vckOFjnfak6CFlfbrmpqaPRvqxsmtUY85rHeZBM2ZbSNpMKIe4B78IPAMfKivY1x
PfSKkxv9G3LrBf11pdCOlAiSb+atHplHevTX77RTknJ/qJwKo3aRyZ71LkWMJ0mY
O6SrMS33MfxkDqjlpG452mvAS0EQWhGjtCmLhDY4pErbrFB8nmkbCqBcb2XR7J8g
S9yUcIUR87rKw+furD+dzpS6dnKkvs7VIiaBLzU3PzVo9HBejFqmDwQvBteMLRv+
I9wrg8Ia0rHvn3jPYbfXen5aC6dXX929aAdIT1nYyB0394q0ni3jTKnPbUa9jDFl
eRECAwEAAaOCAZwwggGYMCUGA1UdEQQeMByCGmNlbnRyYWxsb2cyMDAyLmNvZGZ3
LndtbmV0MDEGCWCGSAGG+EIBDQQkFiJQdXBwZXQgU2VydmVyIEludGVybmFsIENl
cnRpZmljYXRlMIHcBgNVHSMEgdQwgdGAFGur9kcEqXw5kZCtAKJSrgZFzzpgoYGi
pIGfMIGcMQswCQYDVQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTEWMBQGA1UE
BxMNU2FuIEZyYW5jaXNjbzEiMCAGA1UEChMZV2lraW1lZGlhIEZvdW5kYXRpb24s
IEluYzEXMBUGA1UECxMOQ2xvdWQgU2VydmljZXMxIzAhBgNVBAMMGldpa2ltZWRp
YV9JbnRlcm5hbF9Sb290X0NBghQ5H5kAW2vMspimEMgYxr14Xo7QsjAdBgNVHQ4E
FgQUhkP+ApTk4dfSl0CG6B88iXiDYAowDAYDVR0TAQH/BAIwADAgBgNVHSUBAf8E
FjAUBggrBgEFBQcDAQYIKwYBBQUHAwIwDgYDVR0PAQH/BAQDAgWgMA0GCSqGSIb3
DQEBCwUAA4ICAQCcU2VvIjbfOZuJAr+CpDUADJx7Jj2YGaITOsmALortteHwHF75
Vl6CI8/ViwGgY3HDsrg/VD+uuirfWGtPw2SO94yGX6551lYfFvSsYkD7Fdl20wyr
ZTYiNiFLFZQBDl+A/0QnS6qTvxCQUr1NLXoWbMfTvU6SNQQZyTrJ0rwxn70fAMMS
0BEmKS8XW16kydPvzMUMIlc5hhKtY9cnbATkTxM1Mn0m2vMfK8UO+fNmEtQaH1R6
DjdR4xUG1rZVGE9Mz7QDxphX+LD4njSJ8o05PMt++nimvjT5lVUZPSXzAcxmUvi+
BGdmoSuKS1GAZ7hyPnhP2HfNU9zEskRhg1bqrPJ04U/wLwogwj8CmQ3J/m30X+e9
XlgXMG+2MqBvL0rvNtCeTyXGKOZgPSd4k1GpAKNQPT+vXmWTDiKWtXLuM8GgeWhp
rPE9XmWaF7P+hjLCp6efcuqK1eaawcFV0InMQW9MayCGj+gdD2fnzENOzAcMF48L
ziX5B726J01W1QIAMg34Kh5nUYH5qCsrOKxVBSQK1fdbu9mKZb7BZ2kAGdhjTABL
t754xRQcnrOfej6IaH7mxcC0wT9+yb3F9D5LZ8uHJU/yGXSumClFMtDDPLjdMe+o
fQRD7gzYRfOug9YFjZYBnF4Imp/LNMWGgN3yGlExA0HR3gOFJjiO7pbzVQ==
-----END CERTIFICATE-----
subject=CN = centrallog2002.codfw.wmnet

issuer=C = US, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = SRE Foundations, CN = puppet_rsa

---
Acceptable client certificate CA names
CN = Puppet CA: palladium.eqiad.wmnet
C = US, ST = California, L = San Francisco, O = "Wikimedia Foundation, Inc", OU = Cloud Services, CN = Wikimedia_Internal_Root_CA
Requested Signature Algorithms: RSA+SHA256:RSA-PSS+SHA256:RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA+SHA384:RSA-PSS+SHA384:RSA-PSS+SHA384:ECDSA+SHA384:Ed448:RSA+SHA512:RSA-PSS+SHA512:RSA-PSS+SHA512:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA+SHA256:RSA-PSS+SHA256:RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA+SHA384:RSA-PSS+SHA384:RSA-PSS+SHA384:ECDSA+SHA384:Ed448:RSA+SHA512:RSA-PSS+SHA512:RSA-PSS+SHA512:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 4944 bytes and written 3737 bytes
Verification: OK
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 4096 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---

From a very simple test this appears to only affect buster:

# in the following, every host succeeds (i.e. the message exists)
$ sudo cumin 'A:buster' "grep -q 'rsyslogd: invalid cert inf' /var/log/syslog"
# in the following, every host fails (i.e. the message does not exist)
$ sudo cumin 'A:bullseye or A:bookworm' "grep -q 'rsyslogd: invalid cert inf' /var/log/syslog"

Feels like this could be related to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=887637 (https://github.com/rsyslog/rsyslog/issues/2762). That issue is about the server not sending the intermediate, but I wonder if the same issue means the client doesn't read the sent intermediate.
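
The chain-building question above can be illustrated offline. This is a minimal sketch, assuming only that the `openssl` CLI is installed: it builds a throwaway three-level chain (root, intermediate, leaf; all names are hypothetical and unrelated to the WMF CAs) and checks that a verifier trusting only the root accepts the leaf only when it also uses the intermediate supplied alongside it. It uses OpenSSL rather than GnuTLS, so it shows the suspected failure mode in general terms and does not reproduce the rsyslog behaviour itself.

```shell
#!/bin/sh
# Sketch: verify a leaf cert against a root-only trust store, with and
# without the intermediate. All names below are hypothetical test values.
set -eu
work=$(mktemp -d)
cd "$work"

# Minimal config so we do not depend on the system openssl.cnf.
cat > ca.cnf <<'EOF'
[req]
distinguished_name = dn
prompt = no
[dn]
CN = placeholder
[v3_ca]
basicConstraints = critical,CA:TRUE
EOF

# Self-signed root CA.
openssl req -x509 -config ca.cnf -extensions v3_ca -newkey rsa:2048 -nodes \
  -subj "/CN=Example Root CA" -days 1 -keyout root.key -out root.pem 2>/dev/null

# Intermediate CA signed by the root.
openssl req -new -config ca.cnf -newkey rsa:2048 -nodes \
  -subj "/CN=Example Intermediate CA" -keyout int.key -out int.csr 2>/dev/null
openssl x509 -req -in int.csr -CA root.pem -CAkey root.key -CAcreateserial \
  -extfile ca.cnf -extensions v3_ca -days 1 -out int.pem 2>/dev/null

# Leaf signed by the intermediate.
openssl req -new -config ca.cnf -newkey rsa:2048 -nodes \
  -subj "/CN=leaf.example" -keyout leaf.key -out leaf.csr 2>/dev/null
openssl x509 -req -in leaf.csr -CA int.pem -CAkey int.key -CAcreateserial \
  -days 1 -out leaf.pem 2>/dev/null

# Root only: the verifier cannot find the leaf's signer.
if openssl verify -CAfile root.pem leaf.pem >/dev/null 2>&1; then
  without_intermediate=ok
else
  without_intermediate=failed
fi

# Root plus the peer-supplied intermediate: the chain can be built.
if openssl verify -CAfile root.pem -untrusted int.pem leaf.pem >/dev/null 2>&1; then
  with_intermediate=ok
else
  with_intermediate=failed
fi

echo "without intermediate: $without_intermediate"
echo "with intermediate: $with_intermediate"
```

A client that ignores the intermediate sent by the server is in the first situation, which would be consistent with the "signer not found" message in the description.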

edit: or possibly this one https://github.com/rsyslog/rsyslog/issues/4035

OK, I don't think it's this, as we still have SSL_set_verify_depth(pThis->ssl, 4); in the buster packages

I can confirm that e.g. bookworm hosts are sending syslog fine, e.g. titan1002:

centrallog2002:~$ tail -5 /srv/syslog/titan1002/syslog.log
Nov 14 11:16:06 titan1002 systemd[1]: prometheus_puppet_agent_stats.service: Consumed 1.603s CPU time.
Nov 14 11:16:14 titan1002 sshd[623763]: Connection from 208.80.154.88 port 40210 on 10.64.48.167 port 22 rdomain ""
Nov 14 11:16:14 titan1002 sshd[623763]: Connection closed by 208.80.154.88 port 40210 [preauth]
Nov 14 11:16:19 titan1002 sshd[623782]: Connection from 208.80.153.84 port 42646 on 10.64.48.167 port 22 rdomain ""
Nov 14 11:16:19 titan1002 sshd[623782]: Connection closed by 208.80.153.84 port 42646 [preauth]

Ditto bullseye:

centrallog2002:~$ tail -5 /srv/syslog/thanos-fe1001/syslog.log
Nov 14 11:18:22 thanos-fe1001 systemd[1]: prometheus-debian-version-textfile.service: Succeeded.
Nov 14 11:18:22 thanos-fe1001 systemd[1]: Finished Update Debian version stat exported by node_exporter.
Nov 14 11:18:39 thanos-fe1001 systemd[1]: Starting Update NIC firmware stats exported by node_exporter...
Nov 14 11:18:39 thanos-fe1001 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded.
Nov 14 11:18:39 thanos-fe1001 systemd[1]: Finished Update NIC firmware stats exported by node_exporter.

Confirming this is not the issue: I rebuilt rsyslog without SSL_set_verify_depth and we hit the same issue. Trying a backport from bullseye.

jbond triaged this task as High priority. Nov 14 2023, 4:13 PM

Well, I have updated apt1001 to 8.2102.0-2~deb10u1 and I still see the problem, so that would suggest it's not an issue with rsyslog :/. Perhaps a different option would be to push ahead with T347565, however I fear we may hit the same issue.

It's probably rather an issue in GnuTLS? rsyslog 8.2102 has support for OpenSSL (via the rsyslog-openssl package); we could maybe try that?

Another option is to be bold and simply exempt Buster hosts from central log collection until they are reimaged. Every Buster host is almost half a year behind its expected migration due date defined in our OS update policy, and the additional effort to keep them working needs to be weighed against the usefulness of having the logs in centrallog.

Change 974509 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] apt1001: use ossl for rsyslog

https://gerrit.wikimedia.org/r/974509

Change 974509 merged by Jbond:

[operations/puppet@production] apt1001: use ossl for rsyslog

https://gerrit.wikimedia.org/r/974509

I have tested using OpenSSL and that works, so I'll prepare a patch to switch all buster hosts to OpenSSL.

Change 974520 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] remote_syslog: force rsyslog-openssl on buster

https://gerrit.wikimedia.org/r/974520

Change 974520 merged by Jbond:

[operations/puppet@production] remote_syslog: force rsyslog-openssl on buster

https://gerrit.wikimedia.org/r/974520

jbond claimed this task.

I have rolled out a change so that buster machines use OpenSSL, which seems to have fixed the issue. Please reopen if you see other problems.
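
For readers unfamiliar with the driver switch, here is a minimal sketch of what a client-side rsyslog configuration using the OpenSSL netstream driver might look like. The actual content of the Puppet changes is not shown in this task, so everything here is an assumption: the file layout, the auth mode, the permitted peer, and the reuse of the certificate paths and target host from the `openssl s_client` test earlier in the thread. The `ossl` driver itself comes from the rsyslog-openssl package.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical rsyslog client config to a temp file.
# The real Puppet-managed config is not shown in this task.
set -eu
conf=$(mktemp)

cat > "$conf" <<'EOF'
# Use the OpenSSL netstream driver ("ossl") instead of GnuTLS ("gtls").
global(
  DefaultNetstreamDriver="ossl"
  DefaultNetstreamDriverCAFile="/etc/ssl/certs/wmf-ca-certificates.crt"
  DefaultNetstreamDriverCertFile="/etc/rsyslog/ssl/cert.pem"
  DefaultNetstreamDriverKeyFile="/etc/rsyslog/ssl/server.key"
)

# Forward everything over TLS; peer name and auth mode are assumptions.
action(
  type="omfwd"
  target="centrallog2002.codfw.wmnet"
  port="6514"
  protocol="tcp"
  StreamDriver="ossl"
  StreamDriverMode="1"
  StreamDriverAuthMode="x509/name"
  StreamDriverPermittedPeers="centrallog2002.codfw.wmnet"
)
EOF

echo "wrote $conf"
# On a host with rsyslog and rsyslog-openssl installed, the syntax could be
# checked with: rsyslogd -N1 -f "$conf"
```

The only change relative to a GnuTLS setup in this sketch is the driver name; the certificate, key and CA file parameters stay the same.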

Thank you for looking into this and fixing the issue. I can confirm the errors I'm seeing now on centrallog2002 are related to probes from monitoring, which is expected.

I'll also open a separate task to eventually move Bullseye and Bookworm hosts to OpenSSL as well; the less we use GnuTLS the better.

Production migration from the gnutls driver to the openssl driver can be tracked in T324623.