Switch rsyslog from gtls to ossl
Closed, ResolvedPublic

Description

I've been working with @Southparkfan to support central logging on cloud-vps. We can't rely on puppet certs for TLS there since different VMs use different puppetmasters, so I've started to set up acme-chief certs for TLS from the client to the rsyslog receiver.

That is currently blocked by upstream bug https://github.com/rsyslog/rsyslog/issues/2547 which means that rsyslog can't use our acme-chief certs.

A valid fix seems to be to replace use of gtls with ossl. We can either do that fleet-wide (production and cloud-vps) or restrict the change to cloud-vps if there's a reason to avoid the change in production.

Event Timeline

...unless ossl isn't available on Buster in which case this is futile

Background: for T127717, we went with Let's Encrypt certificates. Unlike the rather simple chain of trust for the Puppet CA (leaf certificate -> root certificate (Puppet CA)), Let's Encrypt certificates have an intermediate certificate in between. 'Because TLS' (certificates are terrible, I know), the clients need to receive all certificates but the root certificate (because that is in /etc/ssl/certs/ca-certificates.crt).
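To illustrate why the missing intermediate matters, here is a minimal Python sketch of the chain-building step a client performs. It is a toy model only: certificates are reduced to (subject, issuer) name pairs, no signatures are checked, and the function name `chain_verifies` and the host names in the usage example are made up for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy certificate: (subject name, issuer name). No keys or signatures involved.
Cert = Tuple[str, str]


def chain_verifies(leaf: Cert, sent: Iterable[Cert], trusted_roots: Set[str]) -> bool:
    """Follow issuer links from the leaf until a locally trusted root is reached.

    `sent` holds the extra certificates the server included in the handshake.
    `trusted_roots` holds the subject names found in the local trust store.
    """
    by_subject: Dict[str, Cert] = {subject: (subject, issuer) for subject, issuer in sent}
    seen: Set[str] = set()
    current = leaf
    while True:
        _, issuer = current
        if issuer in trusted_roots:
            return True  # reached a certificate the client already trusts
        if issuer in seen or issuer not in by_subject:
            return False  # loop detected, or the issuer certificate was never sent
        seen.add(issuer)
        current = by_subject[issuer]


# Puppet CA style: the leaf is signed directly by the root, nothing else is needed.
print(chain_verifies(("host.example", "Puppet CA"), [], {"Puppet CA"}))
# Let's Encrypt style without the intermediate in the handshake.
print(chain_verifies(("syslogaudit", "R3"), [], {"ISRG Root X1"}))
# Let's Encrypt style with the intermediate sent along.
print(chain_verifies(("syslogaudit", "R3"), [("R3", "ISRG Root X1")], {"ISRG Root X1"}))
```

In this model the first and third calls succeed and the second fails, which mirrors the situation described here: a two-level chain works without any extra certificates, a three-level chain does not.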

For our acme-chief experiment, we set CAfile to ec-prime256v1.chained.crt (leaf + intermediate) and certfile to ec-prime256v1.crt (leaf). For the nitpickers among us: ignore the mismatch between -connect and the CN, this is not relevant yet.
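As a sanity check on the two files, the following Python sketch counts the PEM certificate blocks in a file. It assumes the chained file is a plain concatenation of PEM blocks; the helper names `count_pem_certs` and `count_pem_certs_in_file` are hypothetical. Under that assumption, ec-prime256v1.chained.crt should yield 2 and ec-prime256v1.crt should yield 1.

```python
import re

# Matches one PEM-encoded certificate block, including its BEGIN/END markers.
PEM_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def count_pem_certs(pem_text: str) -> int:
    """Return the number of PEM certificate blocks found in pem_text."""
    return len(PEM_CERT_RE.findall(pem_text))


def count_pem_certs_in_file(path: str) -> int:
    """Read a PEM file from disk and count the certificate blocks in it."""
    with open(path, encoding="ascii") as fh:
        return count_pem_certs(fh.read())
```

This only counts blocks; it does not check that the certificates are valid or that they are in the right order.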

Using gtls (gnutls) driver (broken):

root@syslog-server02:/srv/syslog/172.16.3.90# openssl s_client -connect syslog-serveraudit01.cloudinfra.eqiad1.wikimedia.cloud:6514 -CAfile /etc/ssl/certs/ca-certificates.crt -showcerts
CONNECTED(00000003)
depth=0 CN = syslogaudit.svc.eqiad1.wikimedia.cloud
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 CN = syslogaudit.svc.eqiad1.wikimedia.cloud
verify error:num=21:unable to verify the first certificate
verify return:1
depth=0 CN = syslogaudit.svc.eqiad1.wikimedia.cloud
verify return:1
---
Certificate chain
 0 s:CN = syslogaudit.svc.eqiad1.wikimedia.cloud
   i:C = US, O = Let's Encrypt, CN = R3
[cut]
---
Server certificate
subject=CN = syslogaudit.svc.eqiad1.wikimedia.cloud

issuer=C = US, O = Let's Encrypt, CN = R3

---
Acceptable client certificate CA names
CN = syslogaudit.svc.eqiad1.wikimedia.cloud
C = US, O = Let's Encrypt, CN = R3
C = US, O = Internet Security Research Group, CN = ISRG Root X1
Client Certificate Types: RSA sign, DSA sign, ECDSA sign
Requested Signature Algorithms: RSA+SHA256:RSA-PSS+SHA256:RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA+SHA384:RSA-PSS+SHA384:RSA-PSS+SHA384:ECDSA+SHA384:Ed448:RSA+SHA512:RSA-PSS+SHA512:RSA-PSS+SHA512:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1
Shared Requested Signature Algorithms: RSA+SHA256:RSA-PSS+SHA256:RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA+SHA384:RSA-PSS+SHA384:RSA-PSS+SHA384:ECDSA+SHA384:Ed448:RSA+SHA512:RSA-PSS+SHA512:RSA-PSS+SHA512:ECDSA+SHA512
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1776 bytes and written 452 bytes
Verification error: unable to verify the first certificate

Only the leaf certificate (for syslogaudit.svc.eqiad1.wikimedia.cloud) is sent. The R3 certificate (intermediate) is missing, however.

Using ossl (openssl) driver (OK):

root@syslog-server02:/srv/syslog/172.16.3.90# openssl s_client -connect syslog-server-audit01.cloudinfra.eqiad1.wikimedia.cloud:6514 -CAfile /etc/ssl/certs/ca-certificates.crt -showcerts
CONNECTED(00000003)
depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = R3
verify return:1
depth=0 CN = syslogaudit.svc.eqiad1.wikimedia.cloud
verify return:1
---
Certificate chain
 0 s:CN = syslogaudit.svc.eqiad1.wikimedia.cloud
   i:C = US, O = Let's Encrypt, CN = R3
[cut]
 1 s:C = US, O = Let's Encrypt, CN = R3
   i:C = US, O = Internet Security Research Group, CN = ISRG Root X1
[cut]
 2 s:C = US, O = Internet Security Research Group, CN = ISRG Root X1
   i:O = Digital Signature Trust Co., CN = DST Root CA X3
[cut]
---
Server certificate
subject=CN = syslogaudit.svc.eqiad1.wikimedia.cloud

issuer=C = US, O = Let's Encrypt, CN = R3

---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: ECDSA
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 4313 bytes and written 427 bytes
Verification: OK
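The difference between the two transcripts can also be checked mechanically. The Python sketch below extracts the subjects listed under "Certificate chain" from saved `openssl s_client -showcerts` output and reports whether more than the leaf was sent. The function names are hypothetical, and the parsing assumes the ` 0 s:...` layout shown in the transcripts above.

```python
import re
from typing import List

# A chain entry looks like " 0 s:CN = example"; the matching issuer line starts with "i:".
CHAIN_ENTRY_RE = re.compile(r"^[ \t]*(\d+) s:(.+)$", re.MULTILINE)


def chain_subjects(s_client_output: str) -> List[str]:
    """Return the subject of every certificate listed in the chain section."""
    return [m.group(2).strip() for m in CHAIN_ENTRY_RE.finditer(s_client_output)]


def sends_intermediate(s_client_output: str) -> bool:
    """True if the server sent at least one certificate besides the leaf."""
    return len(chain_subjects(s_client_output)) > 1
```

Fed the gtls transcript, `chain_subjects` finds only the leaf entry; fed the ossl transcript, it finds three entries (leaf, R3, ISRG Root X1).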

As I mentioned earlier, because the Puppet CA (used in Wikimedia production, instead of relying on Let's Encrypt) issues leaf certificates signed directly by the root certificate, the gtls driver works fine there. If production decides to switch to the cfssl PKI, this will break too. Actually, using any certificate authority that uses intermediate certificates will cause issues.

Unfortunately, rsyslog-openssl is only available in bullseye and later. Furthermore, because rsyslog also runs on the syslog clients, sending syslog messages to the syslog servers with the ossl driver requires rsyslog-openssl to be installed on both(!) the syslog servers and the syslog clients. Given that not all syslog clients run on Debian Bullseye or later, I'm wondering if we can backport rsyslog-openssl to buster. Stretch may need a backport too, if that is still in use...

I can prepare a build of an openssl-linked rsyslog for buster, then we install this via a component/rsyslog-ossl on all Buster Cloud VPS hosts? We're looking to retire Buster in production by September 2023 (https://wikitech.wikimedia.org/wiki/Operating_system_upgrade_policy#Current_effective_sunset_dates_for_each_distro_(rounded_to_quarters)) so I'd rather avoid a migration within production.

I'm generally in favor of switching to openssl for rsyslog (and thank you for the deep-dive investigation!). Since things are working in production, I'm more inclined to wait for Buster to be phased out and then switch. HTH!

Change 865602 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add component/rsyslog-openssl for Buster

https://gerrit.wikimedia.org/r/865602

Change 865602 merged by Muehlenhoff:

[operations/puppet@production] Add component/rsyslog-openssl for Buster

https://gerrit.wikimedia.org/r/865602

Mentioned in SAL (#wikimedia-operations) [2022-12-07T11:57:51Z] <moritzm> imported librelp 1.10.0-1~buster1 to component/rsyslog-openssl T324623

Mentioned in SAL (#wikimedia-operations) [2022-12-07T11:59:27Z] <moritzm> imported rsyslog 8.2102.0-2+deb11u1~buster1 to component/rsyslog-openssl T324623

This turned out to be a little more complicated than initially assumed. I've now built a backport of the version that is in Bullseye, which offers the rsyslog-openssl binary package. (This is also useful because, in the case of a major security issue in rsyslog, it allows us to respin the Buster build easily.) The backport also needed an updated librelp, which is included in the component.

@Southparkfan, @Andrew : I'd say please test that the packages work for you and then we can add the component in profile::base::labs for Buster VMs.

Wow, instant gratification! Thank you @MoritzMuehlenhoff, I will test.

Change 865731 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] rsyslog: add support for openssl netstream driver

https://gerrit.wikimedia.org/r/865731

I have tested https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731 by using rsyslog-openssl on one syslog client and one syslog server running buster + one syslog client and one syslog server running bullseye. All works as expected.

Change 865731 merged by Andrew Bogott:

[operations/puppet@production] rsyslog: add support for openssl netstream driver

https://gerrit.wikimedia.org/r/865731

Change 868148 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/puppet@production] rsyslog: use ensure_resource for package_from_component

https://gerrit.wikimedia.org/r/868148

Change 868148 merged by Jbond:

[operations/puppet@production] rsyslog: use ensure_resource for package_from_component

https://gerrit.wikimedia.org/r/868148

akosiaris subscribed.

Removing Sustainability (Incident Followup), since I fail to find an action item that fits this project's description: "Action items that came out of the investigation and documentation for past Wikimedia production incidents. These action items reduce risk, shorten/reduce impact, or help prevent incidents in the future."

As part of the Puppet migration we already switched all Buster clients (where the version of GnuTLS had problems with the new cert) to OpenSSL; more details are in the task Southparkfan linked.

Change 975791 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] base: switch rsyslog tls_netstream_driver to ossl

https://gerrit.wikimedia.org/r/975791

Reading the task, it seems the last blocker was to "wait out buster" (T324623#8449852). However, since we have now deployed this to buster (T324623#9334403), it seems we can move ahead. Are there any concerns about making this change? It seems fairly simple.

Change 975861 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] centrallog: update tls_netstream_driver to use ossl

https://gerrit.wikimedia.org/r/975861

Change 975791 merged by Jbond:

[operations/puppet@production] base: switch rsyslog tls_netstream_driver to ossl

https://gerrit.wikimedia.org/r/975791

Change 975861 merged by Jbond:

[operations/puppet@production] centrallog: update tls_netstream_driver to use ossl

https://gerrit.wikimedia.org/r/975861

Change 976190 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] centralserver: remove tls_remedy

https://gerrit.wikimedia.org/r/976190

Change 976190 merged by Filippo Giunchedi:

[operations/puppet@production] centralserver: remove tls_remedy

https://gerrit.wikimedia.org/r/976190

jbond claimed this task.

All systems have now been migrated to ossl.